Deep Learning
Artificial Intelligence has come a long way in bridging the gap between the capabilities of humans and machines. Researchers and practitioners around the globe work on numerous aspects of AI and turn visions into reality, and one such remarkable area is the domain of Computer Vision. This field aims to enable machines to view the world as humans do and to use that knowledge for several tasks and processes (such as image recognition, image analysis, and classification). The advances that Deep Learning has brought to Computer Vision, particularly through the Convolutional Neural Network algorithm, have been a considerable success.
Introduction to CNN
Yann LeCun, who went on to direct Facebook’s AI Research group, pioneered convolutional neural networks. In 1989, he built the first one, LeNet, which was used for character recognition tasks such as reading zip codes and digits.
Have you ever wondered how facial recognition works on social media, how object detection helps in building self-driving cars, or how disease detection is done using visual imagery in healthcare? It’s all possible thanks to convolutional neural networks (CNNs). Here’s an example that illustrates how they work:
Imagine there’s an image of a bird, and you want to identify whether it’s really a bird or some other object. The first thing you do is feed the pixels of the image, in the form of arrays, to the input layer of the neural network (multi-layer networks are used to classify things). The hidden layers carry out feature extraction by performing different calculations and manipulations. There are multiple hidden layers, such as the convolution layer, the ReLU layer, and the pooling layer, that perform feature extraction from the image. Finally, there’s a fully connected layer that identifies the object in the image.
A convolutional neural network is a feed-forward neural network that is generally used to analyze visual images by processing data with a grid-like topology. It’s also known as a ConvNet. A convolutional neural network is used to detect and classify objects in an image.
[Figure: a neural network that identifies two types of flowers, orchid and rose.]
a = [5,3,7,5,9,7]
b = [1,2,3]
In the convolution operation, the arrays are multiplied element-wise, and the products are summed to create a new array, which represents a*b.
The first three elements of array a are multiplied with the elements of array b, and the products are summed to get the first result; then the window slides, the next three elements of a are multiplied by the elements of b, and the products are summed again, as in the sketch below.
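A minimal NumPy sketch of this sliding dot product (stride 1; the values follow a and b above):

```python
import numpy as np

a = np.array([5, 3, 7, 5, 9, 7])
b = np.array([1, 2, 3])

# Slide b over a; at each position, multiply element-wise and sum.
# (Deep learning libraries implement this cross-correlation and call it
# convolution; a strict convolution would flip b first.)
result = np.array([np.sum(a[i:i + len(b)] * b)
                   for i in range(len(a) - len(b) + 1)])
print(result)  # [32 32 44 44]
```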
[Figure: binary images of a backslash (\) and a forward slash (/); colored boxes represent a pixel value of 1, and uncolored boxes a value of 0. When each image is processed, only the units corresponding to pixels with value 1 are activated.]
A convolutional neural network has multiple hidden layers that help in extracting information from an image. The important layers in a CNN are:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Flattening
5. Output layer
Convolution Layer
This is the first step in the process of extracting valuable features from an image. A convolution layer has several filters that
perform the convolution operation. Every image is considered as a matrix of pixel values.
Consider the following 5x5 image whose pixel values are either 0 or 1, along with a filter matrix of dimension 3x3. Slide the filter matrix over the image and compute the dot product at each position to get the convolved feature matrix, as in the sketch below.
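A minimal NumPy sketch of this operation; the 0/1 image and the 3x3 filter values below are illustrative assumptions:

```python
import numpy as np

# A 5x5 binary image and a 3x3 filter, as described above.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

h, w = kernel.shape
out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1), dtype=int)
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        # Dot product between the filter and the image patch under it.
        out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
print(out)  # [[4 3 4]
            #  [2 4 3]
            #  [2 3 4]]  -> the 3x3 convolved feature map
```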
ReLU layer
ReLU stands for the rectified linear unit. Once the feature maps are extracted, the next step is to move them to a ReLU
layer.
ReLU performs an element-wise operation and sets all negative pixel values to 0, computing f(x) = max(0, x). It introduces non-linearity to the network, and the generated output is a rectified feature map.
The original image is scanned with multiple convolutions and ReLU layers for locating the features.
Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The rectified feature map now
goes through a pooling layer to generate a pooled feature map.
The convolution filters in earlier layers identify different parts of the image, such as edges, corners, body, feathers, eyes, and beak; the pooling layer then summarizes the feature maps they produce, as in the sketch below.
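A small NumPy sketch of 2x2 max pooling with stride 2, using an illustrative 4x4 rectified feature map:

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 8, 1],
                        [3, 4, 9, 5]])

# 2x2 max pooling with stride 2 halves each spatial dimension by
# keeping only the largest value in each 2x2 block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 4]
               #  [7 9]]
```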
Here’s how the structure of the convolution neural network looks so far:
The next step in the process is called flattening. Flattening is used to convert all the resultant 2-Dimensional arrays from
pooled feature maps into a single long continuous linear vector.
The flattened matrix is fed as input to the fully connected layer to classify the image.
Here’s how exactly a CNN recognizes a bird:
1. The pixels from the image are fed to the convolutional layer, which performs the convolution operation.
2. The convolved map is passed through a ReLU function to generate a rectified feature map.
3. The image is processed with multiple convolution and ReLU layers for locating the features.
4. Different pooling layers with various filters are used to identify specific parts of the image.
5. The pooled feature map is flattened and fed to a fully connected layer to get the final output.
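This whole pipeline can be expressed compactly; below is a minimal PyTorch sketch, with layer sizes and a two-class ("bird" / "not bird") output chosen purely for illustration:

```python
import torch
import torch.nn as nn

# Convolution -> ReLU -> pooling -> flattening -> fully connected layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # feature extraction
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # down-sampling
    nn.Flatten(),                                # 2-D maps -> 1-D vector
    nn.Linear(16 * 16 * 16, 2),                  # classification head
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
logits = model(x)
print(logits.shape)             # torch.Size([1, 2])
```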
Activation Layer
The activation layer introduces nonlinearity into the network by applying an activation function to the output of the previous
layer. This is crucial for the network to learn complex patterns. Common activation functions, such as ReLU, Tanh, and
Leaky ReLU, transform the input while keeping the output size unchanged.
Flattening
After the convolution and pooling operations, the feature maps still exist in a multi-dimensional format. Flattening converts
these feature maps into a one-dimensional vector. This process is essential because it prepares the data to be passed into fully
connected layers for classification or regression tasks.
Output Layer
In the output layer, the final result from the fully connected layers is processed through a logistic function, such as sigmoid
or softmax. These functions convert the raw scores into probability distributions, enabling the model to predict the most
likely class label.
Introduction to Convolutional Neural Networks: A Convolutional Neural Network (CNN) is a type of Deep Learning neural network architecture commonly used in Computer Vision. Computer vision is a field of Artificial Intelligence that enables a computer to understand and interpret images and other visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really well. Neural networks are applied to various kinds of data, such as images, audio, and text. Different types of neural networks are used for different purposes: for example, for predicting a sequence of words we use Recurrent Neural Networks (more precisely, an LSTM), and similarly for image classification we use Convolutional Neural Networks. In this blog, we are going to build the basic building block of a CNN.
1. Input Layers: It’s the layer in which we give input to our model. The number of neurons in this layer is equal to
the total number of features in our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be many hidden layers, depending on our model and data size. Each hidden layer can have a different number of neurons, generally greater than the number of features. The output of each layer is computed by matrix multiplication of the previous layer's output with the learnable weights of that layer, adding learnable biases, and then applying an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid or softmax which
converts the output of each class into the probability score of each class.
Feeding the data into the model and obtaining the output from each layer as described above is called feedforward. We then calculate the error using an error function; some common error functions are cross-entropy and squared loss. The error function measures how well the network is performing. After that, we backpropagate through the model by calculating the derivatives. This step is called backpropagation, and it is used to minimize the loss.
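A minimal PyTorch sketch of this feedforward/backpropagation cycle, with a hypothetical toy network (4 features, 8 hidden neurons, 3 classes):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
loss_fn = nn.CrossEntropyLoss()              # a common error function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 4)                       # a batch of 16 examples
y = torch.randint(0, 3, (16,))               # their class labels

optimizer.zero_grad()
logits = model(x)                            # feedforward
loss = loss_fn(logits, y)                    # compute the error
loss.backward()                              # backpropagation: derivatives
optimizer.step()                             # update weights to reduce loss
```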
The Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN), predominantly used to extract features from grid-like matrix datasets, for example visual datasets such as images or videos, where data patterns play an extensive role.
CNN architecture
A Convolutional Neural Network consists of multiple layers: the input layer, convolutional layers, pooling layers, and fully connected layers.
The Convolutional layer applies filters to the input image to extract features, the Pooling layer downsamples the image to
reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through
backpropagation and gradient descent.
Convolutional Neural Networks, or ConvNets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a height (the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, represented vertically. Now slide that neural network across the whole image; as a result, we get another image, with a different width, height, and depth. Instead of just R, G, and B channels we now have more channels, but smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as that of the input volume (3 if the input is an image).
For example, if we have to run a convolution on an image with dimensions 34x34x3, the possible size of the filters is a×a×3, where ‘a’ can be 3, 5, or 7, but small compared to the image dimension.
During the forward pass, we slide each filter across the whole input volume step by step, where each step is called a stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and compute the dot product between the kernel weights and the patch from the input volume.
As we slide our filters, we get a 2-D output for each filter; stacking them together, we get an output volume with a depth equal to the number of filters. The network will learn all the filters. A small helper for computing the resulting spatial size is sketched below.
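The output size follows the standard formula (n − f + 2p)/s + 1, where n is the input size, f the filter size, p the padding, and s the stride; the helper below (an illustrative sketch, not from the original text) confirms the numbers:

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n - f + 2p)/s) + 1."""
    return (n - f + 2 * padding) // stride + 1

# A 34x34x3 image convolved with a 5x5x3 filter, stride 1, no padding:
print(conv_output_size(34, 5))            # 30
# The same filter with stride 2:
print(conv_output_size(34, 5, stride=2))  # 15
```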
A complete convolutional neural network architecture is also known as a ConvNet: a sequence of layers, where every layer transforms one volume into another through a differentiable function.
Types of layers:
Let’s take an example by running a ConvNet on an image of dimension 32 x 32 x 3.
Input Layers: It’s the layer in which we give input to our model. In a CNN, the input will generally be an image or a sequence of images. This layer holds the raw input of the image with width 32, height 32, and depth 3.
Convolutional Layers: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are smaller matrices, usually of 2×2, 3×3, or 5×5 shape. Each slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we’ll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. It applies an element-wise activation function to the output of the convolution layer. Some common activation functions are ReLU: max(0, x), Tanh, and Leaky ReLU. The volume remains unchanged, hence the output volume will have dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the ConvNet, and its main function is to reduce the size of the volume, which makes the computation fast, reduces memory use, and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12 (these shapes are verified in the sketch after this list).
Flattening: The resulting feature maps are flattened into a one-dimensional vector after the convolution and pooling layers so they can be passed into a fully connected layer for classification or regression.
Fully Connected Layers: This layer takes the input from the previous layer and computes the final classification or regression task.
Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, such as sigmoid or softmax, which converts the output for each class into a probability score.
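These shapes can be checked directly; here is a small PyTorch sketch (padding = 1 is assumed so that a 3x3 convolution preserves the 32x32 size stated above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                      # 32 x 32 x 3 input

conv = nn.Conv2d(3, 12, kernel_size=3, padding=1)  # 12 filters
act = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(conv(x).shape)             # torch.Size([1, 12, 32, 32]) -> 32 x 32 x 12
print(act(conv(x)).shape)        # unchanged by activation: 32 x 32 x 12
print(pool(act(conv(x))).shape)  # torch.Size([1, 12, 16, 16]) -> 16 x 16 x 12
```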
Introduction
In the last few years, there has been a huge demand in the IT industry for one particular skill set: Deep Learning. Deep Learning is a subset of Machine Learning consisting of algorithms inspired by the functioning of the human brain; these structures are called neural networks. It teaches the computer to do what comes naturally to humans. In deep learning, there are several types of models, such as Artificial Neural Networks (ANN), Autoencoders, Recurrent Neural Networks (RNN), and Reinforcement Learning. But one particular model has contributed a lot in the fields of computer vision and image analysis: the Convolutional Neural Network (CNN), or ConvNet.
CNNs are very useful as they minimize human effort by automatically detecting features. For example, to distinguish apples from mangoes, a CNN would automatically detect the distinct features of each class on its own.
CNNs are a class of deep neural networks that can recognize and classify particular features from images and are widely used for analyzing visual images. Their applications range from image and video recognition, image classification, and medical image analysis to computer vision and natural language processing.
CNNs achieve high accuracy, which is why they are useful in image recognition. Image recognition has a wide range of uses in various industries, such as medical image analysis, phones, security, and recommendation systems.
The term ‘convolution’ in CNN denotes the mathematical operation of convolution, a special kind of linear operation wherein two functions are multiplied to produce a third function that expresses how the shape of one function is modified by the other. In simple terms, two images, which can be represented as matrices, are multiplied to give an output that is used to extract features from the image.
Convolutional Neural Networks (CNNs) are deep learning models that extract features from images using convolutional
layers, followed by pooling and fully connected layers for tasks like image classification. They excel in capturing spatial
hierarchies and patterns, making them ideal for analyzing visual data.
A CNN architecture has two main parts:
- A convolution tool that separates and identifies the various features of the image for analysis, in a process called feature extraction. The feature-extraction network consists of many pairs of convolutional and pooling layers.
- A fully connected layer that uses the output of the convolution process and predicts the class of the image based on the features extracted in previous stages.
This feature-extraction stage of the CNN model aims to reduce the number of features present in a dataset: it creates new features which summarize the existing features contained in the original set. There are many CNN layers, as shown in a basic CNN architecture.
A CNN architecture is built from three types of layers: convolutional layers, pooling layers, and fully connected (FC) layers. When these layers are stacked, a CNN architecture is formed. In addition to these three layers, there are two more important components, the dropout layer and the activation function, which are defined below.
1. Convolutional Layer
This layer is the first layer that is used to extract the various features from the input images. In this layer, the mathematical
operation of convolution is performed between the input image and a filter of a particular size MxM. By sliding the filter
over the input image, the dot product is taken between the filter and the parts of the input image with respect to the size of
the filter (MxM).
The output is termed the feature map, which gives us information about the image such as corners and edges. Later, this feature map is fed to other layers to learn several other features of the input image.
The convolution layer in a CNN passes the result to the next layer after applying the convolution operation to the input. Convolutional layers benefit CNNs a lot, as they ensure that the spatial relationships between pixels stay intact.
2. Pooling Layer
In most cases, a convolutional layer is followed by a pooling layer. The primary aim of this layer is to decrease the size of the convolved feature map to reduce computational costs. This is performed by decreasing the connections between layers; the pooling layer operates independently on each feature map. Depending upon the method used, there are several types of pooling operations. Pooling basically summarizes the features generated by a convolution layer.
In max pooling, the largest element is taken from the feature map. Average pooling calculates the average of the elements in a predefined-size image section. The total sum of the elements in the predefined section is computed in sum pooling. The pooling layer usually serves as a bridge between the convolutional layer and the FC layer.
Pooling generalizes the features extracted by the convolution layer and helps the network recognize features independently of their exact position. It also reduces the computations in the network.
3. Fully Connected Layer
The Fully Connected (FC) layer consists of weights and biases along with neurons and is used to connect the neurons between two different layers. These layers are usually placed before the output layer and form the last few layers of a CNN architecture.
In this stage, the input from the previous layers is flattened and fed to the FC layer. The flattened vector then passes through a few more FC layers, where most of the mathematical operations take place; this is where the classification process begins. Two fully connected layers are often used because they tend to perform better than a single one. These layers in a CNN reduce the need for human supervision.
4. Dropout
Usually, when all the features are connected to the FC layer, the model can overfit the training dataset. Overfitting occurs when a particular model works so well on the training data that it negatively impacts the model’s performance on new data.
To overcome this problem, a dropout layer is utilized, wherein a few neurons are dropped from the neural network during the training process, resulting in a reduced model size. On passing a dropout rate of 0.3, 30% of the nodes are dropped out randomly from the neural network.
Dropout results in improving the performance of a machine learning model as it prevents overfitting by making the network
simpler. It drops neurons from the neural networks during training.
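A minimal PyTorch sketch of this behavior (note that PyTorch’s nn.Dropout takes the drop probability p, i.e. 1 − keep_prob):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.3)  # drop 30% of activations during training
x = torch.ones(1, 10)

drop.train()
print(drop(x))  # roughly 30% zeros; survivors scaled by 1 / (1 - 0.3)

drop.eval()
print(drop(x))  # at inference, dropout is a no-op: all ones
```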
5. Activation Functions
Finally, one of the most important components of the CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between variables of the network. In simple words, they decide which information should be passed forward and which should not at the end of the network.
They add non-linearity to the network. There are several commonly used activation functions, such as ReLU, Softmax, tanh, and Sigmoid. Each of these functions has a specific usage: for a binary classification CNN model, sigmoid and softmax functions are preferred, and for multi-class classification, softmax is generally used. In simple terms, activation functions in a CNN model determine whether a neuron should be activated or not; they decide whether its input is important for the prediction, using mathematical operations.
ReLU (Rectified Linear Unit) is a popular activation function used in Convolutional Neural Networks (CNNs). It introduces
non-linearity by outputting the input directly if it’s positive and zero otherwise, helping models to learn complex patterns.
In 1998, the LeNet-5 architecture was introduced in a research paper titled “Gradient-Based Learning Applied to Document Recognition” by Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. It is one of the earliest and most basic CNN architectures.
It consists of 7 layers. The first layer takes an input image with dimensions of 32×32. It is convolved with 6 filters of size 5×5, resulting in dimensions of 28x28x6. The second layer is a pooling operation with a filter size of 2×2 and a stride of 2. Hence the resulting image dimensions will be 14x14x6.
Similarly, the third layer involves a convolution operation with 16 filters of size 5×5, followed by a fourth pooling layer with a similar filter size of 2×2 and a stride of 2. Thus, the resulting image dimensions will be reduced to 5x5x16.
Once the image dimension is reduced, the fifth layer is a fully connected convolutional layer with 120 filters each of size
5×5. In this layer, each of the 120 units in this layer will be connected to the 400 (5x5x16) units from the previous layers.
The sixth layer is also a fully connected layer with 84 units.
The final seventh layer will be a softmax output layer with ‘n’ possible classes depending upon the number of classes in the
dataset.
These are the 7 layers of the LeNet-5 CNN architecture.
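A PyTorch sketch of these layer sizes (activations are simplified to Tanh, average pooling stands in for the original subsampling, and n = 10 classes is assumed):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),   # 32x32 -> 28x28x6
    nn.AvgPool2d(kernel_size=2, stride=2),       # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),  # -> 10x10x16
    nn.AvgPool2d(kernel_size=2, stride=2),       # -> 5x5x16
    nn.Flatten(),                                # -> 400 (5x5x16)
    nn.Linear(400, 120), nn.Tanh(),              # fifth layer: 120 units
    nn.Linear(120, 84), nn.Tanh(),               # sixth layer: 84 units
    nn.Linear(84, 10),                           # softmax output layer
)

x = torch.randn(1, 1, 32, 32)
print(lenet5(x).shape)  # torch.Size([1, 10])
```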
Convolutional Layer
In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural networks typically used to recognize patterns present in images, but they are also used for spatial data analysis, computer vision, natural language processing, signal processing, and various other purposes. The architecture of a convolutional network resembles the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex. This specific type of artificial neural network gets its name from one of the most important operations in the network: convolution.
What Is a Convolution?
Convolution is an orderly procedure in which two sources of information are intertwined; it’s an operation that changes a function into something else. Convolutions have long been used in image processing to blur and sharpen images, but also to perform other operations (e.g., enhancing edges and embossing). CNNs enforce a local connectivity pattern between neurons of adjacent layers.
CNNs make use of filters (also known as kernels), to detect what features, such as edges, are present throughout an image.
There are four main operations in a CNN: convolution, non-linearity (e.g., ReLU), pooling, and classification via fully connected layers. The first of these is described below.
Convolution
The first layer of a convolutional neural network is always a convolutional layer. Convolutional layers apply a convolution operation to the input, passing the result to the next layer. A convolution converts all the pixels in its receptive field into a single value. For example, if you apply a convolution to an image, you decrease the image size while bringing all the information in the receptive field together into a single pixel. The final output of the convolutional layer is a vector. Based on the type of problem we need to solve and on the kind of features we are looking to learn, we can use different kinds of convolutions.
The most common type of convolution is the 2D convolution layer, usually abbreviated as conv2D. A filter or kernel in a conv2D layer “slides” over the 2D input data, performing an element-wise multiplication and summing the results into a single output pixel. The kernel performs the same operation for every location it slides over, transforming a 2D matrix of features into a different 2D matrix of features.
Separable Convolutions
There are two main types of separable convolutions: spatial separable convolutions, and depthwise separable convolutions.
The spatial separable convolution deals primarily with the spatial dimensions of an image and kernel, the width and the height: it factors a kernel into two smaller kernels (for example, a 3×3 kernel into a 3×1 and a 1×3 kernel). Depthwise separable convolutions, by contrast, work with kernels that cannot be “factored” into two smaller kernels: they instead split the computation into a depthwise convolution followed by a pointwise (1×1) convolution. As a result, they are more frequently used.
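A PyTorch sketch comparing the parameter counts of a standard convolution and a depthwise separable one (channel counts are illustrative):

```python
import torch.nn as nn

in_ch, out_ch = 32, 64

# Standard 3x3 convolution for comparison.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable convolution: a per-channel (depthwise) convolution
# followed by a 1x1 (pointwise) convolution that mixes channels.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard))                       # 18496
print(params(depthwise) + params(pointwise))  # 2432: far fewer parameters
```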
Transposed Convolutions
These types of convolutions are also known as deconvolutions or fractionally strided convolutions. A transposed
convolutional layer carries out a regular convolution but reverts its spatial transformation.
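A small PyTorch sketch: a stride-2 convolution halves the spatial size, and a transposed convolution with the same settings restores it:

```python
import torch
import torch.nn as nn

down = nn.Conv2d(8, 16, kernel_size=2, stride=2)      # halves spatial size
up = nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2)  # reverts it

x = torch.randn(1, 8, 32, 32)
y = down(x)
print(y.shape)      # torch.Size([1, 16, 16, 16])
print(up(y).shape)  # torch.Size([1, 8, 32, 32]) -- original spatial size
```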
The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarizing the features lying within the region covered by the filter.
For a feature map having dimensions n_h x n_w x n_c, the dimensions of the output obtained after a pooling layer with filter size f and stride s are

((n_h − f)/s + 1) x ((n_w − f)/s + 1) x n_c,

where n_h, n_w, and n_c are the height, width, and number of channels of the feature map, f is the size of the filter, and s is the stride length.
A common CNN model architecture is to have a number of convolution and pooling layers stacked one after the other.
Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to
learn and the amount of computation performed in the network.
The pooling layer summarises the features present in a region of the feature map generated by a convolution layer.
So, further operations are performed on summarised features instead of precisely positioned features generated by
the convolution layer. This makes the model more robust to variations in the position of the features in the input
image.
Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after a max-pooling layer is a feature map containing the most prominent features of the previous feature map.
Average Pooling
Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in that patch.
Global Pooling
Global pooling reduces each channel in the feature map to a single value. Thus, an n_h x n_w x n_c feature map is reduced to a 1 x 1 x n_c feature map. This is equivalent to using a filter of dimensions n_h x n_w, i.e., the dimensions of the feature map.
Further, it can be either global max pooling or global average pooling.
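A PyTorch sketch of these pooling variants on an illustrative 8 x 8 x 12 feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 12, 8, 8)  # an 8 x 8 x 12 feature map

print(nn.MaxPool2d(2)(x).shape)  # torch.Size([1, 12, 4, 4])
print(nn.AvgPool2d(2)(x).shape)  # torch.Size([1, 12, 4, 4])

# Global average pooling: each channel reduced to one value (1 x 1 x nc).
print(nn.AdaptiveAvgPool2d(1)(x).shape)  # torch.Size([1, 12, 1, 1])
# Global max pooling works the same way with AdaptiveMaxPool2d.
```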
In convolutional neural networks (CNNs), the pooling layer is a common type of layer that is typically added after
convolutional layers. The pooling layer is used to reduce the spatial dimensions (i.e., the width and height) of the feature
maps, while preserving the depth (i.e., the number of channels).
1. The pooling layer works by dividing the input feature map into a set of non-overlapping regions, called pooling
regions. Each pooling region is then transformed into a single output value, which represents the presence of a
particular feature in that region. The most common types of pooling operations are max pooling and average
pooling.
2. In max pooling, the output value for each pooling region is simply the maximum value of the input values within
that region. This has the effect of preserving the most salient features in each pooling region, while discarding less
relevant information. Max pooling is often used in CNNs for object recognition tasks, as it helps to identify the
most distinctive features of an object, such as its edges and corners.
3. In average pooling, the output value for each pooling region is the average of the input values within that region.
This has the effect of preserving more information than max pooling, but may also dilute the most salient features.
Average pooling is often used in CNNs for tasks such as image segmentation and object detection, where a more
fine-grained representation of the input is required.
Pooling layers are typically used in conjunction with convolutional layers in a CNN, with each pooling layer reducing the
spatial dimensions of the feature maps, while the convolutional layers extract increasingly complex features from the input.
The resulting feature maps are then passed to a fully connected layer, which performs the final classification or regression
task.
Advantages of pooling layers:
1. Dimensionality reduction: The main advantage of pooling layers is that they help in reducing the spatial dimensions of the feature maps. This reduces the computational cost and also helps in avoiding overfitting by reducing the number of parameters in the model.
2. Translation invariance: Pooling layers are also useful in achieving translation invariance in the feature maps. This
means that the position of an object in the image does not affect the classification result, as the same features are
detected regardless of the position of the object.
3. Feature selection: Pooling layers can also help in selecting the most important features from the input, as max
pooling selects the most salient features and average pooling preserves more information.
Disadvantages of pooling layers:
1. Information loss: One of the main disadvantages of pooling layers is that they discard some information from the input feature maps, which can be important for the final classification or regression task.
2. Over-smoothing: Pooling layers can also cause over-smoothing of the feature maps, which can result in the loss of
some fine-grained details that are important for the final classification or regression task.
3. Hyperparameter tuning: Pooling layers also introduce hyperparameters such as the size of the pooling regions and
the stride, which need to be tuned in order to achieve optimal performance. This can be time-consuming and
requires some expertise in model building.
Variants of the Basic Convolution Function
Convolution in the context of neural networks means an operation that consists of many applications of convolution in parallel, with a kernel K whose element K_{i,l,m,n} gives the connection strength between a unit in channel i of the output and a unit in channel l of the input, with an offset of m rows and n columns between the output unit and the input unit.
Full Convolution
Zero padding, stride 1:

Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}

Zero padding, stride s:

Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} [V_{l, s(j-1)+m, s(k-1)+n} K_{i,l,m,n}]
Convolution with a stride greater than 1 pixel is equivalent to convolution with stride 1 followed by downsampling.
Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.
Special cases of zero padding:
Valid: no zero padding is used, so every output pixel is a function of valid input pixels only, and the representation shrinks at each layer.
Same: enough zeros are added to keep the size of the output equal to the size of the input, allowing an unlimited number of layers. Pixels near the border influence fewer output pixels than pixels near the center.
Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k − 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.
Usually the optimal amount of zero padding lies somewhere between ‘valid’ and ‘same’, as the sketch below illustrates.
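A PyTorch sketch of how these padding choices affect the output size (kernel size and input size are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 28, 28)
k = 5

valid = nn.Conv2d(3, 8, kernel_size=k, padding=0)      # no zero padding
same = nn.Conv2d(3, 8, kernel_size=k, padding=k // 2)  # 'same' padding
full = nn.Conv2d(3, 8, kernel_size=k, padding=k - 1)   # 'full' padding

print(valid(x).shape)  # torch.Size([1, 8, 24, 24]): shrinks by k - 1
print(same(x).shape)   # torch.Size([1, 8, 28, 28]): size preserved
print(full(x).shape)   # torch.Size([1, 8, 32, 32]): grows to m + k - 1
```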
Unshared Convolution
In some cases we do not want to use convolution but rather a locally connected layer; then we use unshared convolution, with indices into a weight tensor W:

Z_{i,j,k} = \sum_{l,m,n} [V_{l, j+m-1, k+n-1} W_{i,j,k,l,m,n}]

This is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g., looking for a mouth only in the bottom half of an image.
It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, e.g., constraining each output channel i to be a function of only a subset of the input channels. Advantages:
- reduced memory consumption
- increased statistical efficiency
- reduced computation for both forward and backward propagation
Tiled Convolution
Learn a set of kernels that we rotate through as we move through space. Immediately neighboring locations will have different filters, but the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels. Comparing locally connected layers, tiled convolution, and standard convolution: with K a 6-D tensor and t different choices of kernel stack,

Z_{i,j,k} = \sum_{l,m,n} [V_{l, j+m-1, k+n-1} K_{i,l,m,n, j\%t+1, k\%t+1}]

Locally connected layers and tiled convolutional layers interact interestingly with max pooling: the detector units of these layers are driven by different filters. If the filters learn to detect different transformed versions of the same underlying features, then the max-pooled units become invariant to the learned transformation.
Review: backpropagation in a convolution layer. Let K be the kernel stack, V the input image, and G the gradient on the output Z.
Bias after convolution
We generally add a bias term to each output before applying the nonlinearity.
For locally connected layers: give each unit its own bias.
For tiled convolution layers: share the biases with the same tiling pattern as the kernels.
For convolution layers: have one bias per channel of the output and share it across all locations within each convolution map. If the input is of fixed size, it is also possible to learn a separate bias at each location of the output map.
So far, we have trained our models with minibatch stochastic gradient descent. However, when we implemented the
algorithm, we only worried about the calculations involved in forward propagation through the model. When it came time to
calculate the gradients, we just invoked the backpropagation function provided by the deep learning framework.
The automatic calculation of gradients profoundly simplifies the implementation of deep learning algorithms. Before
automatic differentiation, even small changes to complicated models required recalculating complicated derivatives by hand.
Surprisingly often, academic papers had to allocate numerous pages to deriving update rules. While we must continue to rely
on automatic differentiation so we can focus on the interesting parts, you ought to know how these gradients are calculated
under the hood if you want to go beyond a shallow understanding of deep learning.
In this section, we take a deep dive into the details of backward propagation (more commonly called backpropagation). To convey some insight for both the techniques and their implementations, we rely on some basic mathematics and computational graphs. To start, we focus our exposition on a one-hidden-layer MLP with weight decay (ℓ2 regularization, described below).
Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables (including outputs) for a neural network in order from the input layer to the output layer. We now work step-by-step through the mechanics of a neural network with one hidden layer. This may seem tedious but, in the eternal words of funk virtuoso James Brown, you must “pay the cost to be the boss”.
For the sake of simplicity, let’s assume that the input example is x ∈ R^d and that our hidden layer does not include a bias term. The intermediate variable is:

(5.3.1) z = W^{(1)} x,

where W^{(1)} ∈ R^{h×d} is the weight parameter of the hidden layer. After running the intermediate variable z ∈ R^h through the activation function ϕ, we obtain our vector of hidden activations of length h:

(5.3.2) h = ϕ(z).
The hidden layer output h is also an intermediate variable. Assuming that the parameters of the output layer possess only a weight of W^{(2)} ∈ R^{q×h}, we can obtain an output layer variable with a vector of length q:

(5.3.3) o = W^{(2)} h.

Assuming that the loss function is l and the example label is y, we can then calculate the loss term for a single data example:

(5.3.4) L = l(o, y).

Per the definition of ℓ2 regularization introduced later, given the hyperparameter λ, the regularization term is

(5.3.5) s = (λ/2) (‖W^{(1)}‖²_F + ‖W^{(2)}‖²_F),

where the Frobenius norm of a matrix is simply the ℓ2 norm applied after flattening the matrix into a vector. Finally, the model’s regularized loss on a given data example is:

(5.3.6) J = L + s.
Plotting computational graphs helps us visualize the dependencies of operators and variables within the calculation. Fig.
5.3.1 contains the graph associated with the simple network described above, where squares denote variables and circles
denote operators. The lower-left corner signifies the input and the upper-right corner is the output. Notice that the directions
of the arrows (which illustrate data flow) are primarily rightward and upward.
5.3.3. Backpropagation
Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method
traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The
algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some
parameters. Assume that we have functions Y = f(X) and Z = g(Y), in which the input and the output X, Y, Z are tensors of arbitrary shapes. By using the chain rule, we can compute the derivative of Z with respect to X via

∂Z/∂X = prod(∂Z/∂Y, ∂Y/∂X).

Here we use the prod operator to multiply its arguments after the necessary operations, such as transposition and swapping input positions, have been carried out. For vectors, this is straightforward: it is simply matrix–matrix multiplication. For higher-dimensional tensors, we use the appropriate counterpart. The operator prod hides all the notational overhead.
Recall that the parameters of the simple network with one hidden layer, whose computational graph is in Fig. 5.3.1, are W^{(1)} and W^{(2)}. The objective of backpropagation is to calculate the gradients ∂J/∂W^{(1)} and ∂J/∂W^{(2)}. To accomplish this, we apply the chain rule and calculate, in turn, the gradient of each intermediate variable and parameter. The order of calculations is reversed relative to those performed in forward propagation, since we need to start with the outcome of the computational graph and work our way towards the parameters. The first step is to calculate the gradients of the objective function J = L + s with respect to the loss term L and the regularization term s:

∂J/∂L = 1 and ∂J/∂s = 1.
When training neural networks, forward and backward propagation depend on each other. In particular, for forward
propagation, we traverse the computational graph in the direction of dependencies and compute all the variables on its path.
These are then used for backpropagation where the compute order on the graph is reversed.
Take the aforementioned simple network as an illustrative example. On the one hand, computing the regularization term (5.3.5) during forward propagation depends on the current values of the model parameters W^{(1)} and W^{(2)}. They are given by the optimization algorithm according to backpropagation in the most recent iteration. On the other hand, the gradient calculation for the parameters (5.3.11) during backpropagation depends on the current value of the hidden layer output h, which is only available via forward propagation.
Therefore when training neural networks, once model parameters are initialized, we alternate forward propagation with
backpropagation, updating model parameters using gradients given by backpropagation. Note that backpropagation reuses
the stored intermediate values from forward propagation to avoid duplicate calculations. One of the consequences is that we
need to retain the intermediate values until backpropagation is complete. This is also one of the reasons why training requires
significantly more memory than plain prediction. Besides, the size of such intermediate values is roughly proportional to the
number of network layers and the batch size. Thus, training deeper networks using larger batch sizes more easily leads
to out-of-memory errors.
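A minimal PyTorch sketch of this interplay for the one-hidden-layer network above: forward propagation builds J = L + s while storing intermediates, and a single backward() call backpropagates to produce the parameter gradients (the sizes, the ReLU choice of ϕ, and cross-entropy as l are illustrative assumptions):

```python
import torch

d, h, q, lam = 4, 5, 3, 0.01
W1 = torch.randn(h, d, requires_grad=True)
W2 = torch.randn(q, h, requires_grad=True)

x = torch.randn(d)
y = torch.tensor(1)

z = W1 @ x                     # (5.3.1)
h_act = torch.relu(z)          # (5.3.2), taking phi = ReLU
o = W2 @ h_act                 # (5.3.3)
L = torch.nn.functional.cross_entropy(o.unsqueeze(0), y.unsqueeze(0))  # (5.3.4)
s = (lam / 2) * (W1.pow(2).sum() + W2.pow(2).sum())                    # (5.3.5)
J = L + s                      # (5.3.6)

J.backward()                   # backpropagation through the stored graph
print(W1.grad.shape, W2.grad.shape)  # gradients dJ/dW1 and dJ/dW2
```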
5.3.5. Summary
Forward propagation sequentially calculates and stores intermediate variables within the computational graph defined by the
neural network. It proceeds from the input to the output layer. Backpropagation sequentially calculates and stores the
gradients of intermediate variables and parameters within the neural network in the reversed order. When training deep
learning models, forward propagation and backpropagation are interdependent, and training requires significantly more memory than prediction.
Building a deep neural network (DNN) is a rewarding project that involves multiple steps, including data preprocessing,
model architecture design, training, and evaluation. Here's a step-by-step guide to help you get started:
- **Select a problem**: You need to define the task you're solving, such as image classification, natural language processing, or regression.
- **Choose a dataset**: Depending on the problem, you can select publicly available datasets or use your own. Some good
sources:
- **Kaggle**: Hosts a variety of datasets for machine learning projects.
- **Cleaning**: Remove missing values, normalize or standardize features, and handle categorical variables.
- **Data Augmentation (if using images)**: Apply techniques such as flipping, rotating, and scaling to generate more
samples.
- **Splitting the dataset**: Split your data into training, validation, and test sets.
- **Hidden Layers**: Fully connected (dense) layers between the input and output. You can start with 1-2 hidden layers and add more as needed.
- **Output Layer**: The final layer's size depends on the problem (e.g., softmax for classification).
- **Activation Functions**:
- Common choices are **ReLU** for hidden layers and **softmax** or **sigmoid** for the output layer.
- **Loss Function**: Measures how far the model's outputs are from the true targets (e.g., cross-entropy for classification, MSE for regression) and is what training minimizes to improve the
predictions.
- **Hyperparameter Tuning**:
- Adjust learning rate, batch size, number of epochs, and the architecture (e.g., number of layers or neurons).
- You can use techniques like **grid search** or **random search** for tuning.
### 5. **Evaluation**
- Use metrics like **accuracy** for classification, **MSE** for regression, and track the **loss** during training.
- **Save the model**: Use libraries like TensorFlow’s `save_model()` or PyTorch’s `torch.save()` to save your trained
model.
- **Deployment**: You can use frameworks like **Flask** or **FastAPI** to deploy your model as a web service.
- **Matplotlib** and **Seaborn** for visualizing the data and performance metrics.
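Putting the steps together, here is a minimal PyTorch sketch that builds, trains, and saves a small DNN; the synthetic tensors, layer sizes, and hyperparameters are placeholders for a real dataset and tuned values:

```python
import torch
import torch.nn as nn

X = torch.randn(200, 20)           # 200 samples, 20 features (synthetic)
y = torch.randint(0, 2, (200,))    # binary labels (synthetic)

model = nn.Sequential(             # 2 hidden layers with ReLU
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),              # output layer (2 classes)
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):            # number of epochs: a hyperparameter
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)    # forward pass + loss
    loss.backward()                # backpropagation
    optimizer.step()               # weight update

torch.save(model.state_dict(), "model.pt")  # save the trained model
```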
Would you like help with a specific part of building the DNN, such as writing code, designing the architecture, or evaluating
the model?
In theory, building a deep neural network (DNN) involves understanding the underlying concepts and principles that guide
the design and functioning of the network. Here’s a theoretical breakdown of how a DNN works and how you would go
about building one:
A neural network is a computational model inspired by the way biological neurons in the brain work. It consists of layers of
nodes (neurons) that process input data and generate output.
- **Neuron**: The basic unit of a neural network, a neuron receives input, processes it (using a mathematical function), and
passes the result to the next layer of neurons.
- **Input layer**: The first layer that receives the raw data.
A **deep** neural network simply refers to a network with multiple hidden layers.
Each neuron computes a weighted sum of its inputs:

\[ z = w \cdot x + b \]

Where:
- \( w \) denotes the weights, \( x \) the inputs, and \( b \) the bias.
The network adjusts the weights and biases during training to minimize errors in its predictions.
After computing \( z \), an activation function is applied to introduce non-linearity, allowing the network to learn more
complex patterns.
- **ReLU (Rectified Linear Unit)**: Most commonly used activation function in hidden layers.
- \( f(z) = \text{max}(0, z) \)
- **Sigmoid**: Typically used in binary classification tasks.
  - \( f(z) = \frac{1}{1 + e^{-z}} \)
- **Softmax**: Used for multi-class classification to output probabilities for each class.
- \( f(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n}e^{z_j}} \)
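A small NumPy sketch of these three activation functions:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z))     # [0. 0. 2.]
print(sigmoid(z))  # values in (0, 1)
print(softmax(z))  # probabilities summing to 1
```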
- **Input Layer**: Receives the input features (e.g., pixel values in an image, or tabular data).
- **Hidden Layers**: Perform feature extraction and transformation through matrix multiplications, bias addition, and
activation functions.
- **Output Layer**: Outputs the final result (class probabilities or regression values).
The depth of the network (number of hidden layers) allows it to learn hierarchical representations of the data. For instance, in
image classification, the first layer might learn simple edges, the second layer more complex shapes, and so on.
#### a. **Forward Propagation**
During forward propagation, input data passes through the network, layer by layer, to produce a prediction.
- Input from the previous layer is multiplied by weights, summed with bias, and passed through an activation function.
#### b. **Loss Function**
Once the network makes a prediction, the **loss function** compares the prediction with the actual target to measure how far off the prediction is. Common loss functions:
- **Mean Squared Error (MSE)**: for regression tasks.
- **Cross-Entropy Loss**: for classification tasks.
#### c. **Backpropagation**
Backpropagation is the process of adjusting the weights in the network to minimize the loss. It works by:
1. Calculating the **gradient** of the loss function with respect to each weight (using the chain rule of calculus).
2. Propagating these gradients backward through the network, layer by layer.
3. Updating the weights using an optimization algorithm (e.g., Stochastic Gradient Descent, Adam).
- **Learning Rate**: Controls how much to change the weights at each step. Too high, and the network might overshoot the
optimal weights; too low, and the learning process becomes slow.
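A one-line sketch of the resulting update rule, with a hypothetical gradient value:

```python
# One plain gradient-descent update for a single weight, assuming the
# gradient dL/dw has already been computed by backpropagation.
w = 0.5
grad = 0.8           # hypothetical gradient of the loss w.r.t. w
learning_rate = 0.1

w = w - learning_rate * grad
print(w)             # 0.42
```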
The training process consists of multiple **epochs**, where the entire dataset passes through the network multiple times.
Each pass helps the network to progressively improve its predictions.
- **Capacity**: The number of neurons and layers determines the capacity of a DNN. A network with more layers and
neurons can model more complex patterns in the data.
- **Overfitting**: If the network has too much capacity, it may memorize the training data and perform poorly on new data.
This is known as overfitting.
- **Regularization**: Techniques like **dropout** (randomly setting neurons to zero during training) or **L2
regularization** (penalizing large weights) can be used to reduce overfitting.
After training the model, it’s important to evaluate its performance using a separate test set (data not seen during training) to
ensure it generalizes well to new data.
- **Hierarchical Learning**: DNNs automatically learn a hierarchy of features from raw input. In tasks like image
recognition, this is incredibly powerful.
- **Universal Approximation**: With enough neurons and layers, DNNs can approximate any continuous function, giving
them immense flexibility in modeling complex data.
- **Vanishing/Exploding Gradients**: In deep networks, gradients can become too small (vanish) or too large (explode)
during backpropagation, making training difficult.
- **Computational Complexity**: Training deep networks requires significant computational power and memory.
- **Data Requirements**: DNNs often require large amounts of labeled data to perform well.
---
This theoretical framework should give you a strong foundation in understanding how deep neural networks operate and
what’s involved in building them. Would you like to dive into any particular area, such as architecture design or optimization
techniques?
Deep neural networks are the solution to complex tasks like Natural Language Processing, Computer Vision, Speech Synthesis, etc. Improving their performance is as important as understanding how they work; for the latter, you can refer to my previous posts. In this post, I will explain various terminologies and methods related to improving neural networks.
Bias and variance are two essential terminologies that explain how well the network performs on the training set and the test set. Let us understand bias and variance easily and intuitively using a two-class problem, picturing the decision boundary computed by the neural network.
1. The leftmost figure shows that the neural network has the problem of High Bias. In this case, the network has
learned a simple hypothesis and is therefore not able to train properly on the training data. As a result, it is not able
to differentiate between the examples of different classes and will perform poorly on the training set and test set
both. We can also say that the network is Underfitting.
2. The rightmost figure shows that the neural network has the problem of High Variance. In this case, the network
has learned a very complex hypothesis and therefore is not able to generalise. As a result, it will perform great on
training data, whereas poor on the test data. We can also say that the network is Overfitting.
3. The centre figure shows a “Just Right” neural network. It has learned the ideal hypothesis, which helps the network filter out anomalies and also generalise over the data. Our goal should be to achieve such a network.
Now that we know what kind of neural network is desirable; let us see how we can achieve our goal. The steps first tackle
the bias problem and then the variance problem.
The first question that we should ask is “Is there a High Bias?” If the answer is YES, then we should try the following steps:
Train a bigger network. It includes increasing the number of hidden layers and the number of neurons in the hidden
layers.
Train the network for an extended period of time. It may be the case that the full training has not been completed
yet and will take more iterations.
Try a different optimisation algorithm. These algorithms include Adam, Momentum, AdaDelta etc.
Perform the above steps iteratively until the bias problem is solved and then move on to the second question.
If the answer is NO, it means that we have overcome the bias problem, and it is time to focus on the variance problem. The
second question that we should ask now is “Is there a High Variance?” If the answer is YES, then we should try the
following steps:
Gather more training data. As we gather more data, we will get more variation in the data, and a complex hypothesis learned from less varied data will break down.
Perform the above steps iteratively until the variance problem is solved.
If the answer is NO, it means that we have overcome the variance problem, and now our Neural Network is “Just Right”.
Regularization
Regularization is a technique which helps to reduce overfitting in a neural network. When we add regularization to our network, we add a new regularization term, and the loss function is modified. The modified cost function J is mathematically formulated as:

\[ J = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l} \|W^{[l]}\|_F^2 \]
The second term with lambda is known as the regularization term. The term ||W|| is known as Frobenius Norm (sum of
squares of elements in a matrix). With the inclusion of regularization, lambda becomes a new hyperparameter that can be
modified to improve the performance of the neural network. The above regularization is also known as L-2 regularization.
Since there is a new regularization term in the modified cost function J, we update the weights in the following manner:

\[ W = W - \alpha \left( \frac{\partial L}{\partial W} + \frac{\lambda}{m} W \right) = \left( 1 - \frac{\alpha \lambda}{m} \right) W - \alpha \frac{\partial L}{\partial W} \]

Here we can see that the weight value is multiplied by a factor slightly less than 1 at every update. Therefore, we also call this type of regularization weight decay. The decay factor depends on the learning rate alpha and the regularization term lambda.
The end goal of training a neural network is to minimize Cost Function J and hence the regularization term. Now that we
know what regularization is, let us try to understand why it works.
The first intuition is that if we increase the value of lambda, the weight values are pushed close to 0 and the Frobenius norm becomes small. This effectively wipes out certain neurons, making the network shallower. It may be thought of as converting a deep network that learns a complex hypothesis into a shallow network that learns a simple hypothesis. As we know, a simple hypothesis leads to less complex features, so the overfitting will be reduced, and we will obtain a “Just Right” neural network.
Another intuition can be gained from the way the activation of a neuron works when regularization is applied. For this, let us
consider tanh(x) activation.
If we increase the value of lambda, the Frobenius norm becomes small, i.e., the weights W become small. Due to this, the output of that layer will be small and will lie in the central, nearly linear region of the activation function. Because the activation is almost linear there, the network will behave like a shallow network, i.e., it will not learn complex hypotheses (sharp curves will be avoided); the overfitting will eventually reduce, and we will obtain a “Just Right” neural network.
Therefore, too small a value of lambda will result in overfitting, as the Frobenius norm stays large: neurons are not wiped out, and the outputs of the layers do not lie in the linear region. Similarly, an excessively large value of lambda will result in underfitting. Finding the right value of lambda is therefore a crucial task in improving the performance of a neural network, as in the sketch below.
Dropout Regularization
Dropout regularization is another regularization technique in which we drop certain neurons, along with their connections, from the neural network during training. The hyperparameter keep_prob is the probability that a neuron is kept, so each neuron is dropped with probability 1 − keep_prob. After the neurons are removed, the network is trained on the remaining neurons. It is important to note that at test/inference time no neurons are dropped; all of them take part in determining the output. Let us try to understand the concept with the help of an example:
Since we first drop neurons with probability 1 − keep_prob and then boost the surviving activations by dividing them by keep_prob, this type of Dropout is known as Inverted Dropout.
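As a minimal NumPy sketch of inverted dropout for one hidden layer (the activation values and the keep_prob of 0.8 here are illustrative assumptions):
# Inverted dropout on one layer's activations (illustrative sketch)
import numpy as np
keep_prob = 0.8                                  # probability of keeping each neuron
a = np.random.randn(4, 5)                        # hypothetical hidden-layer activations
mask = np.random.rand(*a.shape) < keep_prob      # keep each neuron with probability keep_prob
a = a * mask                                     # drop roughly 20% of the neurons
a = a / keep_prob                                # boost survivors so the expected activation is unchanged
# At test/inference time no mask is applied; every neuron contributes to the output.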
The intuition behind dropout is that it prevents neurons from relying too heavily on particular features, so the weights get spread out. Without dropout, a neuron may become dependent on certain input features to determine its output. With dropout regularization, a particular neuron receives only a subset of the features as input each time across different training examples. Eventually, the weights are spread out amongst all the inputs, and the network uses all the input features rather than relying on any single one, making it more robust. Dropout is also known as an adaptive form of L2 regularization.
We can also set keep_prob individually for each layer. Since the number of neurons dropped grows as keep_prob shrinks, the general criterion is to give densely connected layers a relatively low keep_prob, so that more of their neurons are dropped, and vice versa.
Another intuition is that with Dropout Regularization, the deep network mimics the working of a shallow network during the
training phase. This, in turn, leads to reducing overfitting, and we obtain a “Just Right” Neural Network.
Early Stopping
Early Stopping is a training methodology in which we stop training the neural network at an earlier point in time to prevent it from overfitting. We keep track of train_loss and dev_loss to determine when to stop the training.
As soon as the dev_loss starts to rise, we stop the training process. This methodology is known as Early Stopping. However, early stopping is not a recommended method for training a network, for the following two reasons:
Early stopping couples two tasks that are better handled separately: minimizing the cost function and preventing overfitting. Stopping early means the cost J may not have been minimized enough.
Early stopping makes things complicated, and we are not able to obtain the “Just Right” Neural Network.
A deep neural network (DNN) is an ANN with multiple hidden layers between the input and output layers. Similar to
shallow ANNs, DNNs can model complex non-linear relationships.
The main purpose of a neural network is to receive a set of inputs, perform progressively complex calculations on them, and
give output to solve real world problems like classification. We restrict ourselves to feed forward neural networks.
Neural networks are widely used in supervised learning and reinforcement learning problems. These networks are based on a
set of layers connected to each other.
In deep learning, the number of hidden layers, mostly non-linear, can be large; say about 1000 layers.
We mostly use the gradient descent method for optimizing the network and minimising the loss function.
We can use ImageNet, a repository of millions of digital images, to classify a dataset into categories like cats and dogs.
DL nets are increasingly used for dynamic images apart from static ones and for time series and text analysis.
Training the data sets forms an important part of Deep Learning models. In addition, Backpropagation is the main algorithm
in training DL models.
DL deals with training large neural networks with complex input output transformations.
One example of DL is mapping a photo to the name of the person(s) in it, as social networks do; describing a picture with a phrase is another recent application of DL.
Neural networks are functions with inputs like x1, x2, x3… that are transformed into outputs like z1, z2, z3, and so on, through two (shallow networks) or several (deep networks) intermediate operations, also called layers.
The weights and biases change from layer to layer. ‘w’ and ‘v’ are the weights, or synapses, of the layers of the neural network.
The best use case of deep learning is the supervised learning problem. Here, we have a large set of data inputs with a desired set of outputs.
The most basic dataset in deep learning is MNIST, a dataset of handwritten digits.
We can train a deep Convolutional Neural Network with Keras to classify images of handwritten digits from this dataset.
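A minimal Keras sketch of such a digit classifier (the layer sizes and the three training epochs are illustrative choices, not prescriptions):
# Minimal Keras CNN for MNIST digit classification (illustrative)
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0     # scale pixels to [0, 1]
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax')                  # one output per digit class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_data=(x_test, y_test))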
The firing, or activation, of a neural net classifier produces a score. For example, to classify patients as sick or healthy, we consider parameters such as height, weight, body temperature, blood pressure, and so on.
A high score means the patient is sick and a low score means he is healthy.
Each node in output and hidden layers has its own classifiers. The input layer takes inputs and passes on its scores to the next
hidden layer for further activation and this goes on till the output is reached.
This progress from input to output from left to right in the forward direction is called forward propagation.
Credit assignment path (CAP) in a neural network is the series of transformations starting from the input to the output. CAPs
elaborate probable causal connections between the input and the output.
For a given feed-forward neural network, the CAP depth is the number of hidden layers plus one, as the output layer is included. For recurrent neural networks, where a signal may propagate through a layer several times, the CAP depth is potentially unlimited.
The basic node in a neural net is a perceptron, mimicking a neuron in a biological neural network. Then we have the multi-layer perceptron, or MLP. Each set of inputs is modified by a set of weights and biases; each edge has a unique weight and each node has a unique bias.
The prediction accuracy of a neural net depends on its weights and biases.
The process of improving the accuracy of a neural network is called training. The output from a forward-prop net is compared to the value that is known to be correct.
The cost function, or loss function, measures the difference between the generated output and the actual output.
The point of training is to make the cost as small as possible across millions of training examples. To do this, the network tweaks the weights and biases until the prediction matches the correct output.
Once trained well, a neural net has the potential to make an accurate prediction every time.
When patterns get complex and you want your computer to recognise them, you have to go for neural networks. In such complex pattern scenarios, neural networks outperform all other competing algorithms.
There are now GPUs that can train them faster than ever before. Deep neural networks are already revolutionizing the field of AI.
Computers have proved to be good at performing repetitive calculations and following detailed instructions, but not so good at recognising complex patterns.
For the recognition of simple patterns, a support vector machine (SVM) or a logistic regression classifier can do the job well, but as the complexity of the pattern increases, there is no way but to go for deep neural networks.
Therefore, for complex patterns like a human face, shallow neural networks fail, and there is no alternative but to go for deep neural networks with more layers. Deep nets are able to do their job by breaking down complex patterns into simpler ones. For a human face, for example, a deep net would use edges to detect parts like the lips, nose, eyes and ears, and then re-combine these to form a human face.
Prediction accuracy has improved to the point that, at a recent Google Pattern Recognition Challenge, a deep net beat a human.
This idea of a web of layered perceptrons has been around for some time; in this area, deep nets mimic the human brain. But one downside is that they take a long time to train, a hardware constraint.
However, recent high-performance GPUs have been able to train such deep nets in under a week, while fast CPUs could have taken weeks or perhaps months to do the same.
How do we choose a deep net? We have to decide whether we are building a classifier or trying to find patterns in the data, and whether we are going to use unsupervised learning. To extract patterns from a set of unlabelled data, we use a Restricted Boltzmann Machine or an autoencoder.
For text processing, sentiment analysis, parsing and named entity recognition, we use a recurrent net or a recursive neural tensor network (RNTN).
For any language model that operates at the character level, we use a recurrent net.
For image recognition, we use a deep belief network (DBN) or a convolutional network.
In general, deep belief networks and multilayer perceptrons with rectified linear units (ReLU) are both good choices for classification.
For time series analysis, a recurrent net is the usual recommendation.
Neural nets have been around for more than 50 years, but only now have they risen to prominence. The reason is that they are hard to train; when we try to train them with a method called back propagation, we run into a problem called vanishing or exploding gradients. When that happens, training takes longer and accuracy takes a back seat. When training, we are constantly calculating the cost function, the difference between the predicted output and the actual output from a set of labelled training data. The cost function is then minimized by adjusting the weight and bias values until the lowest value is obtained. The training process uses a gradient, which is the rate at which the cost changes with respect to a change in weight or bias values.
In 2006, a breakthrough was achieved in tackling the issue of vanishing gradients. Geoff Hinton devised a novel strategy that led to the development of the Restricted Boltzmann Machine (RBM), a shallow two-layer net.
The first layer is the visible layer and the second layer is the hidden layer. Each node in the visible layer is connected to every node in the hidden layer. The network is called restricted because no two nodes within the same layer are allowed to share a connection.
Autoencoders are networks that encode input data as vectors. They create a hidden, or compressed, representation of the raw data. The vectors are useful in dimensionality reduction: the vector compresses the raw data into a smaller number of essential dimensions. Autoencoders are paired with decoders, which allow the reconstruction of the input data from its hidden representation.
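As a minimal Keras sketch of an encoder–decoder pair, assuming flattened 28x28 images (784-dimensional inputs) and a 32-dimensional hidden code (both sizes are illustrative):
# Minimal autoencoder: compress 784-dim inputs to 32 essential dimensions (illustrative)
from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(784,))
encoded = Dense(32, activation='relu')(inputs)       # hidden, compressed representation
decoded = Dense(784, activation='sigmoid')(encoded)  # decoder reconstructs the raw input

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x, x, epochs=10)  # trained to reproduce its own input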
RBM is the mathematical equivalent of a two-way translator. A forward pass takes inputs and translates them into a set of
numbers that encodes the inputs. A backward pass meanwhile takes this set of numbers and translates them back into
reconstructed inputs. A well-trained net performs back prop with a high degree of accuracy.
In either step, the weights and biases have a critical role: they help the RBM decode the interrelationships between the inputs and decide which inputs are essential in detecting patterns. Through forward and backward passes, the RBM is trained to reconstruct the input with different weights and biases until the input and the reconstruction are as close as possible. An interesting aspect of RBMs is that the data need not be labelled. This turns out to be very important for real-world data sets like photos, videos, voices and sensor data, all of which tend to be unlabelled. Instead of humans manually labelling the data, the RBM automatically sorts through it; by properly adjusting the weights and biases, an RBM is able to extract important features and reconstruct the input. The RBM is part of a family of feature-extractor neural nets designed to recognize inherent patterns in data. These are also called autoencoders, because they have to encode their own structure.
Deep Belief Networks - DBNs
Deep belief networks (DBNs) are formed by combining RBMs and introducing a clever training method. We have a new model that finally solves the problem of the vanishing gradient. Geoff Hinton invented RBMs and Deep Belief Nets as an alternative to back propagation.
A DBN is similar in structure to an MLP (Multi-layer perceptron), but very different when it comes to training. It is the training that enables DBNs to outperform their shallow counterparts.
A DBN can be visualized as a stack of RBMs where the hidden layer of one RBM is the visible layer of the RBM above it.
The first RBM is trained to reconstruct its input as accurately as possible.
The hidden layer of the first RBM is taken as the visible layer of the second RBM and the second RBM is trained using the
outputs from the first RBM. This process is iterated till every layer in the network is trained.
In a DBN, each RBM learns the entire input. A DBN works globally, fine-tuning the entire input in succession as the model slowly improves, like a camera lens slowly focusing on a picture. A stack of RBMs outperforms a single RBM, just as a multi-layer perceptron (MLP) outperforms a single perceptron.
At this stage, the RBMs have detected inherent patterns in the data, but without any names or labels. To finish training the DBN, we have to introduce labels for the patterns and fine-tune the net with supervised learning.
We need a very small set of labelled samples so that the features and patterns can be associated with a name. This small-
labelled set of data is used for training. This set of labelled data can be very small when compared to the original data set.
The weights and biases are altered slightly, resulting in a small change in the net's perception of the patterns and often a
small increase in the total accuracy.
The training can also be completed in a reasonable amount of time by using GPUs giving very accurate results as compared
to shallow nets and we see a solution to vanishing gradient problem too.
Generative adversarial networks are deep neural nets comprising two nets, pitted one against the other, thus the “adversarial”
name.
GANs were introduced in a paper published by researchers at the University of Montreal in 2014. Facebook’s AI expert Yann
LeCun, referring to GANs, called adversarial training “the most interesting idea in the last 10 years in ML.”
GANs’ potential is huge, as the networks can learn to mimic any distribution of data. GANs can be taught to create parallel worlds strikingly similar to our own in any domain: images, music, speech, prose. They are robot artists in a way, and their output is quite impressive.
In a GAN, one neural network, known as the generator, generates new data instances, while the other, the discriminator,
evaluates them for authenticity.
Let us say we are trying to generate hand-written numerals like those found in the MNIST dataset, which is taken from the real world. The job of the discriminator, when shown an instance from the true MNIST dataset, is to recognize it as authentic.
The generator network takes input in the form of random numbers and returns an image.
This generated image is given as input to the discriminator network along with a stream of images taken from the
actual dataset.
The discriminator takes in both real and fake images and returns probabilities, a number between 0 and 1, with 1
representing a prediction of authenticity and 0 representing fake.
The discriminator is in a feedback loop with the ground truth of the images, which we know.
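A minimal Keras sketch of the two networks, assuming 28x28 MNIST-like images and a 100-dimensional noise input (both are illustrative assumptions; the adversarial training loop that pits the nets against each other is omitted):
# Generator and discriminator skeleton for MNIST-like images (illustrative)
from keras.models import Sequential
from keras.layers import Dense, Reshape, Flatten

generator = Sequential([
    Dense(128, activation='relu', input_dim=100),   # 100-dim random noise in
    Dense(784, activation='tanh'),                  # 28*28 = 784 pixel values out
    Reshape((28, 28))                               # returned as an image
])
discriminator = Sequential([
    Flatten(input_shape=(28, 28)),                  # real or generated image in
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')                  # probability of authenticity: 1 real, 0 fake
])
discriminator.compile(optimizer='adam', loss='binary_crossentropy')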
RNNs are neural networks in which data can flow in any direction. These networks are used for applications such as language modelling and Natural Language Processing (NLP).
The basic concept underlying RNNs is to utilize sequential information. In a normal neural network it is assumed that all inputs and outputs are independent of each other. But if we want to predict the next word in a sentence, we have to know which words came before it.
RNNs are called recurrent as they repeat the same task for every element of a sequence, with the output being based on the
previous computations. RNNs thus can be said to have a “memory” that captures information about what has been
previously calculated. In theory, RNNs can use information in very long sequences, but in reality, they can look back only a
few steps.
Long short-term memory networks (LSTMs) are the most commonly used RNNs.
Together with Convolutional Neural Networks, RNNs have been used as part of models that generate descriptions for unlabelled images. It is quite amazing how well this seems to work.
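As a sketch, a Keras LSTM for a sequence task such as sentiment classification might look as follows (the vocabulary size, sequence length and layer width are assumed values):
# Minimal LSTM for sequence classification (illustrative)
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),  # 10k-word vocabulary
    LSTM(64),                        # the recurrent "memory" over the sequence
    Dense(1, activation='sigmoid')   # e.g. sentiment: positive vs negative
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])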
If we increase the number of layers in a neural network to make it deeper, it increases the complexity of the network and allows us to model more complicated functions. However, the number of weights and biases increases rapidly, and learning such difficult problems can become impossible for normal neural networks. This leads to a solution: convolutional neural networks.
CNNs are extensively used in computer vision and have also been applied in acoustic modelling for automatic speech recognition.
The idea behind convolutional neural networks is that of a “moving filter” which passes over the image. This moving filter, or convolution, applies to a certain neighbourhood of nodes (which may, for example, be pixels), where the filter applied might be, say, 0.5 times each node value.
Noted researcher Yann LeCun pioneered convolutional neural networks. Facebook’s facial recognition software uses these nets. CNNs have been the go-to solution for machine vision projects. There are many layers in a convolutional network. In the ImageNet challenge, a machine was able to beat a human at object recognition in 2015.
In a nutshell, Convolutional Neural Networks (CNNs) are multi-layer neural networks, sometimes with up to 17 or more layers, that assume the input data to be images.
CNNs drastically reduce the number of parameters that need to be tuned. So, CNNs efficiently handle the high
dimensionality of raw images.
There are certain practices in Deep Learning that are highly recommended in order to efficiently train Deep Neural Networks. In this post, I will be covering a few of the most commonly used practices, ranging from the importance of quality training data and the choice of hyperparameters to more general tips for faster prototyping of DNNs. Most of these practices are validated by research in academia and industry, and are presented with mathematical and experimental proofs in papers like Efficient BackProp (Yann LeCun et al.) and Practical Recommendations for Deep Architectures (Yoshua Bengio).
As you’ll notice, I haven’t mentioned any mathematical proofs in this post. All the points suggested here should be taken as a summarization of the best practices for training DNNs. For a more in-depth understanding, I highly recommend going through the above-mentioned papers and the references provided at the end.
Training data
A lot of ML practitioners are in the habit of throwing raw training data at any Deep Neural Net (DNN). And why not? Any DNN would (presumably) still give good results, right? But it’s not completely old school to say that “given the right type of data, a fairly simple model will provide better and faster results than a complex DNN” (although this might have exceptions). So, whether you are working with Computer Vision, Natural Language Processing, Statistical Modelling, etc., try to preprocess your raw data. A few measures one can take to get better training data:
Get your hands on as large a dataset as possible (DNNs are quite data-hungry: more is better)
Remove any training sample with corrupted data (short texts, highly distorted images, spurious output labels, features with lots of null values, etc.)
Data Augmentation - create new examples (in the case of images - rescale, add noise, etc.), as sketched below
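For images, a hedged sketch of such augmentation using Keras’ ImageDataGenerator (the transformation ranges are illustrative choices):
# Create new training examples by perturbing existing images (illustrative)
from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,       # normalize pixel values
    rotation_range=15,       # random rotations up to 15 degrees
    width_shift_range=0.1,   # random horizontal shifts
    zoom_range=0.1,          # random zooms
    horizontal_flip=True     # mirror images (task permitting)
)
# flow() yields an endless stream of randomly augmented batches:
# batches = augmenter.flow(x_train, y_train, batch_size=32)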
One of the vital components of any Neural Net is the activation function. Activations introduce the much desired non-linearity into the model. For years, sigmoid activation functions have been the preferred choice. But a sigmoid function is inherently cursed by two drawbacks: 1. saturation of sigmoids at the tails (further causing the vanishing gradient problem); 2. sigmoids are not zero-centered.
A better alternative is the tanh function - mathematically, tanh is just a rescaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1. Although tanh can still suffer from the vanishing gradient problem, the good news is that tanh is zero-centered. Hence, using tanh as the activation function generally results in faster convergence. I have found that tanh activations generally work better than sigmoid.
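A quick NumPy check of this identity:
# Verify tanh(x) = 2*sigmoid(2x) - 1 numerically
import numpy as np

x = np.linspace(-3, 3, 7)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True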
You can further explore other alternatives like ReLU, SoftSign, etc., depending on the specific task; these have been shown to ameliorate some of these issues.
Also, while employing unsupervised pre-trained representations (described in later sections), the optimal number of hidden units is generally kept even larger, since pre-trained representations might contain a lot of information that is irrelevant to the specific supervised task. By increasing the number of hidden units, the model will have the flexibility needed to filter out the most appropriate information from these pre-trained representations.
Selecting the optimal number of layers is relatively straightforward. As @Yoshua-Bengio mentioned on Quora - “You just keep on adding layers, until the test error doesn’t improve anymore”. ;)
Weight Initialization
Always initialize the weights with small random numbers to break the symmetry between different units. But how small should the weights be? What’s the recommended upper limit? Which probability distribution should be used to generate the random numbers? Furthermore, when using sigmoid activation functions, if the weights are initialized to very large numbers, the sigmoid will saturate (tail regions), resulting in dead neurons. If the weights are very small, the gradients will also be small. Therefore, it’s preferable to choose weights in an intermediate range, such that they are distributed evenly around a mean value.
Thankfully, there has been a lot of research regarding appropriate initial weight values, which is really important for efficient convergence. For weights that are evenly distributed, a uniform distribution is probably one of the best choices. Furthermore, as shown in the paper (Glorot and Bengio, 2010), units with more incoming connections (fan_in) should have relatively smaller weights.
Thanks to all these thorough experiments, we now have a tested formula that we can use directly for weight initialization: weights drawn from ~ Uniform(-r, r), where r = sqrt(6/(fan_in + fan_out)) for tanh activations and r = 4*sqrt(6/(fan_in + fan_out)) for sigmoid activations, where fan_in is the size of the previous layer and fan_out is the size of the next layer.
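A NumPy sketch of this initialization for a single layer (the fan_in/fan_out sizes are illustrative):
# Glorot/Xavier uniform initialization for a tanh layer (illustrative)
import numpy as np

fan_in, fan_out = 256, 128                      # sizes of previous and next layer
r = np.sqrt(6.0 / (fan_in + fan_out))           # recommended bound for tanh
W = np.random.uniform(-r, r, size=(fan_in, fan_out))
# For sigmoid activations, scale the bound: r = 4 * np.sqrt(6.0 / (fan_in + fan_out))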
Learning Rates
This is probably one of the most important hyperparameters governing the learning process. Set the learning rate too small and your model might take ages to converge; make it too large and, within the first few training examples, your loss might shoot up to the sky. Generally, a learning rate of 0.01 is a safe bet, but this shouldn’t be taken as a stringent rule, since the optimal learning rate depends on the specific task.
In contrast to a fixed learning rate, gradually decreasing the learning rate after each epoch or after a few thousand examples is another option. Although this might help in faster training, it requires another manual decision about the new learning rates. Generally, the learning rate can be halved after each epoch - these kinds of strategies were quite common a few years back.
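A sketch of that halving schedule; the initial rate of 0.01 is just the safe default mentioned earlier:
# Halve the learning rate after each epoch (illustrative schedule)
def lr_schedule(epoch, initial_lr=0.01):
    return initial_lr * (0.5 ** epoch)
# In Keras, for example, such a function can be passed to the
# LearningRateScheduler callback supplied to model.fit.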
Fortunately, we now have better momentum-based methods that adapt the learning rate based on the curvature of the error function. It can also help to set different learning rates for individual parameters in the model, since some parameters may be learning at a relatively slower or faster rate.
Lately, there has been a good amount of research on optimization methods, resulting in adaptive learning rates. At this moment, we have numerous options, from the good old Momentum Method to Adagrad, Adam (personal favourite ;)), RMSProp, etc. Methods like Adagrad or Adam effectively save us from manually choosing an initial learning rate, and given the right amount of time the model will start to converge quite smoothly (of course, selecting a good initial rate will still help).
Grid Search has been prevalent in classical machine learning, but it is not at all efficient at finding optimal hyperparameters for DNNs, primarily because of the time a DNN takes to try out each hyperparameter combination. As the number of hyperparameters keeps increasing, the computation required for Grid Search grows exponentially.
1. Based on your prior experience, you can manually tune some common hyperparameters like learning rate, number
of layers, etc.
2. Instead of Grid Search, use Random Search/Random Sampling for choosing optimal hyperparameters. A combination of hyperparameters is generally chosen from a uniform distribution within the desired range. It is also possible to add prior knowledge to further shrink the search space (e.g. the learning rate shouldn’t be too large or too small). Random Search has been found to be far more efficient than Grid Search; a sketch follows below.
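A minimal sketch of random sampling with a log-uniform prior on the learning rate (the ranges and hyperparameter names are illustrative):
# Random search: sample hyperparameter combinations instead of gridding them (illustrative)
import numpy as np

trials = []
for _ in range(20):
    config = {
        'learning_rate': 10 ** np.random.uniform(-4, -1),   # log-uniform prior: not too large or small
        'num_layers': int(np.random.randint(2, 6)),
        'hidden_units': int(np.random.choice([64, 128, 256, 512])),
    }
    trials.append(config)   # train and score a model per config, keep the best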
Learning Methods
Good old Stochastic Gradient Descent might not be as efficient for DNNs (again, not a stringent rule); lately there has been a lot of research into more flexible optimization algorithms, e.g. Adagrad, Adam, AdaDelta, RMSProp, etc. In addition to providing adaptive learning rates, these sophisticated methods use different rates for different model parameters, which generally results in smoother convergence. It’s good to consider these as hyperparameters, and one should always try out a few of them on a subset of the training data.
Even when dealing with state-of-the-art Deep Learning models on the latest hardware, memory management is still done at the byte level. So it’s always good to keep the sizes of your parameters as 64, 128, 512, 1024 (all powers of 2). This might help in sharding the matrices, weights, etc., resulting in a slight boost in learning efficiency. This becomes even more significant when dealing with GPUs.
Unsupervised Pretraining
It doesn’t matter whether you are working with NLP, Computer Vision, Speech Recognition, etc.: unsupervised pretraining always helps the training of your supervised or other unsupervised models. Word Vectors in NLP are ubiquitous; you can use the ImageNet dataset to pretrain your model in an unsupervised manner for a 2-class supervised classification, or use audio samples from a much larger domain to inform a speaker disambiguation model.
The major objective of training a model is to learn appropriate parameters that result in an optimal mapping from inputs to outputs. These parameters are tuned with each training sample, irrespective of whether you use batch, mini-batch or stochastic learning. When employing a stochastic learning approach, the weight gradients are tuned after each training sample, introducing noise into the gradients (hence the word ‘stochastic’). This has a very desirable effect: with the introduction of noise during training, the model becomes less prone to overfitting.
However, the stochastic learning approach may be relatively inefficient: machines nowadays have far more computation power, and stochastic learning might effectively waste a large portion of it. If we are capable of computing Matrix-Matrix multiplications, why should we limit ourselves to iterating through multiplications of individual pairs of vectors? Therefore, for greater throughput/faster learning, it’s recommended to use mini-batches instead of stochastic learning.
But selecting an appropriate batch size is equally important, so that we still retain some noise (by not using a huge batch) while using the computation power of machines more effectively. Commonly, a batch of 16 to 128 examples (a power of 2) is a good choice. Usually, the batch size is selected once you have already found the more important hyperparameters (by manual or random search). Nevertheless, there are scenarios where the model receives training data as a stream (online learning); in those cases, resorting to stochastic learning is a good option.
This comes from Information Theory - “Learning that an unlikely event has occurred is more informative than learning that a likely event has occurred”. Similarly, randomizing the order of training examples (across epochs or mini-batches) results in faster convergence. A slight boost is always noticed when the model doesn’t see many examples in the same order.
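A minimal NumPy sketch combining both ideas: reshuffle the training order each epoch and iterate in mini-batches (the batch size of 64 is an illustrative choice):
# Shuffle the training order every epoch, then iterate in mini-batches (illustrative)
import numpy as np

def minibatches(X, y, batch_size=64):
    idx = np.random.permutation(len(X))          # a new random order per call/epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]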
Considering the millions of parameters to be learned, regularization becomes an imperative requisite to prevent overfitting in DNNs. You can keep using L1/L2 regularization as well, but Dropout is preferable for checking overfitting in DNNs. Dropout is trivial to implement and generally results in faster learning. A default value of 0.5 is a good choice, although this depends on the specific task; if the model is less complex, a dropout of 0.2 might suffice.
Dropout should be turned off during the test phase, and the weights scaled accordingly, as done in the original paper. Just allow a model with Dropout regularization a little more training time, and the error will surely go down.
“Training a Deep Learning Model for multiple epochs will result in a better model” - we have heard it a couple of times, but how do we quantify “multiple”? Turns out, there is a simple strategy for this: just keep training your model for a fixed amount of examples/epochs, let’s say 20,000 examples or 1 epoch. After each set of these examples, compare the test error with the train error; if the gap is decreasing, then keep on training. In addition, after each such set, save a copy of your model parameters (so that you can choose from multiple models once training is finished), as sketched below.
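In Keras, for example, saving a copy per epoch can be sketched with the ModelCheckpoint callback (the filename pattern is an illustrative choice):
# Save a copy of the model parameters after each epoch (illustrative)
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('model-{epoch:02d}.h5',   # one file per saved epoch
                             monitor='val_loss',
                             save_best_only=False)     # keep multiple candidates to choose from
# model.fit(x_train, y_train, validation_split=0.2, epochs=20, callbacks=[checkpoint])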
Visualize
There are a thousand ways in which the training of a deep learning model might go wrong. I guess we have all been there: the model trains for hours or days, and only after training is finished do we realize something went wrong. To save yourself from bouts of hysteria in such situations (which might be quite justified ;)), always visualize the training process. The most obvious step you can take is to print/save logs of loss values, train error, test error, etc.
In addition to this, another good practice is to use a visualization library to plot histograms of the weights after every few training examples or between epochs. This might help in keeping track of some of the common problems in Deep Learning models, like Vanishing Gradients and Exploding Gradients.
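A minimal sketch of plotting the recorded loss curves with matplotlib, assuming a Keras History object is available:
# Plot train/validation loss recorded during training (illustrative)
import matplotlib.pyplot as plt

# history = model.fit(...)  # Keras returns a History object
def plot_losses(history):
    plt.plot(history.history['loss'], label='train loss')
    plt.plot(history.history['val_loss'], label='validation loss')
    plt.xlabel('epoch'); plt.ylabel('loss'); plt.legend(); plt.show()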
The advent of GPUs, libraries that provide vectorized operations, and machines with more computation power are probably some of the most significant factors in the success of Deep Learning. If you think you are as patient as a stone, you might try running a DNN on your laptop (which can’t even open 10 tabs in your Chrome browser) and wait ages for your results. Or you can play smart (and expensively :z) and get decent hardware with at least multiple CPU cores and a few hundred GPU cores. GPUs have revolutionized Deep Learning research (no wonder Nvidia’s stock is shooting up ;)), primarily because of their ability to perform Matrix Operations at a larger scale.
So, instead of taking weeks on a normal machine, these parallelization techniques bring the training time down to days, if not hours.
Thankfully, for rapid prototyping we have some really decent libraries like Theano, Tensorflow, Keras, etc. Almost all of these DL libraries provide support for GPU computation and Automatic Differentiation. So you don’t have to dive into core GPU programming (unless you want to - it’s definitely fun :)); nor do you have to write your own differentiation code, which might get a little taxing for really complex models (although you should be able to do that if required). Tensorflow additionally supports training your models on a distributed architecture (if you can afford it).
This is not at all an exhaustive list of practices for training a DNN. In order to include just the most common ones, I have excluded a few concepts like Normalization of inputs, Batch/Layer Normalization, Gradient Checking, etc. Feel free to add anything in the comment section and I’ll be more than happy to update the post. :)
Hyperparameter tuning
A Machine Learning model is defined as a mathematical model with several parameters that need to be learned from the
data. By training a model with existing data, we can fit the model parameters.
However, there is another kind of parameter, known as Hyperparameters, that cannot be directly learned from the regular
training process. They are usually fixed before the actual training process begins. These parameters express important
properties of the model such as its complexity or how fast it should learn. This article aims to explore various strategies to
tune hyperparameters for Machine learning models.
Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a machine learning model’s hyperparameters.
Hyperparameters are settings that control the learning process of the model, such as the learning rate, the number of neurons
in a neural network, or the kernel size in a support vector machine. The goal of hyperparameter tuning is to find the values
that lead to the best performance on a given task.
In the context of machine learning, hyperparameters are configuration variables that are set before the training process of a
model begins. They control the learning process itself, rather than being learned from the data. Hyperparameters are often
used to tune the performance of a model, and they can have a significant impact on the model’s accuracy, generalization, and
other metrics.
Unlike model parameters - the weights and biases that are learned from the data - hyperparameters are configuration variables set before training. There are several different types of hyperparameters:
Hyperparameters in Neural Networks
Neural networks have several essential hyperparameters that need to be adjusted, including:
Learning rate: This hyperparameter controls the step size taken by the optimizer during each iteration of
training. Too small a learning rate can result in slow convergence, while too large a learning rate can lead to
instability and divergence.
Epochs: This hyperparameter represents the number of times the entire training dataset is passed through the
model during training. Increasing the number of epochs can improve the model’s performance but may lead to
overfitting if not done carefully.
Number of layers: This hyperparameter determines the depth of the model, which can have a significant impact
on its complexity and learning ability.
Number of nodes per layer: This hyperparameter determines the width of the model, influencing its capacity to
represent complex relationships in the data.
Architecture: This hyperparameter determines the overall structure of the neural network, including the number of
layers, the number of neurons per layer, and the connections between layers. The optimal architecture depends on
the complexity of the task and the size of the dataset
Activation function: This hyperparameter introduces non-linearity into the model, allowing it to learn complex
decision boundaries. Common activation functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).
Hyperparameters in Support Vector Machines
C: The regularization parameter that controls the trade-off between the margin and the number of training errors. A
larger value of C penalizes training errors more heavily, resulting in a smaller margin but potentially better
generalization performance. A smaller value of C allows for more training errors but may lead to overfitting.
Kernel: The kernel function that defines the similarity between data points. Different kernels can capture different
relationships between data points, and the choice of kernel can significantly impact the performance of the SVM.
Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
Gamma: The parameter that controls the influence of support vectors on the decision boundary. A larger value of
gamma indicates that nearby support vectors have a stronger influence, while a smaller value indicates that distant
support vectors have a weaker influence. The choice of gamma is particularly important for RBF kernels.
Hyperparameters in XGBoost
learning_rate: This hyperparameter determines the step size taken by the optimizer during each iteration of
training. A larger learning rate can lead to faster convergence, but it may also increase the risk of overfitting. A
smaller learning rate may result in slower convergence but can help prevent overfitting.
n_estimators: This hyperparameter determines the number of boosting trees to be trained. A larger number of
trees can improve the model’s accuracy, but it can also increase the risk of overfitting. A smaller number of trees
may result in lower accuracy but can help prevent overfitting.
max_depth: This hyperparameter determines the maximum depth of each tree in the ensemble. A larger
max_depth can allow the trees to capture more complex relationships in the data, but it can also increase the risk of
overfitting. A smaller max_depth may result in less complex trees but can help prevent overfitting.
min_child_weight: This hyperparameter determines the minimum sum of instance weight (hessian) needed in a
child node. A larger min_child_weight can help prevent overfitting by requiring more data to influence the splitting
of trees. A smaller min_child_weight may allow for more aggressive tree splitting but can increase the risk of
overfitting.
subsample: This hyperparameter determines the percentage of rows used for each tree construction. A smaller
subsample can improve the efficiency of training but may reduce the model’s accuracy. A larger subsample can
increase the accuracy but may make training more computationally expensive.
Models can have many hyperparameters and finding the best combination of parameters can be treated as a search problem.
The most widely used strategies for hyperparameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV
3. Bayesian Optimization
1. GridSearchCV
Grid search can be considered a “brute force” approach to hyperparameter optimization: we create a grid of potential discrete hyperparameter values and fit the model using every possible combination. We log each combination’s model performance and then choose the one that produces the best results. This approach is called GridSearchCV because it searches for the best set of hyperparameters across a grid of hyperparameter values, using cross-validation.
Grid search is an exhaustive approach that can identify the ideal hyperparameter combination, but its slowness is a disadvantage: fitting the model with every potential combination often takes a lot of processing power and time, which might not be available.
For example, suppose we want to set two hyperparameters, C and Alpha, of a Logistic Regression Classifier model, each with a different set of values. The grid search technique will construct many versions of the model with all possible combinations of hyperparameters and return the best one.
As in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the performance score for the combination C=0.3 and Alpha=0.2 comes out highest at 0.726, so it is selected; a scikit-learn sketch follows below.
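A sketch of this kind of search with scikit-learn over the C grid above (LogisticRegression exposes C directly; the breast-cancer dataset here is an illustrative stand-in):
# Exhaustive grid search over C for a Logistic Regression classifier (illustrative)
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {'C': [0.1, 0.2, 0.3, 0.4, 0.5]}           # grid of candidate values
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)          # best combination and its CV score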
2. RandomizedSearchCV
As the name suggests, the random search method selects values at random, as opposed to the grid search method’s use of a predetermined set of values. On every iteration, random search tries a different set of hyperparameters and logs the model’s performance. After several iterations, it returns the combination that provided the best outcome. This approach reduces unnecessary computation.
RandomizedSearchCV addresses the drawbacks of GridSearchCV, as it goes through only a fixed number of hyperparameter settings. It moves within the grid in a random fashion to find the best set of hyperparameters. The advantage is that, in most cases, a random search will produce a comparable result faster than a grid search; a sketch follows below.
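A matching scikit-learn sketch with a fixed budget of ten sampled settings (the C range and dataset are illustrative):
# Randomized search: a fixed budget of sampled combinations (illustrative)
from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_dist = {'C': uniform(0.01, 10)}                   # sample C uniformly from [0.01, 10.01]
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=10, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)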
3. Bayesian Optimization
Grid search and random search are often inefficient because they evaluate many unsuitable hyperparameter combinations
without considering the previous iterations’ results. Bayesian optimization, on the other hand, treats the search for optimal
hyperparameters as an optimization problem. It considers the previous evaluation results when selecting the next
hyperparameter combination and applies a probabilistic function to choose the combination that will likely yield the best
results. This method discovers a good hyperparameter combination in relatively few iterations.
Data scientists use a probabilistic model when the objective function is unknown. The probabilistic model estimates the
probability of a hyperparameter combination’s objective function result based on past evaluation results.
P(score(y)|hyperparameters(x))
It is a “surrogate” of the objective function, which can be the root-mean-square error (RMSE), for example. The objective
function is calculated using the training data with the hyperparameter combination, and we try to optimize it (maximize or
minimize, depending on the objective function selected).
Applying the probabilistic model to the hyperparameters is computationally inexpensive compared to the objective function.
Therefore, this method typically updates and improves the surrogate probability model every time the objective function
runs. Better hyperparameter predictions decrease the number of objective function evaluations needed to achieve a good
result. Gaussian processes, random forest regression, and tree-structured Parzen estimators (TPE) are examples of surrogate
models.
The Bayesian optimization model is complex to implement, but off-the-shelf libraries like Ray Tune can simplify the
process. It’s worth using this type of model because it finds an adequate hyperparameter combination in relatively few
iterations. However, compared to grid search or random search, we must compute Bayesian optimization sequentially, so it
doesn’t allow distributed processing. Therefore, Bayesian optimization takes longer yet uses fewer computational resources.
Despite its benefits, hyperparameter tuning has some practical drawbacks:
Computational cost
Time-consuming process
Risk of overfitting
Requires expertise
Neural network hyperparameters are like settings you choose before teaching a neural network to do a task. They control things like how many layers the network has, how quickly it learns, and how it adjusts its internal values. Picking the right hyperparameters in deep learning is important to help the network learn effectively and solve the task accurately. It’s a bit like adjusting the knobs on a machine to make it work just right for a particular job.
A Neural Network is a Deep Learning technique for building a model from training data to predict unseen data, using many layers consisting of neurons. This is similar to other Machine Learning algorithms, except for the use of multiple layers. The use of multiple layers is what makes it Deep Learning.
Instead of directly building a Machine Learning model in one line, a Neural Network requires users to build the architecture before compiling it into a model. Users have to arrange how many layers, and how many nodes or neurons per layer, to build. This is not found in other conventional Machine Learning algorithms.
It is easy to find tutorials on neural networks on the internet, and there are many blogs explaining the concept behind a neural network. Code to perform hyperparameter tuning on a neural network can also be found in many articles and shared notebooks. But I feel it is quite rare to find a guide to neural network hyperparameter tuning using Bayesian Optimization; the articles I found mostly rely on GridSearchCV or RandomizedSearchCV. Meanwhile, a neural network has many hyperparameters to tune, and Bayesian optimization is more efficient in time and memory when tuning many hyperparameters. I have described the reason in my past article.
Different datasets require different sets of hyperparameters to predict accurately, but the large number of hyperparameters makes it difficult for users to decide which ones to choose. There is no universal answer to how many layers are most suitable, how many neurons are best, or which optimizer suits all datasets. Hyperparameter tuning is important to find the best possible set of hyperparameters for building a model from a specific dataset.
In this article, I will demonstrate how to tune two things in a Neural Network: (1) the hyperparameters and (2) the layers. Tutorials on the latter are harder to find than on the former. The first is the same as for other conventional Machine Learning algorithms; the hyperparameters to tune are the number of neurons, activation function, optimizer, learning rate, batch size, and epochs. The second step is to tune the number of layers, which other conventional algorithms do not have. Different numbers of layers affect the accuracy: too few layers may give an underfitting result, while too many layers may make the model overfit.
For the hyperparameter-tuning demonstration, I use a dataset provided by Kaggle. I build a simple Multilayer Perceptron (MLP) neural network to do a binary classification task with prediction probability. The package used in Python is Keras, built on top of Tensorflow. The dataset has an input dimension of 10. There are two hidden layers, followed by one output layer. The accuracy metric is the accuracy score. An EarlyStopping callback is used to stop the learning process if there is no accuracy improvement in 20 epochs. Below is the illustration.
Fig. 1 MLP Neural Network to build. Source: created by myself
When delving into the optimization of neural network hyperparameters, the initial focus lies on tuning the number of neurons in each hidden layer. Currently, all layers share the same number of neurons, but customization is possible. It’s crucial to adapt the number of neurons to the complexity of the solution: tasks with higher complexity demand more neurons. Here, the number of neurons is allowed to range from 10 to 100, offering flexibility in fine-tuning the network to suit varying solution complexities.
An activation function is a parameter in each layer. Input data are fed to the input layer, followed by hidden layers, and the
final output layer. The output layer contains the output value. The input values moving from a layer to another layer keep
changing according to the activation function. The activation function decides how to compute the input values of a layer
into output values. The output values of a layer are then passed to the next layer as input values again. The next layer then
computes the values into output values for another layer, and so on. There are 9 activation functions to tune in this demonstration. Each activation function has its own formula (and graph) for computing the input values; these will not be discussed in this article.
The layers of a neural network are compiled with an assigned optimizer. The optimizer is responsible for adjusting the learning rate and the weights of the neurons in the neural network to reach the minimum loss function. The optimizer is very important for achieving the highest possible accuracy or minimum loss. There are 7 optimizers to choose from, each with a different concept behind it.
One of the hyperparameters inside the optimizer is the learning rate, and we will tune it as well. The learning rate controls the step size the model takes towards the minimum loss function. A higher learning rate makes the model learn faster, but it may miss the minimum loss function and only reach its surroundings. A lower learning rate gives a better chance of finding the minimum loss function; as a tradeoff, it needs more epochs, i.e. more time and memory resources.
When dealing with large training datasets, building a model can be time-consuming. To expedite the learning process, we
can optimize hyperparameters in neural networks, such as the batch size. By assigning a batch size, not all training data are
fed to the model simultaneously. For instance, with a dataset of 77,500 observations and a batch size of 1000, the model
undergoes 77 iterations with 1000 training data sub-samples and a final iteration with the remaining 500 sub-samples. A
smaller batch size accelerates learning but may increase variance in validation dataset accuracy. Conversely, a larger batch
size slows learning while stabilizing validation dataset accuracy variance.
The number of times the complete dataset passes through the neural network model is referred to as an epoch. Essentially, one epoch involves the training dataset moving forward and backward through the neural network once. If the number of epochs is too small, it may result in underfitting, indicating that the neural network hasn’t learned sufficiently; multiple passes, or epochs, are necessary for effective learning. Conversely, excessive epochs can lead to overfitting, where the model excels at predicting existing data but struggles with new, unseen data. Tuning the number of epochs is crucial for optimal results. In this demonstration, we aim to find the ideal number of epochs within the range of 20 to 100.
Below is the code to tune the hyperparameters of the neural network described above using Bayesian Optimization. The tuning searches for the optimum hyperparameters based on 5-fold cross-validation. The following code imports the packages needed for Neural Network modeling.
# Import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, accuracy_score
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, LeakyReLU
from keras.optimizers import Adam, SGD, RMSprop, Adadelta, Adagrad, Adamax, Nadam, Ftrl
from keras.callbacks import EarlyStopping
from keras.wrappers.scikit_learn import KerasClassifier
from bayes_opt import BayesianOptimization
import warnings
warnings.filterwarnings('ignore')
LeakyReLU = LeakyReLU(alpha=0.1)
score_acc = make_scorer(accuracy_score)
This code loads the training dataset, then splits it further into a training set and a validation set. The validation set is 20% of the total data, and the split is stratified according to the target variable.
# Load dataset
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
train = train.dropna(axis=0)
train = pd.get_dummies(train)
X_train, X_val, y_train, y_val = train_test_split(train.drop('Survived', axis=1),
                                                  train['Survived'],
                                                  test_size=0.2, random_state=111,
                                                  stratify=train['Survived'])
The following code creates the objective function containing the Neural Network model. The function returns the score of the cross-validation.
# Create function
def nn_cl_bo(neurons, activation, optimizer, learning_rate, batch_size, epochs):
    optimizerL = ['Adam', 'SGD', 'RMSprop', 'Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'Ftrl']
    activationL = ['relu', 'sigmoid', 'softplus', 'softsign', 'tanh', 'selu',
                   'elu', 'exponential', LeakyReLU]
    optimizerD = {'Adam':Adam(lr=learning_rate), 'SGD':SGD(lr=learning_rate),
                  'RMSprop':RMSprop(lr=learning_rate), 'Adadelta':Adadelta(lr=learning_rate),
                  'Adagrad':Adagrad(lr=learning_rate), 'Adamax':Adamax(lr=learning_rate),
                  'Nadam':Nadam(lr=learning_rate), 'Ftrl':Ftrl(lr=learning_rate)}
    neurons = round(neurons)
    activation = activationL[round(activation)]
    optimizer = optimizerD[optimizerL[round(optimizer)]]
    batch_size = round(batch_size)
    epochs = round(epochs)
    def nn_cl_fun():
        nn = Sequential()
        nn.add(Dense(neurons, input_dim=10, activation=activation))
        nn.add(Dense(neurons, activation=activation))
        nn.add(Dense(1, activation='sigmoid'))
        nn.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
        return nn
    es = EarlyStopping(monitor='accuracy', mode='max', verbose=0, patience=20)
    nn = KerasClassifier(build_fn=nn_cl_fun, epochs=epochs, batch_size=batch_size,
                         verbose=0)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
    score = cross_val_score(nn, X_train, y_train, scoring=score_acc,
                            cv=kfold, fit_params={'callbacks':[es]}).mean()
    return score
The code below sets the ranges of the hyperparameters and runs the Bayesian Optimization.
# Set parameters
params_nn = {
    'neurons': (10, 100),
    'activation': (0, 9),
    'optimizer': (0, 7),
    'learning_rate': (0.01, 1),
    'batch_size': (200, 1000),
    'epochs': (20, 100)
}
# Run Bayesian Optimization
nn_bo = BayesianOptimization(nn_cl_bo, params_nn, random_state=111)
nn_bo.maximize(init_points=25, n_iter=4)
Output: (iteration log omitted)
activationL = ['relu', 'sigmoid', 'softplus', 'softsign', 'tanh', 'selu',
               'elu', 'exponential', LeakyReLU]
params_nn_ = nn_bo.max['params']
params_nn_['activation'] = activationL[round(params_nn_['activation'])]
params_nn_
Output:
{'activation': 'selu',
'batch_size': 851.0135336291902,
'epochs': 53.7054301919375,
'learning_rate': 0.037173480215022196,
'neurons': 50.872297884262295,
 ...}
In the code above, the neural network itself is built inside the inner function nn_cl_fun of the #Create function block. The network architecture is constructed before performing the cross-validation. This is different from conventional Machine Learning, which does not need to build an architecture like nn_cl_fun before performing the cross-validation.
The layers of a Neural Network are also hyperparameters that determine the result of the prediction model. A smaller number of layers is enough for a simpler problem, but a larger number of layers is needed to build a model for a more complicated problem. The number of layers can be tuned using a “for loop” iteration. This demonstration tunes the number of layers in two places; each is tuned between 1 and 3.
Inserting regularization layers in a neural network can help prevent overfitting. This demonstration also tunes whether to add regularization layers or not. There are two regularization layers to use here.
Batch normalization is placed after the first hidden layer. The batch normalization layer normalizes the values passed to it for every batch; this is similar to the standard scaler in conventional Machine Learning.
Another regularization layer is the Dropout layer. The dropout layer, as its name suggests, randomly drops a certain number of neurons in a layer; the dropped neurons are not used any more. The percentage of neurons to drop is set in the dropout rate. The following is the code to tune the hyperparameters and layers at the same time.
Fig. 3 Dropout layer illustration. Source: created by myself
The following code creates a function for tuning the Neural Network hyperparameters and layers.
# Create function
def nn_cl_bo2(neurons, activation, optimizer, learning_rate, batch_size, epochs,
              layers1, layers2, normalization, dropout, dropout_rate):
    optimizerL = ['Adam', 'SGD', 'RMSprop', 'Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'Ftrl']
    activationL = ['relu', 'sigmoid', 'softplus', 'softsign', 'tanh', 'selu',
                   'elu', 'exponential', LeakyReLU]
    optimizerD = {'Adam':Adam(lr=learning_rate), 'SGD':SGD(lr=learning_rate),
                  'RMSprop':RMSprop(lr=learning_rate), 'Adadelta':Adadelta(lr=learning_rate),
                  'Adagrad':Adagrad(lr=learning_rate), 'Adamax':Adamax(lr=learning_rate),
                  'Nadam':Nadam(lr=learning_rate), 'Ftrl':Ftrl(lr=learning_rate)}
    neurons = round(neurons)
    activation = activationL[round(activation)]
    optimizer = optimizerD[optimizerL[round(optimizer)]]
    batch_size = round(batch_size)
    epochs = round(epochs)
    layers1 = round(layers1)
    layers2 = round(layers2)
    def nn_cl_fun():
        nn = Sequential()
        nn.add(Dense(neurons, input_dim=10, activation=activation))
        if normalization > 0.5:
            nn.add(BatchNormalization())
        for i in range(layers1):
            nn.add(Dense(neurons, activation=activation))
        if dropout > 0.5:
            nn.add(Dropout(dropout_rate, seed=123))
        for i in range(layers2):
            nn.add(Dense(neurons, activation=activation))
        nn.add(Dense(1, activation='sigmoid'))
        nn.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
        return nn
    es = EarlyStopping(monitor='accuracy', mode='max', verbose=0, patience=20)
    nn = KerasClassifier(build_fn=nn_cl_fun, epochs=epochs, batch_size=batch_size, verbose=0)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
    score = cross_val_score(nn, X_train, y_train, scoring=score_acc,
                            cv=kfold, fit_params={'callbacks':[es]}).mean()
    return score
The following code searches for the optimum hyperparameters and layers for the Neural Network model.
params_nn2 = {
    'neurons': (10, 100),
    'activation': (0, 9),
    'optimizer': (0, 7),
    'learning_rate': (0.01, 1),
    'batch_size': (200, 1000),
    'epochs': (20, 100),
    'layers1': (1, 3),
    'layers2': (1, 3),
    'normalization': (0, 1),
    'dropout': (0, 1),
    'dropout_rate': (0, 0.3)
}
# Run Bayesian Optimization
nn_bo = BayesianOptimization(nn_cl_bo2, params_nn2, random_state=111)
nn_bo.maximize(init_points=25, n_iter=4)
Output:
| iter | target | activation | batch_size | dropout | dropout_rate | epochs | layers1 | layers2 | learning_rate | neurons | normalization | optimizer |
| 1 | 0.6293 | 5.51 | 335.3 | 0.4361 | 0.2308 | 43.63 | 1.298 | 1.045 | 0.426 | 31.48 | 0.3377 | 6.935 |
| 2 | 0.6502 | 2.14 | 265.0 | 0.6696 | 0.1864 | 41.94 | 1.932 | 1.237 | 0.08322 | 91.07 | 0.794 | 5.884 |
| 3 | 0.5719 | 7.337 | 992.8 | 0.5773 | 0.2441 | 53.71 | 1.055 | 1.908 | 0.1143 | 83.55 | 0.6977 | 3.957 |
| 4 | 0.5886 | 2.468 | 998.8 | 0.138 | 0.1846 | 58.8 | 1.81 | 2.456 | 0.3296 | 46.05 | 0.319 | 6.631 |
| 5 | 0.5719 | 8.268 | 851.1 | 0.03408 | 0.283 | 96.04 | 2.613 | 1.963 | 0.9671 | 47.53 | 0.3188 | 0.1151 |
| 6 | 0.768 | 0.3436 | 242.5 | 0.128 | 0.01001 | 38.11 | 2.088 | 1.357 | 0.1876 | 23.47 | 0.683 | 3.283 |
| 7 | 0.5719 | 6.914 | 735.1 | 0.4413 | 0.1786 | 56.93 | 2.927 | 1.296 | 0.9077 | 54.81 | 0.5925 | 4.793 |
| 8 | 0.767 | 1.597 | 891.7 | 0.4821 | 0.0208 | 49.18 | 1.723 | 1.944 | 0.1877 | 25.78 | 0.9491 | 4.59 |
| 9 | 0.5432 | 1.215 | 942.2 | 0.8418 | 0.01583 | 36.29 | 2.745 | 2.348 | 0.3043 | 76.1 | 0.6183 | 1.473 |
| 10 | 0.5719 | 7.219 | 247.3 | 0.3082 | 0.06221 | 97.78 | 2.819 | 2.353 | 0.124 | 96.22 | 0.09171 | 4.409 |
| 11 | 0.5892 | 8.126 | 471.8 | 0.6528 | 0.2775 | 49.92 | 2.543 | 2.792 | 0.624 | 23.6 | 0.3749 | 4.451 |
| 12 | 0.5719 | 4.132 | 625.8 | 0.3523 | 0.198 | 58.12 | 1.909 | 1.25 | 0.4183 | 34.58 | 0.3467 | 6.821 |
| 13 | 0.7683 | 1.94 | 746.3 | 0.03181 | 0.2506 | 76.13 | 2.932 | 2.184 | 0.2252 | 74.73 | 0.03087 | 2.931 |
| 14 | 0.5764 | 2.531 | 285.0 | 0.4263 | 0.2522 | 28.83 | 2.973 | 1.467 | 0.7242 | 69.48 | 0.07776 | 4.881 |
| 15 | 0.768 | 2.388 | 921.5 | 0.8183 | 0.1198 | 85.62 | 1.396 | 2.045 | 0.4184 | 93.33 | 0.8254 | 3.507 |
| 16 | 0.7684 | 1.051 | 209.3 | 0.9132 | 0.1537 | 87.45 | 1.19 | 2.607 | 0.07161 | 67.19 | 0.9688 | 2.782 |
| 17 | 0.5144 | 5.936 | 371.9 | 0.8899 | 0.296 | 79.09 | 2.283 | 1.504 | 0.4811 | 34.13 | 0.8683 | 1.868 |
| 18 | 0.5719 | 8.757 | 370.8 | 0.2978 | 0.221 | 21.03 | 1.06 | 2.468 | 0.5033 | 29.63 | 0.00893 | 5.955 |
| 19 | 0.7635 | 4.828 | 778.8 | 0.6616 | 0.2516 | 51.06 | 1.852 | 2.656 | 0.4743 | 83.8 | 0.01418 | 2.777 |
| 20 | 0.5144 | 1.155 | 294.5 | 0.206 | 0.2243 | 94.41 | 1.761 | 1.921 | 0.8746 | 83.31 | 0.02497 | 6.111 |
| 21 | 0.5442 | 5.441 | 613.2 | 0.5893 | 0.2399 | 33.86 | 1.374 | 1.516 | 0.06056 | 59.74 | 0.3518 | 6.419 |
| 22 | 0.767 | 4.289 | 283.6 | 0.1525 | 0.08206 | 82.52 | 1.786 | 2.598 | 0.4387 | 17.34 | 0.01064 | 3.016 |
| 23 | 0.7437 | 5.966 | 612.2 | 0.5801 | 0.1479 | 79.24 | 2.579 | 2.562 | 0.1363 | 94.61 | 0.8777 | 4.897 |
| 24 | 0.6826 | 8.432 | 739.0 | 0.5944 | 0.1035 | 26.69 | 2.159 | 1.035 | 0.5569 | 66.93 | 0.6784 | 1.194 |
| 25 | 0.576 | 5.194 | 364.8 | 0.2515 | 0.2908 | 91.73 | 1.246 | 2.762 | 0.9485 | 51.39 | 0.413 | 4.04 |
| 26 | 0.6123 | 0.8666 | 764.0 | 0.09547 | 0.2738 | 71.59 | 2.418 | 2.742 | 0.01 | 89.31 | 0.0 | 1.49 |
| 27 | 0.7422 | 6.366 | 780.2 | 0.6271 | 0.1646 | 53.26 | 1.954 | 2.228 | 0.6962 | 81.66 | 0.1557 | 2.563 |
| 28 | 0.5144 | 4.821 | 779.7 | 0.8649 | 0.1344 | 37.63 | 2.574 | 1.528 | 0.3698 | 79.91 | 0.7947 | 5.56 |
| 29 | 0.5719 | 0.509 | 920.4 | 0.6302 | 0.2337 | 83.36 | 2.121 | 2.895 | 0.9025 | 99.29 | 0.8399 | 6.796 |
params_nn_ = nn_bo.max['params']
learning_rate = params_nn_['learning_rate']
activationL = ['relu', 'sigmoid', 'softplus', 'softsign', 'tanh', 'selu',
               'elu', 'exponential', 'hard_sigmoid', 'linear']
params_nn_['activation'] = activationL[round(params_nn_['activation'])]
params_nn_['batch_size'] = round(params_nn_['batch_size'])
params_nn_['epochs'] = round(params_nn_['epochs'])
params_nn_['layers1'] = round(params_nn_['layers1'])
params_nn_['layers2'] = round(params_nn_['layers2'])
params_nn_['neurons'] = round(params_nn_['neurons'])
optimizerL = ['Adam', 'SGD', 'RMSprop', 'Adadelta', 'Adagrad', 'Adamax', 'Nadam', 'Ftrl']
optimizerD = {'Adam':Adam(lr=learning_rate), 'SGD':SGD(lr=learning_rate),
              'RMSprop':RMSprop(lr=learning_rate), 'Adadelta':Adadelta(lr=learning_rate),
              'Adagrad':Adagrad(lr=learning_rate), 'Adamax':Adamax(lr=learning_rate),
              'Nadam':Nadam(lr=learning_rate), 'Ftrl':Ftrl(lr=learning_rate)}
params_nn_['optimizer'] = optimizerD[optimizerL[round(params_nn_['optimizer'])]]
params_nn_
Output:
{'activation': 'sigmoid',
'batch_size': 209,
'dropout': 0.9131504384208619,
'dropout_rate': 0.15371924329624512,
'epochs': 87,
'layers1': 1,
'layers2': 3,
'learning_rate': 0.07160587078837888,
'neurons': 67,
'normalization': 0.9687811501818422,
'optimizer': ...}
The tuned model has 67 neurons in each hidden layer. There is a batch normalization after the first hidden layer, followed by one more hidden layer. Next, the Dropout layer drops 15% of the neurons before the values are passed to three more hidden layers. Finally, the output layer has one neuron containing the probability value. See Figure 4 for the illustration. Now that we have the optimal hyperparameters and layers, with an estimated accuracy of 0.7684, let's fit the model to the training dataset. Eventually, we get an accuracy of 0.7681 on the validation dataset. The notebook for this article is made available here.
def nn_cl_fun():
    nn = Sequential()
    # input_dim must match the number of features in the training data
    nn.add(Dense(params_nn_['neurons'], input_dim=X_train.shape[1],
                 activation=params_nn_['activation']))
    if params_nn_['normalization'] > 0.5:
        nn.add(BatchNormalization())
    for i in range(params_nn_['layers1']):
        nn.add(Dense(params_nn_['neurons'], activation=params_nn_['activation']))
    if params_nn_['dropout'] > 0.5:
        nn.add(Dropout(params_nn_['dropout_rate'], seed=123))
    for i in range(params_nn_['layers2']):
        nn.add(Dense(params_nn_['neurons'], activation=params_nn_['activation']))
    nn.add(Dense(1, activation='sigmoid'))
    nn.compile(loss='binary_crossentropy', optimizer=params_nn_['optimizer'],
               metrics=['accuracy'])
    return nn
# X_train, y_train, X_val, y_val come from the train/validation split prepared earlier
nn = KerasClassifier(build_fn=nn_cl_fun, epochs=params_nn_['epochs'],
                     batch_size=params_nn_['batch_size'], verbose=0)
nn.fit(X_train, y_train, validation_data=(X_val, y_val), verbose=0)
Output:
Epoch 1/87
Epoch 2/87
Epoch 3/87
...
Epoch 87/87
<tensorflow.python.keras.callbacks.History at 0x7fa67610c750>
Fig. 4 Illustration of the final model. Source: created by myself
Conclusion
In summary, delving into neural network hyperparameters is essential for deep learning success. By skillfully tuning
parameters, especially optimizing layers, one can elevate model performance significantly. Explore the nuances of
hyperparameter tuning to unlock the full potential of neural networks in the realm of deep learning.
Uncover the hidden layers inside neural networks and learn what happens in between the input and output, with specific
examples from convolutional, recurrent, and generative adversarial neural networks.
The hidden layers of neural networks are the brains behind artificial intelligence. Let’s start with an example: imagine you’re using a program on your phone to identify a plant. You are out walking one day when you see an
interesting specimen, so you snap a picture and upload it to the app. After a second or two, the program provides you with a
guess about what kind of plant you might be looking at.
From your perspective, you provided an input (the image) and received an output (the plant's name). However, a complex
process occurs to make this happen. Inside the neural network, operating between the input and output, lie the hidden layers that enable the neural network to function. It may seem like a simple process to you, the end user, but the data you put into
the algorithm can pass through hundreds of layers of neurons, depending on the depth of the neural network.
The number, types, and architecture of hidden layers inside a neural network are wildly different based on what function the
neural network performs. Discover more about the hidden layer of neural networks with this article, including convolutional,
recurrent, and generative adversarial neural networks.
The three layers of neural networks are the input layer, the hidden layer, and the output layer. As you may guess by the name,
the hidden layers aren’t something you interact with directly when using a neural network. You can, however,
control what goes into the algorithm (the input) and access what comes out of the algorithm (the output). Every layer in
between those two is a part of the hidden layers of a neural network.
If you tried to use a neural network without a hidden layer, your output would be nothing more than a repeat of your input.
As the input transfers from one hidden layer to the next, the neural network learns the patterns associated with the input,
providing the network—artificial intelligence (AI)—with a hierarchical representation of the input data. Eventually, this
process transforms the data into the output.
Although a simple neural network might contain one hidden layer, including additional layers allows the network to interact
with the input in more complex ways. Each hidden layer can look at a different aspect of the input or recognize more specific
features layer by layer, compounding this information from one layer to the next. This technique of stacking hidden layers inside a neural network is the basis for deep learning. Essentially, for an algorithm to be considered "deep learning," it must contain more than the minimal three layers: the input, one hidden layer, and the output. The more layers a neural network contains, the "deeper" the learning algorithm.
In short, neural networks don’t require more than one hidden layer. Computer scientists and artificial intelligence engineers have been building multilayer neural networks since soon after the perceptron was introduced in 1958.
Adding additional layers allows the neural network to make more complex calculations or interact with the data in more sophisticated ways. Stacking large numbers of hidden layers between the input and the output allows your neural network to perform deep learning tasks on the input. The number of layers in each neural network depends on the goals you wish to accomplish with the algorithm, but deep learning isn’t possible with a single hidden layer between the input and the output layers.
We’ve identified that hidden layers of neural networks are everything between the input and output layers. We also discussed
how additional layers make the artificial intelligence “deeper.” But what exactly do hidden layers do within the context of
neural network architecture?
Every hidden layer inside a neural network contains neurons known as nodes. Each node interacts with the data in one
simple way. For example, maybe it completes a calculation, detects the top-right corner of an image, or identifies that an
item is the color red. Each node in each layer connects to each node in the next layer, passing down what it discovered about
the data. Data flows through the input and then enters the first layer of nodes, which changes or reacts to the data and passes
the new information set to the next hidden layer.
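To make that flow concrete, here is a minimal sketch (an illustration, not code from any of the networks above) of two fully connected hidden layers passing information forward; the layer sizes are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Toy input: 4 features describing one data point
x = rng.normal(size=4)

# First hidden layer with 3 nodes: every input connects to every node
W1 = rng.normal(size=(3, 4))     # one weight per connection
b1 = np.zeros(3)
h1 = np.maximum(0, W1 @ x + b1)  # each node reacts to the data (ReLU)

# The new information set is passed on to the next hidden layer
W2 = rng.normal(size=(2, 3))
b2 = np.zeros(2)
h2 = np.maximum(0, W2 @ h1 + b2)
print(h2)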
Depending on your goals for the neural network, many layers of highly specialized nodes give the AI the ability to
understand complex relationships between data. Some of the most common types of neural networks are recurrent,
generative adversarial, and convolutional. Let’s take a closer look at the job of the hidden layers inside each of these neural
networks.
A recurrent neural network works specifically with time series data or sequential data. For example, recurrent neural
networks can assist you in making predictions about future events based on past data or in replicating human speech by
understanding how the word order in a sentence affects the meaning of the sentence. The hidden layers in a recurrent neural
network contain the same nodes described above but have the ability to remember the results of previous calculations. Using
those stored “memories,” the neural network can then predict the current input based on past results.
These stored memories are in a hidden state within the hidden layers. The hidden state carries a weight, and the input also
carries a weight. In other words, the hidden layer of a recurrent neural network considers both the input on its own and the
hidden state of past inputs and makes a new prediction using the experience and learning of the past.
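A hedged sketch of that idea, one recurrent step at a time; the dimensions and the tanh activation are common conventions rather than details taken from this article:

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 5, 8

W_x = rng.normal(size=(n_hidden, n_in))      # weight on the current input
W_h = rng.normal(size=(n_hidden, n_hidden))  # weight on the hidden state ("memory")
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                        # hidden state before any input is seen
for x_t in rng.normal(size=(3, n_in)):        # a toy sequence of 3 timesteps
    # The new hidden state considers both the current input and the stored past
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h)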
A convolutional neural network (CNN) is well-equipped for classification or computer vision tasks. For example, a
convolutional neural network can read medical imaging such as X-rays or MRIs to screen you for anomalies faster and more
accurately than medical professionals.
Convolutional neural networks have three main types of hidden layers: convolutional layers, pooling layers, and FC, or fully
connected layers.
Convolutional layers: Convolutional layers use feature extraction to perform the main “work” of the CNN. A
feature detector, or kernel, scans the input and creates a map of the features it finds. A convolutional neural
network can have many convolutional layers, with each layer adding a more nuanced understanding of the input.
Pooling layers: The pooling layer of a convolutional neural network simplifies the work of the convolutional
layers, losing data in the process but gaining a more efficient and less complex output. The convolutional layer
focuses on extracting all the features, while the pooling layer provides you with a less computationally challenging
data set.
Fully-connected layers: The fully-connected (FC) layer of the CNN is so named because every node in this layer
connects to every node in the previous layer. Essentially, the CNN process is threefold: Once the convolutional
layer extracts features from the data, this information passes to the pooling layer, where it is downsized; it then
transfers to the FC layer, which maps the information gained from the previous two layers, classifying it for output, as the sketch below illustrates.
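This threefold convolution, pooling, and fully connected process can be written down directly in Keras. The sketch is illustrative only: the filter counts, kernel sizes, and 28x28 grayscale input shape are assumptions, not values given above.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential([
    # Convolutional layers: kernels scan the input and build feature maps
    Conv2D(16, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    # Pooling layers: downsize the feature maps, losing detail but gaining efficiency
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    # Fully connected layers: map the extracted features to class scores
    Flatten(),
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),
])
cnn.summary()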
A generative adversarial network (GAN) is a tool that utilizes a competitive methodology for generating text or images.
Within the hidden layers of a GAN network live two distinct neural networks: the generator and the discriminator. The two
artificial minds are playing a game against one another to generate the output.
Both neural networks can access the training data or the data set you select to represent what the AI “knows.” The generator
attempts to create a fake item so convincing that the discriminator can’t differentiate between the forgery and the real thing.
The discriminator attempts to catch each forgery.
The output is created when the generator “wins” the game, and the generator accomplishes this by creating a unique item so
similar to those in the training set that the discriminator can’t tell it’s a fake. The algorithm can improve and learn over time
because the nature of the competition drives both teams to improve and innovate to gain an edge over their competitor.
The hidden layers of a generative adversarial network are trained using backpropagation, which feeds the error at the output back through the network to update its weights. In this way, the generator and the discriminator can play their game repeatedly until the generator succeeds in tricking the discriminator, providing you with a novel output.
The exact makeup of the hidden layers in a generative adversarial network depends on the exact model of the artificial
intelligence. For example, a deep convolutional generative adversarial network (DCGAN) uses convolutional layers to
accomplish the same goal slightly differently.
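A minimal sketch of the two competing networks; the 100-dimensional noise vector and 784-pixel images are assumed toy sizes, and the training game is only summarized in comments:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Generator: turns a random noise vector into a fake item
# (here a flattened 28x28 image; both sizes are illustrative choices)
generator = Sequential([
    Dense(128, activation='relu', input_dim=100),
    Dense(784, activation='sigmoid'),
])

# Discriminator: guesses whether an item is real (1) or a forgery (0)
discriminator = Sequential([
    Dense(128, activation='relu', input_dim=784),
    Dense(1, activation='sigmoid'),
])
discriminator.compile(loss='binary_crossentropy', optimizer='adam')

# Training alternates: the discriminator learns to output 0 for generated
# items and 1 for real ones, while the generator is updated to make
# discriminator(generator(noise)) approach 1.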
When we talk about a Machine Learning model, we are really talking about how well it performs, which we measure through its prediction error. Suppose we are designing a machine learning model. A model is said to be a good machine learning model if it generalizes properly to any new input data from the problem domain. This lets us make predictions on future data that the model has never seen. Now, suppose we want to check how well our machine learning model learns and generalizes to new data. For that, we look at overfitting and underfitting, which are largely responsible for the poor performance of machine learning algorithms.
Bias: Bias refers to the error due to overly simplistic assumptions in the learning algorithm. These assumptions make the model easier to comprehend and learn but might not capture the underlying complexities of the data. It is the error due to the model’s inability to represent the true relationship between input and output accurately. When a model performs poorly on both the training and testing data, it has high bias because the model is too simple, indicating underfitting.
Variance: Variance, on the other hand, is the error due to the model’s sensitivity to fluctuations in the training data.
It’s the variability of the model’s predictions for different instances of training data. High variance occurs when a
model learns the training data’s noise and random fluctuations rather than the underlying pattern. As a result, the
model performs well on the training data but poorly on the testing data, indicating overfitting.
A statistical model or a machine learning algorithm is said to be underfitting when it is too simple to capture the complexities of the data. Underfitting reflects the inability of the model to learn the training data effectively, resulting in poor performance on both the training and testing data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. It mainly happens when we use a very simple model with overly simplified assumptions. To address underfitting, we need more complex models with enhanced feature representation and less regularization.
Note: An underfitting model has high bias and low variance.
Reasons for underfitting include the following:
1. The model is too simple, so it may not be capable of representing the complexities in the data.
2. The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
A common remedy is to increase the number of epochs or the duration of training to get better results.
A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained so heavily on the details of the data set, it starts learning from the noise and the inaccurate entries in the data; testing on new data then yields high variance. The model fails to categorize the data correctly because of too many details and noise. Non-parametric and non-linear methods are common causes of overfitting, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. Ways to avoid overfitting include using a linear algorithm if we have linear data, or constraining parameters such as the maximal depth if we are using decision trees.
In a nutshell, overfitting is a problem where the evaluation of a machine learning algorithm on training data differs from its evaluation on unseen data. Ways to reduce overfitting include the following:
1. Improving the quality of training data reduces overfitting by focusing on meaningful patterns, mitigating the risk of fitting noise or irrelevant features.
2. Increasing the amount of training data can improve the model’s ability to generalize to unseen data and reduce the likelihood of overfitting.
3. Early stopping during the training phase (keep an eye on the loss during training, and stop training as soon as the loss begins to increase).
Ideally, a model that makes predictions with zero error is said to have a good fit on the data. This situation is achievable at a spot between overfitting and underfitting. To understand it, we have to look at the performance of our model over time as it learns from the training dataset.
As time passes, our model keeps learning, and the error for the model on the training and testing data keeps decreasing. If it learns for too long, however, the model becomes more prone to overfitting due to the presence of noise and less useful details, and its performance will decrease. To get a good fit, we stop at a point just before the error starts increasing. At this point, the model is said to perform well on the training dataset as well as on our unseen testing dataset.
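In Keras, this "stop just before the error starts increasing" idea is what the EarlyStopping callback automates. A brief sketch, assuming a compiled model and a held-out validation split already exist:

from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss stops improving; patience tolerates brief plateaus,
# and restore_best_weights rolls back to the point just before overfitting began.
es = EarlyStopping(monitor='val_loss', mode='min', patience=10,
                   restore_best_weights=True)

# model, X_train, y_train, X_val, y_val are assumed to be defined already:
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[es])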
In deep learning, the "generalization gap" refers to the difference between how well a model performs on its training data and how well it performs on unseen data. "Underfitting" occurs when a model performs poorly on both training and new data because it is too simple, while "overfitting" happens when a model performs exceptionally well on training data but poorly on new data because it has essentially memorized the training data too closely, leading to a large generalization gap.
Underfitting:
Cause: A model that is too simple, not capturing the underlying patterns in the data well.
Characteristics: High error on both training and test data, indicating the model has not learned enough from the training data.
Solution: Use a more complex model, richer feature representations, or less regularization.
Overfitting:
Cause: A model that is too complex, learning the noise in the training data rather than the actual patterns.
Characteristics: Very low error on training data but high error on test data, showing the model is not generalizing well.
Solution: Use regularization techniques like L1/L2, dropout, data augmentation, or early stopping to prevent the model from overfitting.
A small gap between training and validation errors, with both being high, suggests underfitting.
The goal in deep learning is to find the "sweet spot" where the model is complex enough to capture the relevant patterns in
the data but not so complex that it overfits to the training data, minimizing the generalization gap.
Training Deep Neural Networks is a challenging quest. Over the years, researchers have come up with different methods
to accelerate and stabilize the learning process. Normalization is one technique that proved to be very effective in doing
this.
Different types of normalizations
In this blog, I will review every one of these methods using analogies and visualization, which will help you understand the
motivation and thought process behind them.
Why Normalization?
Imagine that we have two features and a simple neural network. One is age with a range between 0 and 65, and another
is salary ranging from 0 to 10 000. We feed those features to the model and calculate gradients.
Different scales of inputs cause different weight updates and optimizer steps towards the minimum. They also make the shape of the loss function disproportional. In that case, we need to use a lower learning rate to avoid overshooting, which means a slower learning process.
The solution is input normalization. It rescales features by subtracting the mean (centering) and dividing by the standard deviation.
This process is also called ‘whitening’, where the values have 0 mean and unit variance. It provides faster convergence and
more stable training.
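A quick sketch of that whitening step on the two example features, with illustrative numbers:

import numpy as np

# Two features on very different scales: age (0-65) and salary (0-10 000)
X = np.array([[25.0, 3000.0],
              [47.0, 8800.0],
              [33.0, 5100.0]])

# Subtract the mean (centering) and divide by the standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0))  # ~0 for each feature
print(X_norm.std(axis=0))   # ~1 for each feature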
It’s such a neat solution, so why don’t we normalize the activations of each layer in the network?
Activation
Batch Normalization
N — batch, C — channels, H, W — spatial width and height.
In 2015, Sergey Ioffe and Christian Szegedy[3] picked up that idea to solve the internal covariate shift issue. In plain English, this means that the distribution of each layer’s inputs is constantly changing due to weight updates. In that case, the following layer always needs to adapt to the new distribution, which causes slower convergence and unstable training.
Batch Normalization presents a way to control and optimize the distribution after each layer. The process is identical to the
input normalization, but we add two learnable parameters, γ, and β.
Instead of writing out all the maths equations, I created code snippets, which I find more readable and intuitive.
These two parameters are learned along with the rest of the network using backpropagation. They optimize the distribution by scaling (γ) and shifting (β) the activations.
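As a sketch, the batch-norm forward pass at training time looks like this for a simple (N, D) batch of activations; at test time, pre-computed running statistics would replace the per-batch mean and variance:

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: activations of shape (N, D) for a batch of N examples
    mean = x.mean(axis=0)                  # statistics are computed per batch
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learnable scaling and shifting

x = np.random.randn(32, 4)                 # batch of 32 examples, 4 activations each
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))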
Since we have fixed distributions, we can increase the learning rate and speed up convergence. Besides the computational boost, BN also serves as a regularisation technique: the noise generated by approximating the dataset’s statistics can remove the need for Dropout.
But it’s a double-edged sword. This estimation is only tolerable for larger batches. When the number of examples is smaller,
the performance decreases dramatically.
Another downside of the BN is testing time. Let’s use a self-driving car as an example. You pass a single frame recorded by
the camera during the driving rather than the batch of images. In this case, the network has to use pre-computed mean and
variance from training, which might lead to different results.
The significance of this problem pushed the community to create alternative methods to avoid dependency on the batch.
Layer Normalization
It is the first attempt, made by Geoffrey E. Hinton et al. in 2016[4], to reduce the batch-size constraint. It was motivated mainly by recurrent neural networks, where it was unclear how to apply BN.
RNN architecture
In Deep Neural Networks, it’s easy to store statistics for each BN layer since the number of layers is fixed. However, in RNNs, the input and output shapes vary in length, so in this case it’s better to normalize using the statistics of a single timestep (example) rather than the whole batch.
In this method, every example in the batch (N) is normalized across the [C, H, W] dimensions. Like BN, it speeds up and stabilizes training, but without the constraint on the batch. Additionally, this method can be used in online learning tasks where the batch size is equal to 1.
Instance Normalization
Instance Normalization was introduced in a 2016 paper[5] by Dmitry Ulyanov et al. It was another attempt to reduce dependency on the batch, aimed at improving the results of the style transfer network.
Normalizing each example and channel independently across the spatial dimensions removes instance-specific contrast information from the image, which helps with generalization.
This method gained popularity among generative models like Pix2Pix or CycleGAN and became a precursor to the
Adaptive Instance Normalization used in the famous StyleGAN2.
Group Normalization
Group Normalization was introduced in a 2018 paper[1], and it directly addresses the BN limitations for CNNs. The main concern is distributed training, where the batch is split across many machines. Each machine trains on a small number of examples, like 6–8 and, in some cases, even 1–2.
Distributed Learning
To fix it, they introduce a hybrid of layer and instance normalization. GN divides channels into groups and normalizes
across them. This scheme makes computation independent of the batch sizes.
GN outperforms BN trained on smaller batches but can’t beat the large-batch results. Nevertheless, it was a good starting point that led to another method which, combined with GN, exceeds BN’s results.
ResNet-50’s Validation Error on ImageNet.
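Since Layer, Instance, and Group Normalization differ from BN only in which axes of the (N, C, H, W) tensor the statistics are computed over, one NumPy sketch can cover all three; splitting the 6 channels into 3 groups of 2 is an arbitrary illustrative choice:

import numpy as np

x = np.random.randn(8, 6, 4, 4)  # (N, C, H, W)

def normalize(x, axes, eps=1e-5):
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

ln = normalize(x, (1, 2, 3))      # Layer Norm: per example, over C, H, W
inorm = normalize(x, (2, 3))      # Instance Norm: per example and channel, over H, W
g = x.reshape(8, 3, 2, 4, 4)      # Group Norm: split 6 channels into 3 groups of 2
gn = normalize(g, (2, 3, 4)).reshape(x.shape)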
Weights
Weight Standardization
We have already normalized the inputs and the layer outputs. The only thing left is the weights. Weights can grow large without any control, especially when we normalize the outputs anyway. By standardizing the weights, we achieve a smoother loss landscape and more stable training.
As I mentioned before, weight standardization is an excellent accompaniment to group normalization. Combining those
methods produces better results than BN(with large batches) using only one sample per machine.
They also present a method called Batch-Channel Normalization (BCN). In a nutshell, it’s just BN and GN used at the same time for each layer.
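A sketch of the idea, standardizing each output filter of a convolution before it is applied; the tensor layout is an assumed convention:

import numpy as np

def standardize_weights(W, eps=1e-5):
    # W: conv weights of shape (out_channels, in_channels, kH, kW)
    mean = W.mean(axis=(1, 2, 3), keepdims=True)  # one mean per output filter
    std = W.std(axis=(1, 2, 3), keepdims=True)
    return (W - mean) / (std + eps)

W = np.random.randn(16, 3, 3, 3)
W_hat = standardize_weights(W)  # used in place of W for the convolution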
Conclusions
Normalization is an essential concept in Deep Learning. It speeds up computation and stabilizes training. There are plenty of
different techniques that evolved over the years. Hopefully, you got the underlying idea behind them and now know with
certainty why and where to use them in your project.
Normalization is an essential step in the preprocessing of data for machine learning models, and it is a feature scaling technique. Normalization is especially crucial for data manipulation, scaling the range of data down or up before it is utilized in subsequent stages in fields such as soft computing and cloud computing. Min-max scaling and Z-Score Normalisation (Standardisation) are the two methods most frequently used for normalization in feature scaling. Normalization means scaling the data to be analyzed to a specific range, such as [0.0, 1.0], to provide better results.
Data normalization is a vital pre-processing, mapping, and scaling method that helps forecasting and prediction models
become more accurate. The current data range is transformed into a new, standardized range using this method.
Normalization is extremely important when it comes to bringing disparate prediction and forecasting techniques into
harmony. Data normalization improves the consistency and comparability of different predictive models by standardizing the
range of independent variables or features within a dataset, leading to more steady and dependable results.
Normalisation, which involves reshaping numerical columns to conform to a standard scale, is essential for datasets with
different units or magnitudes across different features. Finding a common scale for the data while maintaining the intrinsic
variations in value ranges is the main goal of normalization. This usually entails rescaling the features to a standard range,
which is typically between 0 and 1. Alternatively, the features can be adjusted to have a mean of 0 and a standard deviation
of 1.
Z-Score Normalisation (Standardisation) and Min-Max Scaling are two commonly used normalisation techniques. In order
to enable more insightful and precise analyses in a variety of predictive modelling scenarios, these techniques are essential in
bringing disparate features to a comparable scale.
There are several reasons for the need for data normalization as follows:
Normalisation is essential to machine learning for a number of reasons. Throughout the learning process, it
guarantees that every feature contributes equally, preventing larger-magnitude features from overshadowing others.
It enables faster convergence of algorithms for optimisation, especially those that depend on gradient descent.
Normalisation improves the performance of distance-based algorithms like k-Nearest Neighbours.
Normalisation improves overall performance by addressing model sensitivity problems in algorithms such
as Support Vector Machines and Neural Networks.
Because it assumes uniform feature scales, it also supports the use of regularisation techniques like L1 and L2
regularisation.
In general, normalisation is necessary when working with attributes that have different scales; otherwise, the
effectiveness of a significant attribute that is equally important (on a lower scale) could be diluted due to other
attributes having values on a larger scale.
Min-Max normalization:
This method of normalising data involves a linear transformation of the original data. The data’s minimum and maximum values are obtained, and each value X is then changed using the following formula:
X' = (X - X_min) / (X_max - X_min)
The formula works by subtracting the minimum value from the original value to determine how far the value is from the minimum. Then, it divides this difference by the range of the variable (the difference between the maximum and minimum values).
This division scales the variable to a proportion of the entire range. As a result, the normalized value falls between 0 and 1.
When the feature X is at its minimum, the normalized value X' is 0, because the numerator becomes zero; when X is at its maximum, X' is 1.
For values between the minimum and maximum, X' ranges between 0 and 1, preserving the relative position of X within the original range.
Decimal scaling normalization:
The data is normalised by shifting the decimal point of its values: each data value is divided by the maximum absolute value of the data, via an appropriate power of 10. The following formula is used to normalise a data value v to v':
v' = v / 10^j
where v' is the normalized value, v is the original value, and j is the smallest integer such that max(|v'|) < 1. Dividing each data value by this power of 10 ensures that the resulting normalized values are within a specific range.
Z-Score Normalisation (Standardisation):
Using the mean and standard deviation of the data, values are normalised in this technique to follow a standard normal distribution (mean 0, standard deviation 1). The formula applied is:
v' = (v - Ā) / σ_A
where Ā is the mean of the attribute A and σ_A is its standard deviation.
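The three formulas can be sketched in a few lines of NumPy on an illustrative array:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: X' = (X - X_min) / (X_max - X_min)
min_max = (v - v.min()) / (v.max() - v.min())

# Decimal scaling: v' = v / 10**j, with the smallest j giving max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / 10**j

# Z-score normalization: v' = (v - mean) / std
z_score = (v - v.mean()) / v.std()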
Normalization vs. Standardization:
Normalization scales the values of a feature to a specific range, often between 0 and 1; standardization scales the features to have a mean of 0 and a standard deviation of 1.
Normalization is applicable when the feature distribution is uncertain; standardization is effective when the data distribution is Gaussian.
Normalization maintains the shape of the original distribution; standardization alters it.
Normalization scales values to ranges like [0, 1]; standardized values are not constrained to a specific range.
The kind of data being used and the particular needs of the machine learning algorithm being used will determine whether to
use normalization or standardization.
When the data distribution is unknown or non-Gaussian, normalization—which is frequently accomplished through MinMax
scaling—is especially helpful. It works well in situations when maintaining the distribution’s original shape is essential.
Since this method scales values between [0, 1], it can be used in applications where a particular range is required.
Normalisation is more susceptible to outliers, so it might not be the best option when there are extreme values.
However, when the distribution of the data is unknown or assumed to be Gaussian, standardization—achieved through Z-
score normalization—is preferred. Values can be more freely chosen because standardisation does not limit them to a
predetermined range. Additionally, because it is less susceptible to outliers, it can be used with datasets that contain extreme
values. Although standardisation modifies the initial distribution shape, it is beneficial in situations where preserving the
relationships between data points is crucial.
Normalizing a database offers several advantages:
The removal of redundant and null values to produce more compact data.
Conceptual clarity and simplicity of upkeep, enabling simple adaptations to changing needs.
Because more rows can fit on a data page with narrower tables, searching, sorting, and index creation are more efficient.
There are various drawbacks to normalizing a database. A few disadvantages are as follows:
Joining data becomes harder when the information is spread across multiple tables, and tracing data through the database takes more work.
Because normalized tables store coded references rather than the actual values, queries must repeatedly join against lookup tables to recover the real data.
A highly normalized information model is hard to query directly because it is designed for programs, not ad hoc queries; user-friendly query tools, typically built from SQL accumulated over time, usually perform this function. If you don’t first understand the needs of the client, it can be difficult to design the model well.
Compared with a denormalized design, query performance gradually slows down.
Conclusion
To summarise, data normalisation is one of the most important aspects of machine learning preprocessing, and it can be achieved using techniques such as Min-Max Scaling and Z-Score Normalisation. This procedure, which is necessary for equal feature contribution, faster convergence, and improved model performance, requires a careful decision between Z-Score Normalisation and Min-Max Scaling based on the particulars of the data. Both strategies involve trade-offs: Min-Max Scaling is more sensitive to outliers, while Z-Score Normalisation does not constrain values to a fixed range. Making an informed choice between normalisation techniques depends on having a solid grasp of both the nature of the data and the particular needs of the machine learning algorithm being used.