Unit 3 ML
Neural networks - Neural networks are computational models that mimic the complex
functions of the human brain. They consist of interconnected nodes, or neurons, that
process and learn from data, enabling tasks such as pattern recognition and decision-making in
machine learning.
Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers, and an output layer. It uses
nonlinear activation functions.
Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is an artificial neural
network designed for processing sequential data. Because it makes use of feedback loops,
which allow information to persist within the network, it is appropriate for applications
where contextual dependencies are critical, such as time series prediction and natural language
processing.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to selectively
read, write, and erase information.
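As a minimal sketch of how these architectures look in code (assuming TensorFlow's Keras API; the layer sizes here are arbitrary illustrative choices, not prescribed values):

# Minimal sketch of an MLP and an LSTM in Keras (layer sizes are arbitrary).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, LSTM

# MLP: input layer -> hidden layers with nonlinear activations -> output layer.
mlp = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),  # hidden layer 1
    Dense(32, activation="relu"),                     # hidden layer 2
    Dense(3, activation="softmax"),                   # output layer (3 classes)
])

# LSTM: processes a sequence of 10 timesteps with 8 features each;
# memory cells and gates let it retain information across timesteps.
lstm = Sequential([
    LSTM(32, input_shape=(10, 8)),
    Dense(1, activation="sigmoid"),
])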
CNN
Types of layers:
Let’s take an example by running a convnet on an image of dimension 32 x 32 x 3.
Input Layers: This is the layer in which we give input to our model. In a CNN, the input
is generally an image or a sequence of images. This layer holds the raw input image with width
32, height 32, and depth 3.
Convolutional Layers: This layer extracts features from the input
dataset. It applies a set of learnable filters, known as kernels, to the input images. The
filters/kernels are small matrices, usually of shape 2×2, 3×3, or 5×5. Each filter slides over the
input image data and computes the dot product between the kernel weights and the corresponding
input image patch. The outputs of this layer are referred to as feature maps. Suppose we use a
total of 12 filters for this layer; we’ll get an output volume of dimension 32 x 32 x 12.
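A short sketch of this shape arithmetic (assuming TensorFlow's Keras API; 'same' padding is assumed here so that the 32 x 32 spatial size is preserved, matching the example above):

# Sketch: verify the 32x32x3 -> 32x32x12 example with 12 filters of size 3x3.
import tensorflow as tf
from tensorflow.keras.layers import Conv2D

x = tf.random.normal((1, 32, 32, 3))            # one 32x32 RGB image
conv = Conv2D(filters=12, kernel_size=3, padding="same", activation="relu")
y = conv(x)
print(y.shape)  # (1, 32, 32, 12): one feature map per filter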
Stride is the number of pixels by which the filter shifts over the input matrix. When the stride equals 1,
we move the filter 1 pixel at a time; similarly, when the stride equals 2, we move the filter
2 pixels at a time.
Padding: Sometimes the filter does not fit the input image perfectly.
We have two options:
Pad the picture with zeros (zero-padding) so that it fits.
Drop the part of the image where the filter does not fit.
This is called valid padding, which keeps only the valid part of the image.
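Both options, and the effect of stride, can be read directly off the output shapes (a sketch assuming TensorFlow's Keras API and an arbitrary 32 x 32 single-channel input):

# Sketch of both options: dropping edges ('valid') vs. zero-padding ('same'),
# and the effect of stride.
import tensorflow as tf
from tensorflow.keras.layers import Conv2D

x = tf.random.normal((1, 32, 32, 1))
print(Conv2D(1, 3, padding="valid")(x).shape)            # (1, 30, 30, 1): edges dropped
print(Conv2D(1, 3, padding="same")(x).shape)             # (1, 32, 32, 1): zero-padded to fit
print(Conv2D(1, 3, strides=2, padding="same")(x).shape)  # (1, 16, 16, 1): stride 2 halves size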
Pooling layer: This layer is periodically inserted in the convnet. Its main function is to reduce
the size of the volume, which makes computation faster, reduces memory use, and also helps prevent
overfitting. Two common types of pooling layers are max pooling and average pooling. If we use
a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16x16x12.
The pooling layer plays an important role in the pre-processing of an image: it reduces the
number of parameters when the images are too large. Pooling is a "downscaling" of the image
obtained from the previous layers. It can be compared to shrinking an image to reduce its pixel
density.
Pooling reduces the dimensionality of each feature map but retains the important
information. Since a large number of hidden layers would be required to learn the complex relations
present in the input image, we apply pooling to reduce the feature representation.
Pooling layers reduce the number of parameters when the images are too
large. Spatial pooling, also called subsampling or downsampling, reduces the
dimensionality of each map but retains important information. Spatial pooling can be of
different types:
Max pooling takes the largest element from the rectified feature map. Instead of the largest element,
one could also take the average (average pooling). Taking the sum of all elements in the feature map is called sum pooling.
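A sketch of max and average pooling on the 32x32x12 volume from the example above (assuming TensorFlow's Keras API; a 2x2 window with stride 2 halves each spatial dimension):

# Sketch of max and average pooling; a 2x2 window with stride 2 halves
# each spatial dimension.
import tensorflow as tf
from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D

x = tf.random.normal((1, 32, 32, 12))
print(MaxPooling2D(pool_size=2, strides=2)(x).shape)      # (1, 16, 16, 12)
print(AveragePooling2D(pool_size=2, strides=2)(x).shape)  # (1, 16, 16, 12)
# Sum pooling has no dedicated Keras layer; average pooling times the
# window area (here 4) gives the same values.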
Flattening: After the convolution and pooling layers, the resulting feature maps are flattened
into a one-dimensional vector so they can be passed into a fully connected layer for
classification or regression.
Fully Connected Layers: In this layer, every neuron in one layer is connected to every neuron
in the next layer. The aim of the fully connected layer is to use the high-level feature maps produced by
the convolution and pooling layers to classify the input image into various classes based on the
training dataset. It takes the input from the previous layer and computes the final classification or
regression output.
Output Layer: The output from the fully connected layers is then fed into a logistic function for
classification tasks, such as sigmoid or softmax, which converts the raw output for each class into a
probability score.
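Putting these layers together, here is a minimal end-to-end sketch of the CNN described above (assuming TensorFlow's Keras API; the 10 output classes and the 64-unit dense layer are arbitrary illustrative choices):

# Minimal CNN sketch for 32x32x3 inputs and 10 classes.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(12, 3, padding="same", activation="relu",
           input_shape=(32, 32, 3)),   # convolution: 32x32x12
    MaxPooling2D(2),                   # pooling: 16x16x12
    Flatten(),                         # flattening: 3072-dim vector
    Dense(64, activation="relu"),      # fully connected layer
    Dense(10, activation="softmax"),   # output layer: class probabilities
])
model.summary()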
Types of Layers in Neural Network
A neural network is constructed from 3 types of layers:
Input layer: the initial data for the neural network.
Hidden layers: intermediate layers between the input and output layers, where all the computation is done.
Output layer: produces the result for the given inputs.
Q. What do you mean by the terms convolution layer, pooling layer, loss layer, dense layer?
Describe each one in brief.
Padding and Stride: Padding adds zeros around the input data to preserve spatial
information at the edges. Stride determines the step size of the filter across the input,
controlling the spatial dimensions of the output feature maps.
Pooling Layer
In Convolutional Neural Networks (CNNs), the output feature maps from the
convolutional layers are downsampled by using pooling layers.
The main purpose of pooling is to reduce the size of feature maps, which in turn makes
computation faster. Pooling layers reduce the number of
parameters while maintaining the most relevant information.
Spatial pooling, also called subsampling or downsampling, reduces the dimensionality
of each map but retains important information. Spatial pooling can be of different types:
Max Pooling, Average Pooling, Sum Pooling
Max Pooling - a pooling operation that selects the maximum element from the region
of the feature map covered by the filter; the summary of the features in a region is
represented by the maximum value in that region. It is mostly used when the image has a
dark background, since max pooling will select the brighter pixels.
Min Pooling - In this type of pooling, the summary of the features in a region is represented
by the minimum value in that region. It is mostly used when the image has a light background
since min pooling will select darker pixels.
Average Pooling - In this type of pooling, the summary of the features in a region is
represented by the average value of that region. Average pooling smooths the harsh edges of a
picture and is used when such edges are not important.
Global Pooling - The maximum or average value over the full spatial dimensions of the input
feature map is calculated using global pooling. Global pooling is often used to prepare the data
from a convolutional layer to be utilized in a fully connected layer.
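A one-line sketch of this (assuming TensorFlow's Keras API): global average pooling collapses each feature map to a single number, producing one value per channel.

# Sketch: global average pooling turns a 16x16x12 volume into 12 numbers.
import tensorflow as tf
from tensorflow.keras.layers import GlobalAveragePooling2D

x = tf.random.normal((1, 16, 16, 12))
print(GlobalAveragePooling2D()(x).shape)  # (1, 12): ready for a dense layer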
Translation invariance: Pooling layers are also useful in achieving translation invariance in the
feature maps. This means that the position of an object in the image does not affect the
classification result, as the same features are detected regardless of the position of the object.
Feature selection: Pooling layers can also help in selecting the most important features from the
input, as max pooling selects the most salient features and average pooling preserves more
information.
Loss Layer - The loss functions are used in the output layer to calculate the deviation between the output
that is predicted and the actual output. Depending upon the usage, we use different loss functions.
Softmax Loss Function/Cross-Entropy: It is used for measuring the model's performance in
classification. Softmax converts the raw output for each class into a probability value within [0,1],
with the values forming a probability distribution, and cross-entropy measures how far this
predicted distribution deviates from the true labels.
The loss layer, also known as the cost function or objective function, is a crucial component of a machine
learning model, particularly in supervised learning tasks such as classification or regression
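A worked sketch on made-up numbers, showing how softmax turns raw scores into probabilities and how cross-entropy then scores the prediction (the logits and the true class are arbitrary):

# Softmax + cross-entropy on made-up numbers: raw scores for 3 classes are
# turned into probabilities; the loss is the negative log probability
# assigned to the true class (class 0 here).
import numpy as np

logits = np.array([2.0, 1.0, 0.1])               # arbitrary raw outputs
probs = np.exp(logits) / np.exp(logits).sum()    # softmax: values sum to 1
print(probs)                                     # [0.659 0.242 0.099] approximately
true_class = 0
loss = -np.log(probs[true_class])                # cross-entropy for one sample
print(loss)                                      # ~0.417: small, since the model is confident and correct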
Dense Layer -
A dense layer, also known as a fully connected layer, is a type of neural network layer where
each neuron is connected to every neuron in the previous layer. Dense layers are fundamental
building blocks in feedforward neural networks, including multilayer perceptrons (MLPs).
Dense layers are crucial for learning complex patterns in data and are commonly used in the
final stages of deep learning models for tasks like classification and regression.
Keras - is one of the most powerful and easy-to-use Python libraries for creating
deep learning models, built on top of popular deep learning libraries like
TensorFlow and Theano.
Keras is an open-source, high-level neural network library written in Python, capable
of running on Theano, TensorFlow, or CNTK. It was developed by a Google engineer,
Francois Chollet. It is made user-friendly, extensible, and modular to facilitate faster
experimentation with deep neural networks. It supports not only Convolutional Networks and
Recurrent Networks individually but also their combination.
Applications of Keras
Keras is used for creating deep models which can be productized on smartphones.
Keras is also extensively used in deep learning competitions to create and deploy working
models in a short amount of time.
Keras is a high-level neural networks API written in Python that works as an interface for
artificial neural networks. It's known for its simplicity, modularity, and ease of use. Here are some
key features of the Keras framework:
User-Friendly API: Keras provides a simple and intuitive interface that makes it easy to design,
build, and experiment with neural network models. Its user-friendly design is particularly beneficial
for beginners and researchers.
Modularity: Keras enables building neural networks using a modular approach. Neural network
architectures can be constructed by assembling individual layers, allowing for easy
experimentation and customization.
Compatibility with Multiple Backends: Keras is compatible with multiple deep learning backend
engines, including TensorFlow, Theano, and Microsoft Cognitive Toolkit (CNTK). This flexibility
allows users to choose the backend that best suits their needs
Support for Convolutional and Recurrent Networks: Keras provides built-in support for
building Convolutional Neural Networks (CNNs) for tasks such as image classification and object
detection, as well as Recurrent Neural Networks (RNNs) for tasks such as sequence modeling and
natural language processing.
Ease of Prototyping: Keras allows for rapid prototyping of neural network architectures by
providing a wide range of pre-built layers, activation functions, optimizers, and loss functions. This
enables users to quickly experiment with different configurations and hyperparameters.
Visualization Tools: Keras provides built-in utilities for visualizing neural network architectures,
training/validation curves, and model performance metrics. These visualization tools aid in
understanding and debugging neural network models.
Easy Model Saving and Loading: Keras makes it simple to save trained models to disk and load
them for inference or further training. Models can be saved in various formats, including HDF5 and
JSON, making them compatible with different platforms and environments.
Customizability: Keras allows users to define custom layers, loss functions, metrics, and callbacks,
enabling advanced customization and integration of domain-specific requirements into neural
network models.
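A minimal sketch of the workflow these features describe (build, compile, train, save, reload), assuming TensorFlow's Keras backend; the small synthetic dataset and the "model.h5" filename are purely illustrative:

# Build, compile, train, save, and reload a Keras model on synthetic data.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import load_model

x = np.random.rand(100, 8)                 # 100 samples, 8 features (synthetic)
y = np.random.randint(0, 2, size=(100,))   # synthetic binary labels

model = Sequential([
    Dense(16, activation="relu", input_shape=(8,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=2, verbose=0)

model.save("model.h5")                     # save in HDF5 format...
restored = load_model("model.h5")          # ...and load for inference or further training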
1X1 Convolution
A problem with deep convolutional neural networks is that the number of feature maps often
increases with the depth of the network. This problem can result in a dramatic increase in the
number of parameters and computation required when larger filter sizes are used, such as 5×5 and
7×7.
To address this problem, a 1×1 convolutional layer can be used that offers a channel-wise pooling,
often called feature map pooling or a projection layer. This simple technique can be used for
dimensionality reduction, decreasing the number of feature maps whilst retaining their salient
features. It can also be used directly to create a one-to-one projection of the feature maps to pool
features across channels or to increase the number of feature maps, such as after traditional
pooling layers.
A filter applied to an input image or feature map always results in a single number.
Systematic application of the filter from left to right and top to bottom creates a two-
dimensional feature map. Each filter produces one corresponding feature map.
The filter must match the depth (number of channels) of the input. Regardless of the input
and filter depth, the output is a single number, creating a feature map with a single
channel.
Concrete examples:
For a grayscale image (one channel), a 3×3 filter is applied in 3x3x1 blocks.
For a color image with three channels (red, green, blue), a 3×3 filter is applied in
3x3x3 blocks.
For a block of feature maps with a depth of 64 from another layer, a 3×3 filter is
applied in 3x3x64 blocks to create the single values for the output feature map.
The depth of the output of one convolutional layer is defined only by the number of parallel
filters applied to the input.
Pooling layers reduce spatial dimensions but not the number of feature maps. Thus, a
method to reduce the depth is needed.
Down sample Feature Maps With 1×1 Filters
A 1×1 convolutional layer helps by:
Dimensionality Reduction: Reducing the number of feature maps while retaining
important features.
Efficiency: Each 1×1 filter has one weight per input channel, acting like a neuron
across the input feature maps.
Nonlinearity: Applying nonlinear functions enables complex transformations.
This simple method summarizes input feature maps and allows control over the depth of
feature maps. It can be used to increase or decrease the number of feature maps as
needed, often referred to as a projection layer or channel pooling layer.
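A short sketch of the projection-layer idea (assuming TensorFlow's Keras API; the 28x28x64 input and the reduction to 16 channels are arbitrary illustrative choices):

# 1x1 convolution as channel-wise pooling: depth drops from 64 to 16
# while the spatial size is untouched.
import tensorflow as tf
from tensorflow.keras.layers import Conv2D

x = tf.random.normal((1, 28, 28, 64))                        # 64 feature maps from a previous layer
proj = Conv2D(filters=16, kernel_size=1, activation="relu")  # projection layer
print(proj(x).shape)  # (1, 28, 28, 16): depth reduced, 28x28 preserved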
Inception Blocks
Conventional convolutional neural networks typically use convolutional and pooling layers
to extract features from the input data. However, these networks are limited in capturing
local and global features, as they typically focus on either one or the other. The inception
blocks in the InceptionNet architecture are intended to solve the problem of learning a
combination of local and global features from the input data.
Inception blocks address this problem using a modular design that allows the network to
learn a variety of feature maps at different scales. These feature maps are
then concatenated together to form a more comprehensive representation of the input
data. This allows the network to capture a wide range of features, including both low-level
and high-level features, which can be useful for tasks such as image classification.
By using inception blocks, the Inception Net architecture can learn a more comprehensive
set of features from the input data, which can improve the network's performance on tasks
such as image classification.
The “naive” inception module performs convolution on an input
with 3 different sizes of filters (1x1, 3x3, 5x5). Additionally, max pooling is also performed.
The outputs are concatenated and sent to the next inception module.
When designing a network, it is often difficult to determine the best filter sizes and whether to use
pooling layers. To overcome this, the inception architecture uses many different filter sizes and pooling layers in
parallel, the outputs of which are concatenated and input to the next block; in this way the network
itself chooses which filter sizes, or combination of them, to use. To solve the problem of a large computational cost, the
inception network utilises 1x1 convolutions to shrink the volume of the next layer.
As stated before, deep neural networks are computationally expensive. To make them
cheaper, the authors limit the number of input channels by adding an extra 1x1
convolution before the 3x3 and 5x5 convolutions. Though adding an extra operation
may seem counterintuitive, 1x1 convolutions are far cheaper than 5x5
convolutions, and the reduced number of input channels also helps. Do note,
however, that in the pooling branch the 1x1 convolution is introduced after the max pooling layer, rather than
before it.
How does an Inception Module Work?
An Inception Module is a building block used in the Inception network architecture
for CNNs.
It improves performance by allowing multiple parallel convolutional filters to be
applied to the input data.
The basic structure of an Inception Module is a combination of multiple convolutional
filters of different sizes applied in parallel to the input data.
The filters may have different kernel sizes (e.g. 3x3, 5x5) and/or different strides
(e.g. 1x1, 2x2).
Output of each filter is concatenated together to form a single output feature map.
Inception Module also includes a max pooling layer, which takes the maximum value
from a set of non-overlapping regions of the input data.
This reduces the spatial dimensionality of the data and allows for translation
invariance.
The use of multiple parallel filters and max pooling layers allows the Inception
Module to extract features at different scales and resolutions, improving the
network's ability to recognize patterns in the input data.
In summary, the Inception module improves feature extraction, improving the
network's performance.
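A sketch of such a module with the 1x1 reductions in place, using the Keras functional API (assumed available); the 28x28x192 input shape and the filter counts are illustrative choices:

# Inception module with dimensionality reduction, loosely following the
# structure described above.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, concatenate

inp = Input(shape=(28, 28, 192))

# 1x1 branch
b1 = Conv2D(64, 1, padding="same", activation="relu")(inp)
# 1x1 reduction before the expensive 3x3 convolution
b2 = Conv2D(96, 1, padding="same", activation="relu")(inp)
b2 = Conv2D(128, 3, padding="same", activation="relu")(b2)
# 1x1 reduction before the 5x5 convolution
b3 = Conv2D(16, 1, padding="same", activation="relu")(inp)
b3 = Conv2D(32, 5, padding="same", activation="relu")(b3)
# max pooling followed by a 1x1 projection (note: 1x1 after pooling)
b4 = MaxPooling2D(3, strides=1, padding="same")(inp)
b4 = Conv2D(32, 1, padding="same", activation="relu")(b4)

out = concatenate([b1, b2, b3, b4])  # 64+128+32+32 = 256 feature maps
model = Model(inp, out)
print(model.output_shape)  # (None, 28, 28, 256)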
Dimensionality reduction technique can be defined as, "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it
provides similar information." These techniques are widely used in machine
learning for obtaining a better-fit predictive model while solving classification and
regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Feature extraction: This process is also termed feature projection, wherein a multidimensional space
is converted into a space with lower dimensions. Some known feature extraction methods include principal
component analysis (PCA), linear discriminant analysis (LDA), kernel PCA (K-PCA), and quadratic
discriminant analysis (QDA).
Principal Component Analysis is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of orthogonal transformation.
These new transformed features are called the Principal Components.
PCA works by considering the variance of each attribute, because high variance indicates a
good split between the classes, and hence it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and optimizing the
power allocation in various communication channels.
It works on the condition that while the data in a higher dimensional space is mapped to data in a
lower dimension space, the variance of the data in the lower dimensional space should be
maximum.
The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset
while preserving the most important patterns or relationships between the variables without any
prior knowledge of the target variables.
PCA tends to find linear correlations between variables, which is sometimes undesirable.
PCA fails in cases where mean and covariance are not enough to define datasets.
Covariance measures the strength of joint variability between two or more variables, indicating how much they
change in relation to each other. The covariance between two variables measures how they change together. The
covariance matrix for a dataset with n features is an n x n matrix that summarizes the relationships between all pairs
of features.
The next step is to compute the eigenvalues and eigenvectors of the covariance matrix. These eigenvalues represent
the amount of variance explained by each eigenvector (principal component). Eigenvalues and eigenvectors are
mathematical concepts related to linear transformations and matrices. In the context of PCA, they play a central role
in identifying the principal components. Here’s what they mean:
Eigenvalue: An eigenvalue (λ) represents a scalar that indicates how much variance is explained by the
corresponding eigenvector. In PCA, eigenvalues quantify the importance of each principal component.
They are always non-negative, and the eigenvalue corresponding to a principal component measures the
proportion of the total variance in the data explained by that component.
Eigenvector: An eigenvector (v) is a vector associated with an eigenvalue. In PCA, eigenvectors represent
the directions along which the data varies the most. Each eigenvector points in a specific direction in the
feature space and corresponds to a principal component. Eigenvectors are typically normalized, meaning
their length is 1.
Step 4: Sorting Eigenvalues and Eigenvectors
To identify the most significant principal components, sort the eigenvalues in descending
order. The corresponding eigenvectors are also sorted accordingly. The first principal
component explains the most variance, the second explains the second most, and so on.
Choose a subset of the top k eigenvectors to form a transformation matrix. After computing
the eigenvalues and eigenvectors of the covariance matrix, they are sorted in descending
order based on the magnitude of their eigenvalues. The principal components are then
selected from the top eigenvectors. The first principal component corresponds to the
eigenvector with the largest eigenvalue, the second principal component corresponds to the
eigenvector with the second-largest eigenvalue, and so on. These principal components are
orthogonal, meaning they are uncorrelated. This matrix is used to project the original data
into a lower-dimensional space, resulting in the reduced dataset.
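A sketch of this whole procedure on made-up data (using NumPy, assumed available): standardize, form the covariance matrix, take its eigenvalues and eigenvectors, sort them, and project onto the top k components.

# PCA from scratch on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features (synthetic)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize: mean 0, std 1
cov = np.cov(X_std, rowvar=False)             # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: covariance matrix is symmetric

order = np.argsort(eigvals)[::-1]             # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
X_reduced = X_std @ eigvecs[:, :k]            # project onto top 2 principal components
print(X_reduced.shape)                        # (100, 2)
print(eigvals / eigvals.sum())                # proportion of variance explained by each component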
Steps of PCA:
1. Standardize the data: PCA requires standardized data, so the first step is to standardize the data to
ensure that all variables have a mean of 0 and a standard deviation of 1.
2. Calculate the covariance matrix: The next step is to calculate the covariance matrix of the
standardized data. This matrix shows how each variable is related to every other variable in the
dataset.
3. Calculate the eigenvectors and eigenvalues: The eigenvectors and eigenvalues of the covariance
matrix are then calculated. The eigenvectors represent the directions in which the data varies the
most, while the eigenvalues represent the amount of variation along each eigenvector.
4. Choose the principal components: The principal components are the eigenvectors with the highest
eigenvalues. These components represent the directions in which the data varies the most and are
used to transform the original data into a lower-dimensional space.
5. Transform the data: The final step is to transform the original data into the lower-dimensional
space defined by the principal components.
One-hot encoding and label encoding are two commonly used techniques for representing
categorical data in machine learning and data analysis. Both techniques are used to convert
categorical variables into a format that can be provided to machine learning algorithms.
Label Encoding:
Label encoding involves assigning a unique integer value to each category in a categorical variable.
Each category is represented by an integer, starting from 0 or 1 and incrementing by 1 for each
subsequent category. For example:
Category A: 0
Category B: 1
Category C: 2
Label encoding doesn't change the dimensionality of the data, as it replaces each category with a
single integer value. However, it introduces ordinality, meaning that the numeric values imply an
order or hierarchy among the categories. This may not always be desirable, especially for nominal
categorical variables where there is no inherent order.
One-Hot Encoding:
One-hot encoding, on the other hand, expands the categorical variable into a binary matrix where
each category is represented by a binary vector. In this encoding scheme, each category is
represented by a vector of length equal to the number of unique categories. The vector has a
value of 1 at the index corresponding to the category and 0 at all other indices. For example:
Category A: [1, 0, 0]
Category B: [0, 1, 0]
Category C: [0, 0, 1]
One-hot encoding increases the dimensionality of the data because each unique category is
represented by its own binary feature. If a categorical variable has n unique categories, one-hot
encoding will result in n new binary features. This can lead to a significant increase in the number
of features, especially for variables with many unique categories. However, one-hot encoding
ensures that the categorical variables are treated as independent binary features, without implying
any ordinal relationship between the categories.
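Both encodings in a short sketch (assuming scikit-learn is available; the sparse_output argument requires a recent scikit-learn version, older versions use sparse instead):

# Label encoding vs. one-hot encoding with scikit-learn.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

categories = np.array(["A", "B", "C", "A"])

# Label encoding: one integer per category, dimensionality unchanged.
print(LabelEncoder().fit_transform(categories))        # [0 1 2 0]

# One-hot encoding: one binary column per unique category.
onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(categories.reshape(-1, 1)))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]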
Q. Break down how a CNN actually operates. The image is downsampled, and the number of filters is
increased as we approach the model output, but why?
When dealing with CNNs, especially in image processing tasks, the images are
typically preprocessed before being fed into the network. This might involve downloading
the images from a dataset or an external source, after which they are usually resized,
normalized, and sometimes augmented. This preprocessing step ensures that the images are
in a consistent format and are suitable for training or inference.
Inside the network, pooling layers and strided convolutions then progressively downsample
the spatial dimensions. The number of filters is increased as we approach the model output
because, as the spatial resolution shrinks, each activation summarizes a larger region of the
image, and more filters are needed to represent the growing variety of increasingly abstract,
high-level features at that scale.
Transfer Learning –
Transfer learning, used in machine learning, is the reuse of a pre-trained model on a
new problem. In transfer learning, a machine exploits the knowledge gained from a
previous task to improve generalization about another.
Transfer Learning refers to the set of methods that allow transferring knowledge
gained from solving specific problems to address another problem.
Transfer learning is a powerful technique in deep learning that allows us to leverage the
knowledge gained from one task to improve performance on another related task. This is
especially useful in deep learning because training deep neural networks can be
computationally expensive and time-consuming, and if you do not have a large amount of
data, you will not be able to train your model from scratch. By using
transfer learning, we can start with a pretrained model that has already learned general
features that are useful for many different tasks. We can then fine-tune this model on our
target task with less data and less training time.
Transfer learning is a technique in machine learning where a model trained on one task is
used as the starting point for a model on a second task. This can be useful when the second
task is similar to the first task, or when there is limited data available for the second task. By
using the learned features from the first task as a starting point, the model can learn more
quickly and effectively on the second task. This can also help to prevent overfitting, as the
model will have already learned general features that are likely to be useful in the second
task.
How does Transfer Learning work?
This is a general summary of how transfer learning works:
Pre-trained Model: Start with a model that has previously been trained for a certain
task using a large set of data. Having been trained on extensive datasets, this model
has identified general features and patterns relevant to numerous related tasks.
Base Model: The model that has been pre-trained is known as the base model. It is
made up of layers that have used the incoming data to learn hierarchical feature
representations.
Transfer Layers: In the pre-trained model, find a set of layers that capture generic
information relevant to the new task as well as the previous one. Because they tend
to learn generic, low-level information, these layers are frequently found near the
input of the network (the early layers).
Fine-tuning: Retrain the chosen layers using the dataset from the new task.
We call this procedure fine-tuning. The goal is to preserve the knowledge from
the pre-training while enabling the model to adapt its parameters to better suit the
demands of the current task.
Ways of doing Transfer Learning:
1. Feature Extraction: We freeze all the convolutional layers of the pretrained model,
use them as a fixed feature extractor, and add and train only a new fully
connected layer on top. This method is useful when the new dataset is small or
similar to the one the pretrained model was trained on.
2. Fine-Tuning: In fine-tuning, we train the last few convolutional layers of the
pretrained model and then add a new fully connected layer. We also train
the fully connected layer. This method is useful when the labels for our
image classification task are new and not present in the dataset used to train
the pretrained model. We keep the weights of the first few convolutional
layers fixed and only train the last few layers and the fully connected layer.
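A sketch of both approaches in Keras (the MobileNetV2 base, the 160x160 input size, the 5-class head, and the choice of unfreezing the last 20 layers are all arbitrary illustrative choices; the "imagenet" weights are downloaded on first use):

# Transfer learning sketch: feature extraction, then fine-tuning.
from tensorflow.keras import Sequential
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

base = MobileNetV2(input_shape=(160, 160, 3), include_top=False, weights="imagenet")

# Feature extraction: freeze the whole base, train only the new head.
base.trainable = False
model = Sequential([base, GlobalAveragePooling2D(), Dense(5, activation="softmax")])

# Fine-tuning: afterwards, unfreeze only the last few layers of the base and
# retrain them together with the head (typically at a small learning rate).
base.trainable = True
for layer in base.layers[:-20]:   # keep all but the last 20 layers fixed
    layer.trainable = False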
When a model performs very well for training data but has poor performance with test data (new data), it is known as
overfitting. In this case, the machine learning model learns the details and noise in the training data such that it
negatively affects the performance of the model on test data. Overfitting can happen due to low bias and high
variance.
Underfitting - When a model has not learned the patterns in the training data well and is unable to
generalize well on the new data, it is known as underfitting. An underfit model has poor performance on the
training data and will result in unreliable predictions. Underfitting occurs due to high bias and low variance.
Reasons for Underfitting
1. The model is too simple, so it may not be capable of representing the complexities in the
data.
2. The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.
3. The size of the training dataset used is not enough.
4. Excessive regularization is used to prevent overfitting, which constrains the model
from capturing the data well.
5. Features are not scaled.
Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features by performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better results.
Overfitting
Definition: Overfitting occurs when a model learns the training data too well, including its noise
and details, which negatively impacts its performance on new, unseen data.
Indicators:
1. High training accuracy and low validation accuracy: The model performs very well on training
data but poorly on validation data.
2. Large gap between training and validation loss: The training loss is much lower than the
validation loss.
In such a case, high training accuracy and low validation accuracy, along with a significant
gap between training and validation loss, suggest overfitting. The model memorizes the training
data but fails to generalize to new data.
Underfitting
Definition: Underfitting occurs when a model is too simple to capture the underlying patterns in
the data, leading to poor performance on both training and validation datasets.
Indicators:
1. Low training and validation accuracy: The model performs poorly on both training and validation
data.
2. High training and validation loss: The losses remain high for both training and validation datasets.
In such a case, the low accuracy and high loss for both training and validation data indicate that
the model is underfitting. It fails to capture the essential patterns in the data.
Overfitting occurs when your neural network learns too much from the training
data and fails to generalize to new or unseen data. Underfitting occurs when
your neural network learns too little from the training data and performs
poorly on both the training and the validation data. Both overfitting and
underfitting can reduce your model's accuracy.
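These indicators can be read directly off a Keras training history. A sketch on synthetic data (the dataset, model, and the 0.15 gap threshold are purely illustrative):

# Reading the overfitting/underfitting indicators from a training run.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

x = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=(200,))

model = Sequential([Dense(64, activation="relu", input_shape=(10,)),
                    Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
hist = model.fit(x, y, validation_split=0.2, epochs=20, verbose=0)

gap = hist.history["accuracy"][-1] - hist.history["val_accuracy"][-1]
if gap > 0.15:                            # arbitrary threshold for this demo
    print("Large train/validation gap: likely overfitting")
elif hist.history["accuracy"][-1] < 0.6:  # poor even on the training data
    print("Low accuracy on both sets: likely underfitting")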