Project Exhibition 2
What is CNN?
Source: GeeksforGeeks
A Convolutional Neural Network (CNN) is a type of Deep Learning neural
network architecture commonly used in Computer Vision. Computer vision is a field
of Artificial Intelligence that enables a computer to understand and interpret the
image or visual data.
Artificial Neural Networks perform remarkably well in Machine Learning and are used on many kinds of data, such as images, audio, and text. Different types of Neural Networks are used for different purposes: for predicting a sequence of words we use Recurrent Neural Networks, more precisely an LSTM, while for image classification we use Convolutional Neural Networks. In this section, we describe the basic building blocks of a CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The
number of neurons in this layer is equal to the total number of features in
our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden
layer. There can be many hidden layers depending on our model and data
size. Each hidden layer can have a different number of neurons, which is generally greater than the number of features. The output of each layer is computed by matrix multiplication of the output of the previous layer with the learnable weights of that layer, followed by the addition of learnable biases and then an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic
function like sigmoid or softmax which converts the output of each class
into the probability score of each class.
Feeding the data through the model and obtaining the output of each layer as described above is called the feedforward pass. We then calculate the error using an error function; common error functions include cross-entropy and squared error. The error function measures how well the network is performing. After that, we backpropagate through the model by calculating the derivatives. This step, called backpropagation, is used to minimize the loss.
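As a tiny, hedged illustration of one feedforward and backpropagation step (the single sigmoid layer, random data, and learning rate below are assumptions, not from the text):

import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 1))          # 4 input features
y = np.array([[1.0]])           # target value
W = rng.random((1, 4))          # learnable weights
b = np.zeros((1, 1))            # learnable bias

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# feedforward: weighted sum, bias, then activation
z = W @ x + b
y_hat = sigmoid(z)
loss = 0.5 * (y_hat - y) ** 2   # squared error

# backpropagation: chain rule gives the gradients of the loss
dz = (y_hat - y) * y_hat * (1 - y_hat)
dW, db = dz @ x.T, dz

# one gradient descent step to reduce the loss
lr = 0.1
W -= lr * dW
b -= lr * db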
Convolution Neural Network
A Convolutional Neural Network (CNN) is an extended version of the artificial neural network (ANN) that is predominantly used to extract features from grid-like matrix datasets, for example visual datasets such as images or videos, where data patterns play an extensive role.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer,
Convolutional layer, Pooling layer, and fully connected layers.
The Convolutional layer applies filters to the input image to extract features, the
Pooling layer downsamples the image to reduce computation, and the fully
connected layer makes the final prediction. The network learns the optimal filters
through backpropagation and gradient descent.
How Convolutional Layers works
Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having its length and width (the dimensions of the image) and its height (i.e., the channels, as images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it, with say K outputs, and representing them vertically. Now slide that neural network across the whole image; as a result, we will get another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but less width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights.
(Figure omitted; image source: Deep Learning Udacity.)
Now let’s talk about a bit of mathematics that is involved in the whole convolution
process.
Convolution layers consist of a set of learnable filters (or kernels) having small widths and heights and the same depth as the input volume (3 if the input is a colour image).
For example, if we have to run a convolution on an image with dimensions 34x34x3, the possible filter sizes are a×a×3, where ‘a’ can be 3, 5, or 7, but smaller than the image dimension.
During the forward pass, we slide each filter across the whole input
volume step by step where each step is called stride (which can have a
value of 2, 3, or even 4 for high-dimensional images) and compute the dot
product between the kernel weights and patch from input volume.
As we slide our filters, we’ll get a 2-D output for each filter; stacking them together gives an output volume with a depth equal to the number of filters. The network will learn all the filters. (A small code sketch of this computation follows the list.)
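The sketch below is a minimal NumPy illustration of this sliding-filter computation (the 34x34x3 input, single 3x3x3 filter, stride 1, no padding, and random data are assumptions for the example, not from the original text): sliding one filter produces one 32x32 activation map, and stacking several such maps gives the output volume.

import numpy as np

H, W, C = 34, 34, 3          # input image: height, width, channels
a, stride = 3, 1             # filter size (a x a x C) and stride
image = np.random.rand(H, W, C)
kernel = np.random.rand(a, a, C)

out_h = (H - a) // stride + 1    # 32
out_w = (W - a) // stride + 1    # 32
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i*stride:i*stride + a, j*stride:j*stride + a, :]
        feature_map[i, j] = np.sum(patch * kernel)   # dot product of kernel and patch

print(feature_map.shape)   # (32, 32); stacking the maps from 12 such filters gives a 32 x 32 x 12 volume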
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a convnet. A convnet is a sequence of layers, and every layer transforms one volume into another through a differentiable function.
Types of layers:
Let’s take an example by running a convnet on an image of dimension 32 x 32 x 3.
Input Layer: This is the layer in which we give input to our model. In a CNN, the input is generally an image or a sequence of images. This layer holds the raw input of the image with width 32, height 32, and depth 3.
Convolutional Layer: This is the layer used to extract features from the input dataset. It applies a set of learnable filters, known as kernels, to the input images. The filters/kernels are small matrices, usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image data and computes the dot product between the kernel weights and the corresponding input image patch. The output of this layer is referred to as feature maps. Suppose we use a total of 12 filters for this layer; we’ll get an output volume of dimension 32 x 32 x 12.
Activation Layer: By adding an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. An element-wise activation function is applied to the output of the convolution layer. Some common activation functions are ReLU (max(0, x)), Tanh, and Leaky ReLU. The volume remains unchanged, so the output volume will have dimensions 32 x 32 x 12.
Pooling Layer: This layer is periodically inserted in the convnet, and its main function is to reduce the size of the volume, which makes computation faster, reduces memory usage, and also helps prevent overfitting. Two common types of pooling layers are max pooling and average pooling. If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of dimension 16 x 16 x 12.
Flattening and Fully Connected Layers: The pooled feature maps are flattened into a one-dimensional vector and passed through one or more fully connected layers, which combine the extracted features for the final prediction.
Output Layer: The output from the fully connected layers is then fed into a logistic function for classification tasks, such as sigmoid or softmax, which converts the output for each class into a probability score. (A minimal model sketch illustrating this stack follows the list.)
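To illustrate how these layers fit together, here is a minimal Keras sketch of the 32 x 32 x 3 example above (the framework choice, the 3x3 kernel size, and the 10-class output are assumptions made for this sketch, not specified in the text):

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                   # input layer: 32 x 32 x 3
    layers.Conv2D(12, (3, 3), padding='same'),         # convolution: 12 filters -> 32 x 32 x 12
    layers.Activation('relu'),                         # activation layer, volume unchanged
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),  # pooling: 32 x 32 x 12 -> 16 x 16 x 12
    layers.Flatten(),                                  # flatten for the fully connected layers
    layers.Dense(10, activation='softmax'),            # output layer: class probability scores
])
model.summary()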
Example:
Let’s consider an image and apply the convolution layer, activation layer, and pooling layer operations to extract its features.
Input image: (figure omitted)
Steps (a code sketch implementing them follows this list):
Import the necessary libraries.
Set the parameters.
Define the kernel.
Load the image and plot it.
Reformat the image.
Apply the convolution layer operation and plot the output image.
Apply the activation layer operation and plot the output image.
Apply the pooling layer operation and plot the output image.
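A minimal sketch of the listed steps using OpenCV, NumPy, and Matplotlib is given below; the filename 'input_image.jpg', the particular 3x3 edge-detection kernel, and the 2x2 max-pooling window are assumptions for illustration only.

import cv2
import numpy as np
import matplotlib.pyplot as plt

# set the parameters and define the kernel (a 3x3 edge-detection kernel here)
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

# load the image and plot it
image = cv2.imread('input_image.jpg', cv2.IMREAD_GRAYSCALE)
plt.imshow(image, cmap='gray'); plt.title('Input'); plt.show()

# reformat the image to float values in [0, 1]
image = image.astype(np.float32) / 255.0

# apply the convolution layer operation and plot the output image
conv = cv2.filter2D(image, -1, kernel)
plt.imshow(conv, cmap='gray'); plt.title('Convolution'); plt.show()

# apply the activation layer operation (ReLU) and plot the output image
relu = np.maximum(conv, 0)
plt.imshow(relu, cmap='gray'); plt.title('Activation (ReLU)'); plt.show()

# apply the pooling layer operation (2x2 max pooling) and plot the output image
h, w = relu.shape[0] // 2 * 2, relu.shape[1] // 2 * 2
pooled = relu[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
plt.imshow(pooled, cmap='gray'); plt.title('Max pooling'); plt.show()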
Frequently Asked Questions (FAQs)
1: What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN) is a type of deep learning neural network
that is well-suited for image and video analysis. CNNs use a series of convolution
and pooling layers to extract features from images and videos, and then use these
features to classify or detect objects or scenes.
2: How do CNNs work?
CNNs work by applying a series of convolution and pooling layers to an input image
or video. Convolution layers extract features from the input by sliding a small filter,
or kernel, over the image or video and computing the dot product between the filter
and the input. Pooling layers then downsample the output of the convolution layers
to reduce the dimensionality of the data and make it more computationally efficient.
3: What are some common activation functions used in CNNs?
Some common activation functions used in CNNs include:
Rectified Linear Unit (ReLU): ReLU is a non-saturating activation
function that is computationally efficient and easy to train.
Leaky Rectified Linear Unit (Leaky ReLU): Leaky ReLU is a variant of
ReLU that allows a small amount of negative gradient to flow through the
network. This can help to prevent the network from dying during training.
Parametric Rectified Linear Unit (PReLU): PReLU is a generalization of
Leaky ReLU that allows the slope of the negative gradient to be learned.
4: What is the purpose of using multiple convolution layers in a CNN?
Using multiple convolution layers in a CNN allows the network to learn increasingly
complex features from the input image or video. The first convolution layers learn
simple features, such as edges and corners. The deeper convolution layers learn
more complex features, such as shapes and objects.
5: What are some common regularization techniques used in CNNs?
Regularization techniques are used to prevent CNNs from overfitting the training
data. Some common regularization techniques used in CNNs include:
Dropout: Dropout randomly drops out neurons from the network during
training. This forces the network to learn more robust features that are not
dependent on any single neuron.
L1 regularization: L1 regularization penalizes the absolute values of the weights in the network. This encourages sparse weights, which can reduce the effective number of parameters and make the network more efficient.
L2 regularization: L2 regularization penalizes the squares of the weights in the network. This discourages very large weights and helps the network generalize better. (A small code sketch of dropout and weight regularization follows this list.)
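As an illustration only (the framework, layer sizes, and regularization strengths below are assumptions, not part of the FAQ), a small Keras model combining dropout, L1, and L2 regularization might look like this:

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(12, (3, 3), padding='same', activation='relu',
                  kernel_regularizer=regularizers.l2(1e-4)),   # L2 penalty on the conv weights
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),                                       # randomly drop 50% of activations during training
    layers.Dense(10, activation='softmax',
                 kernel_regularizer=regularizers.l1(1e-5)),    # L1 penalty encourages sparse weights
])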
6: What is the difference between a convolution layer and a pooling layer?
A convolution layer extracts features from an input image or video, while a
pooling layer downsamples the output of the convolution layers. Convolution
layers use a series of filters to extract features, while pooling layers use a variety
of techniques to downsample the data, such as max pooling and average pooling.
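For a concrete feel of the difference, the following NumPy sketch (the 4x4 input values are made up for illustration) applies max pooling and average pooling with a 2x2 window to the same array:

import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [4, 8, 3, 5]], dtype=float)

# reshape into a 2x2 grid of 2x2 windows, then reduce each window
blocks = x.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(blocks.max(axis=(2, 3)))    # max pooling     -> [[6. 4.] [8. 9.]]
print(blocks.mean(axis=(2, 3)))   # average pooling -> [[3.75 2.25] [5.25 4.25]]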
Artificial Intelligence and Machine Learning based Image
Processing
By V Srinivas Durga Prasad, Softnautics
Image processing is the process of converting an image to a digital format and then
performing various operations on it to gather useful information. Artificial Intelligence
(AI) and Machine Learning (ML) have had a huge influence on various fields of technology
in recent years. Computer vision, the ability for computers to understand images and
videos on their own, is one of the top trends in this industry. The popularity of computer
vision is growing like never before and its application is spanning across industries like
automobiles, consumer electronics, retail, manufacturing and many more. Image processing can be done in two ways: analogue image processing, in which physical photographs, printouts, and other hard copies of images are processed, and digital image processing, in which computer algorithms are used to manipulate digital images. The input in both cases is an image. The output of analogue image processing is always an image, whereas the output of digital image processing may be an image or information associated with that image, such as data on features, attributes, and bounding boxes.
According to a report published by Data Bridge Market Research, the image processing systems market is expected to grow at a CAGR of 21.8%, reaching a market value of USD 151,632.6 million by 2029. Image processing is used in a variety of use cases today, including visualisation, pattern recognition, segmentation, image information extraction, classification, and many others.
The initial level begins with image pre-processing which uses a sensor to capture the
image and transform it into a usable format.
Enhancement of image
Image enhancement is the technique of bringing out and emphasising specific interesting
characteristics which are hidden in an image.
Restoration of image
This enables adjustments to image resolution and size, whether for image reduction or
restoration depending on the situation, without lowering image quality below a desirable
level. Lossy and lossless compression techniques are the two main types of image file
compression which are being employed in this stage.
Morphological processing
Digital images are processed depending on their shapes using an image processing
technique known as morphological operations. The operations depend on the relative ordering of pixel values rather than on their numerical values, which makes them well suited to the processing of binary images. They aid in removing imperfections from the structure of the image.
Segmentation, representation, and description
The segmentation process divides a picture into segments, and each segment is
represented and described in such a way that it can be processed further by a computer.
The image's quality and regional characteristics are covered by representation. The
description's job is to extract quantitative data that helps distinguish one class of items
from another.
Recognition of image
A label is given to an object through recognition based on its description. Some of the
often-employed algorithms in the process of recognising images include the Scale-
invariant Feature Transform (SIFT), the Speeded Up Robust Features (SURF), and the
PCA (Principal Component Analysis).
OpenCV
OpenCV is a well-known computer vision library that provides numerous algorithms and utilities to support them. The modules for object detection, machine learning, and image processing are only a few of the many that it includes. With this library, you can perform image processing tasks such as data extraction, restoration, and compression.
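As a small, hedged illustration of such tasks (the filenames, the denoising step used as a stand-in for restoration, and the JPEG quality setting are all assumptions), OpenCV can load an image, denoise it, and write a lossily compressed copy:

import cv2

image = cv2.imread('input.jpg')                     # load a digital image
denoised = cv2.fastNlMeansDenoisingColored(image)   # simple restoration step: remove noise
# compression: write a JPEG at quality 70 (lossy compression)
cv2.imwrite('output.jpg', denoised, [cv2.IMWRITE_JPEG_QUALITY, 70])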
TensorFlow
TensorFlow is an open-source machine learning framework developed by Google. Intended to shorten the time it takes to get from a research prototype to commercial deployment, it includes features such as a tool and library ecosystem, support for popular cloud platforms, a simple transition from development to production, distributed training, and more.
Caffe
Caffe is a deep learning framework originally developed at UC Berkeley, known for its speed and its expressive, configuration-driven model definitions.
Applications
Machine vision
The ability of a computer to comprehend the world is known as machine vision. Digital
signal processing and analogue-to-digital conversion are combined with one or more
video cameras. The image data is transmitted to a robot controller or computer. This
technology aids companies in improving automated processes through automated
analysis. For instance, specialised machine vision image processing methods can
frequently sort parts more efficiently when tactile methods are insufficient for robotic
systems to sort through various shapes and sizes of parts. These methods use very
specific algorithms that consider the parameters of the colours or greyscale values in the
image to accurately define outlines or sizing for an object.
Pattern recognition
The technique of identifying patterns with the aid of a machine learning system is called
pattern recognition. The classification of data generally takes place based on previously
acquired knowledge or statistical data extrapolated from patterns and/or their
representation. Image processing is used in pattern recognition to identify the items in
an image, and machine learning is then used to train the system to recognise changes in
patterns. Pattern recognition is utilised in computer assisted diagnosis, handwriting
recognition, image identification, character recognition etc.
Today, thanks to technological advancements, we can instantly view live CCTV footage
or video feeds from anywhere in the world. This indicates that image transmission and
encoding have both advanced significantly. Progressive image transmission is a
technique of encoding and decoding digital information representing an image in a way
that the image's main features, like outlines, can be presented at low resolution initially
and then refined to greater resolutions. In progressive transmission, encoding an image is the electronic analogue of making multiple scans of the same image at different resolutions. Progressive image decoding produces a preliminary, approximate reconstruction of the image, followed by successively better images whose fidelity is gradually built up from succeeding scan results at the receiver side. Additionally, image compression reduces the amount of data needed to describe a digital image by eliminating redundant data, ensuring that the processed image is suitable for transmission.
Here, the terms "image sharpening" and "restoration" refer to the processes used to
enhance or edit photographs taken with a modern camera to produce desired results.
Zooming, blurring, sharpening, converting from grayscale to colour, identifying edges
and vice versa, image retrieval, and image recognition are included. Recovering lost resolution and reducing noise are the goals of image restoration techniques, which operate either in the frequency domain or in the image (spatial) domain. Deconvolution, which is carried out in the frequency domain, is the simplest and most widely used technique for image restoration.
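A minimal sketch of frequency-domain deconvolution is shown below as naive inverse filtering; the known blur kernel and the small stabilising constant eps are assumptions, and practical systems usually prefer Wiener filtering, which handles noise more gracefully.

import numpy as np

def inverse_filter(blurred, kernel, eps=1e-3):
    # Transform the blurred image and the blur kernel to the frequency domain,
    # divide the spectra, and transform back. eps avoids division by near-zero values.
    H = np.fft.fft2(kernel, s=blurred.shape)
    G = np.fft.fft2(blurred)
    restored = np.real(np.fft.ifft2(G / (H + eps)))
    return restored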
SIFT
SIFT Algorithm: How to Use SIFT for Image Matching in Python
Introduction
SIFT (scale-invariant feature transform) is a cornerstone technique in computer vision and image understanding. The more times you see something, the easier it is for you to recollect it. Also, every time an image pops up in your mind, it relates that item or image to a bunch of related images or things. What if I told you we could teach a machine to do the same? While humans can easily identify objects in images despite variations in angle or scale, machines struggle with this task. However, through machine learning and computer vision, we can teach them to do so, which makes computer vision an exciting field to work in. In this tutorial, we will discuss SIFT, an image-matching algorithm that helps identify key features in images and match these features to a new image of the same object.
Learning Objectives
Understand the SIFT (Scale-Invariant Feature Transform) technique.
Learn how to perform feature matching using the scale-invariant feature transform algorithm.
Implement SIFT feature matching in Python using OpenCV.
What Is SIFT Algorithm?
SIFT (Scale-Invariant Feature Transform) is a computer vision technique used for feature detection and description. It detects distinctive key points or features in an image that are robust to changes in scale, rotation, and illumination. It works by identifying key points based on their local intensity extrema and computing descriptors that capture the local image information around those key points. These descriptors can then be used for tasks like image matching, object recognition, and detection.
Take a look at the below collection of images and think of the common element
between them:
The resplendent Eiffel Tower, of course! The keen-eyed among you will also have noticed that each image has a different background, is captured from different angles, and also has different objects in the foreground (in some cases).
I’m sure all of this took you a fraction of a second to figure out. It doesn’t matter if the image is rotated at a weird angle or zoomed in to show only half of the Tower. This is primarily because you have seen the images of the Eiffel Tower multiple times, and your memory easily recalls its features. We naturally understand that the scale or angle of the image may change, but the object remains the same. Machines, however, struggle to identify the object in an image if we change certain things (like the angle or the scale). Here’s the good news – machines are super flexible, and we can teach them to identify images at an almost human level.
The SIFT algorithm helps locate the local features in an image, commonly known as the ‘keypoints‘ of the image. These keypoints are scale and rotation invariant and can be used for various computer vision applications, like image matching and object detection.
We can also use the keypoints generated using SIFT as features for the image during model training. The major advantage of SIFT features over edge features or HOG features is that they are not affected by the size or orientation of the image.
For example, here is another image of the Eiffel Tower along with its smaller version. The keypoints of the object in the first image are matched with the keypoints found in the second image. The same goes for two images when the object in the other image is slightly rotated.
Now let’s understand the techniques used to ensure scale and rotation invariance. Broadly speaking, the entire process can be divided into four parts: constructing a scale space, keypoint localisation, orientation assignment, and the keypoint descriptor.
This article is based on the original paper by David G. Lowe.
Constructing the Scale Space
We need to identify the most distinct features in a given input image while
ignoring any noise. Additionally, we need to ensure that the features are not
scale-dependent. These are critical concepts, so let’s talk about them one by
one.
Gaussian Blur
For every pixel in an image, the Gaussian Blur calculates a value based on its neighboring pixels. Below is an example of an image before and after applying the Gaussian Blur. As you can see, the texture and minor details are removed from the image, and only the relevant information, like the shape and the edges, remains.
Gaussian Blur has successfully removed the noise from the images and highlighted the important features of the image. Now, we need to ensure that these features are not scale-dependent. This means we will search for these features over multiple scales by creating a ‘scale space’.
Scale space is a collection of images having different scales, generated from a
single image.
Hence, these blurred images are created for multiple scales. To create a new set of images of different scales, we will take the original image and reduce the scale by half. For each new image, we will create blurred versions as we saw above.
For example, consider an original image of size (275, 183) and a scaled image of dimension (138, 92). For both images, several blurred versions are created.
You might be thinking – how many times do we need to scale the image, and how many subsequent blur images need to be created for each scaled image? The ideal number of octaves is four, and for each octave the ideal number of blur images is five.
Difference of Gaussians (DoG)
So far, we have created images at multiple scales and used Gaussian blur on each of them to reduce the noise in the image. Next, we will try to enhance the features using a technique called the Difference of Gaussians or DoG. Difference of Gaussians is a feature enhancement algorithm that involves the subtraction of one blurred version of an original image from another, less blurred version.
DoG creates another set of images, for each octave, by subtracting every image
from the previous image in the same scale. Here is a visual explanation of how
DoG is implemented:
Note: The image is taken from the original paper. The octaves are represented vertically for a clearer view.
Let us create the DoG for the images in scale space. Take a look at the below diagram. On the left, we have five images, all from the first octave (and thus having the same scale). Each subsequent image is created by applying the Gaussian blur over the previous image. On the right, we have four images generated by subtracting consecutive pairs of blurred images. We are implementing it only for the first octave, but the same process happens for all the octaves.
Now that we have a new set of images, we are going to use this to find the
important keypoints.
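To make this concrete, here is a hedged OpenCV/NumPy sketch of one octave of the scale space and its DoG images; the filename, the five sigma values, and the halving step for the next octave are assumptions chosen for illustration.

import cv2
import numpy as np

image = cv2.imread('eiffel_2.jpeg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

# one octave: five progressively blurred versions of the same image
sigmas = [1.0, 1.6, 2.2, 2.8, 3.4]
blurred = [cv2.GaussianBlur(image, (0, 0), s) for s in sigmas]

# Difference of Gaussians: subtract each image from the next, more blurred one
dog = [blurred[i + 1] - blurred[i] for i in range(len(blurred) - 1)]
print(len(dog))   # 4 DoG images for this octave

# the next octave would repeat this on the image downscaled by half, e.g.:
half = cv2.resize(image, (image.shape[1] // 2, image.shape[0] // 2))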
Keypoint Localization
Once the images have been created, the next step is to find the important
keypoints from the image that can be used for feature matching. The idea is to
find the local maxima and minima for the images. This part is divided into two
steps:
1. Find the local maxima and minima
2. Remove low-contrast keypoints (keypoint selection)
To locate the local maxima and minima, we go through every pixel in the image and compare it with its neighboring pixels. When I say ‘neighboring’, this includes not only the surrounding pixels of that image (in which the pixel lies) but also the nine pixels of the previous and the next image in the octave.
This means that every pixel value is compared with 26 other pixel values to find whether it is a local maximum or minimum, called an extremum. For example, in the below diagram, we have three images from the first octave. The pixel marked x is compared with the neighboring pixels (in green) and is selected as a keypoint if it is the highest or lowest among them.
We now have potential keypoints that represent the images and are scale-invariant. We will apply one last check over the selected keypoints to ensure that these are the most accurate keypoints to represent the image.
Keypoint Selection
Kudos! So far, we have successfully generated scale-invariant keypoints. But some of these keypoints may not be robust to noise. This is why we need to perform a final check to make sure that we keep only the most accurate keypoints to represent the image features.
Hence, we will eliminate the keypoints that have low contrast or lie very close to an edge. To deal with the low-contrast keypoints, a second-order Taylor expansion is computed for each keypoint; if the resulting value is less than 0.03 (in magnitude), we reject the keypoint. We then perform a check to identify the poorly located keypoints. These are the keypoints that are close to the edge and have a high edge response but may not be robust to a small amount of noise.
Now that we have performed both the contrast test and the edge test to reject the unstable keypoints, we will assign an orientation value to each keypoint to make it rotation invariant.
Orientation Assignment
At this stage, we have a set of stable keypoints for the images. We will now assign an orientation to each of these keypoints so that they are invariant to rotation. We can again divide this step into two smaller steps: calculating the gradient magnitude and orientation, and creating a histogram of magnitude and orientation.
Let’s say we want to find the magnitude and orientation for the pixel value in red. For this, we will calculate the gradients in the x and y directions by taking the difference between 55 & 46 and 56 & 42. This comes out to be Gx = 9 and Gy = 14, respectively.
Once we have the gradients, we can find the magnitude and orientation using the following formulas:
Magnitude = √(Gx² + Gy²) = √(9² + 14²) ≈ 16.64
Orientation (Φ) = atan(Gy / Gx) = atan(14 / 9) ≈ 57°
The magnitude represents the intensity of the pixel, and the orientation gives the direction of the gradient.
We can now create a histogram, given that we have these magnitude and orientation values for the pixels. On the x-axis we have bins for angle values, like 0–9, 10–19, and so on, up to 360. Since our angle value is 57, it will fall in the 6th bin. The 6th bin value will be in proportion to the magnitude of the pixel, i.e. 16.64. We will do this for all the pixels around the keypoint.
This histogram would peak at some point. The bin at which we see the peak determines the orientation of the keypoint. Additionally, if there is another significant peak (between 80 and 100% of the highest peak), another keypoint is generated with the magnitude and scale the same as the keypoint used to generate the histogram, and the angle or orientation will be equal to the new bin that has the peak.
Effectively at this point, we can say that there can be a small increase in the
number of keypoints.
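The tiny sketch below just reproduces the worked example above in NumPy, using the same Gx = 9 and Gy = 14 values.

import numpy as np

Gx, Gy = 9, 14
magnitude = np.sqrt(Gx**2 + Gy**2)             # ~16.64
orientation = np.degrees(np.arctan2(Gy, Gx))   # ~57 degrees, so it falls in the 6th 10-degree bin
print(round(magnitude, 2), round(orientation, 2))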
Keypoint Descriptor
This is the final step for SIFT (scale-invariant feature transform). So far, we have stable keypoints that are scale-invariant and rotation-invariant. In this section, we will use the neighboring pixels, their orientations, and their magnitudes to generate a unique fingerprint for each keypoint, called a ‘descriptor’. Additionally, since we use the surrounding pixels, the descriptors will be partially invariant to the illumination or brightness of the images.
We will first take a 16×16 neighborhood around the keypoint. This 16×16 block is further divided into 4×4 sub-blocks, and for each of these sub-blocks we generate an 8-bin orientation histogram from the magnitudes and orientations. Each of these arrows represents one of the 8 bins, and the length of the arrows defines the magnitude. So, we will have a total of 128 bin values (4 × 4 × 8) for every keypoint.
Let’s now implement SIFT in Python using OpenCV:
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read the image and convert it to grayscale
img1 = cv2.imread('eiffel_2.jpeg')
gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)

# create the SIFT detector and compute keypoints and descriptors
# (on older OpenCV builds this is cv2.xfeatures2d.SIFT_create())
sift = cv2.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(gray1, None)

# draw the keypoints on the image and display it
img_1 = cv2.drawKeypoints(gray1, keypoints_1, img1)
plt.imshow(img_1)
Feature Matching
We will now use the SIFT computer vision features for feature matching. For this
purpose, I have downloaded two images of the Eiffel Tower taken from different
positions. You can try it with any two images that you want.
# read images and convert them to grayscale
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# display the two images side by side
fig, ax = plt.subplots(1, 2, figsize=(16, 8))
ax[0].imshow(img1, cmap='gray')
ax[1].imshow(img2, cmap='gray')
Now, for both of these images, we are going to generate the SIFT features. First, we have to construct a SIFT object using cv2.SIFT_create(), and then use the function detectAndCompute to get the keypoints. It will return two values – the keypoints and the SIFT descriptors.
Let’s determine the keypoints and print the total number of keypoints found in
each image:
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

# sift: detect keypoints and compute descriptors for both images
sift = cv2.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

len(keypoints_1), len(keypoints_2)
283, 540
Next, let’s try and match the features from image 1 with features from image 2.
We will be using the function match() from the BFmatcher (brute force match)
module. Also, we will draw lines between the features that match both images.
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

# sift: detect keypoints and compute descriptors for both images
sift = cv2.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

# feature matching with a brute-force matcher, sorted by distance (best matches first)
bf = cv2.BFMatcher(cv2.NORM_L1, crossCheck=True)
matches = bf.match(descriptors_1, descriptors_2)
matches = sorted(matches, key=lambda x: x.distance)

# draw the 50 best matches between the two images
img3 = cv2.drawMatches(img1, keypoints_1, img2, keypoints_2, matches[:50], None, flags=2)
plt.imshow(img3)
plt.show()
I have drawn only the first 50 matches here for clarity; you can change that number according to what you prefer. To find out how many keypoints are matched, we can print the length of the variable matches.
Conclusion
In this section, we discussed the SIFT feature-matching algorithm in detail. There is also a site that provides an excellent visualization of each step of SIFT: you can add your own image, and it will create the keypoints for that image as well. Another popular feature-detection algorithm is SURF (Speeded Up Robust Features), which is simply a faster version of SIFT; I would encourage you to explore it as well. And if you’re new to the world of computer vision and image data, I recommend starting with the fundamentals of image processing before diving deeper.
Key Takeaways
SIFT (Scale-Invariant Feature Transform) is a powerful technique for image matching that can identify and match features in images that are invariant to scale, rotation, and illumination changes.
The SIFT technique involves generating a scale space of images with different scales and then using the Difference of Gaussians (DoG) method to identify scale-invariant keypoints.
It also involves computing descriptors for each keypoint, which can be used for image matching and object recognition tasks.
It can be implemented using Python and the OpenCV library, which provides a simple way to detect, compute, and match SIFT features.
SURF (Speeded Up Robust Features)
Source: Wikipedia
The sum of the original image within a rectangle can be evaluated quickly using
the integral image, requiring evaluations at the rectangle's four corners.
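A brief NumPy sketch of this box-sum idea follows (illustrative only; SURF itself computes the integral image once and reuses it for its box filters):

import numpy as np

def integral_image(img):
    # cumulative sums along both axes, padded with a leading row/column of zeros
    ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, r0, c0, r1, c1):
    # sum of img[r0:r1, c0:c1] from just four lookups at the rectangle's corners
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())   # both print 30.0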
SURF uses a blob detector based on the Hessian matrix to find points of interest.
The determinant of the Hessian matrix is used as a measure of local change
around the point and points are chosen where this determinant is maximal. In
contrast to the Hessian-Laplacian detector by Mikolajczyk and Schmid, SURF
also uses the determinant of the Hessian for selecting the scale, as is also done
by Lindeberg. Given a point p = (x, y) in an image I, the Hessian matrix H(p, σ) at point p and scale σ is:

H(p, σ) = | Lxx(p, σ)  Lxy(p, σ) |
          | Lxy(p, σ)  Lyy(p, σ) |

where Lxx(p, σ) is the convolution of the second-order derivative of the Gaussian, ∂²g(σ)/∂x², with the image I at point p, and similarly for Lxy(p, σ) and Lyy(p, σ).
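For intuition only, the sketch below approximates this determinant-of-Hessian response with Gaussian smoothing and Sobel second derivatives; the filename, sigma, and kernel size are assumptions, and real SURF instead uses box-filter approximations evaluated on the integral image.

import cv2
import numpy as np

img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

sigma = 2.0
smoothed = cv2.GaussianBlur(img, (0, 0), sigma)

# second-order derivatives of the smoothed image
Lxx = cv2.Sobel(smoothed, cv2.CV_32F, 2, 0, ksize=5)
Lyy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 2, ksize=5)
Lxy = cv2.Sobel(smoothed, cv2.CV_32F, 1, 1, ksize=5)

# determinant of the Hessian: large values indicate blob-like points of interest
det_hessian = Lxx * Lyy - Lxy ** 2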
Exploring Gradient Location Orientation Histogram (GLOH) for Image Recognition and Object Detection
By Vincent Chung, Apr 9, 2023
Advantages of GLOH:
Applications of GLOH:
import cv2

# Load an image in grayscale
image = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

# Compute GLOH descriptors for the image
# (OpenCV does not ship a built-in GLOH implementation, so this step would rely on a
# custom or third-party implementation; the descriptor computation is elided here.)
# ...
# Use the GLOH features for image recognition or object detection tasks
# ...