Project Exhibition 2

The document discusses Convolutional Neural Networks (CNNs), which are a type of neural network commonly used for computer vision tasks like image classification. It describes the basic architecture of CNNs, including convolutional layers that extract features, pooling layers that downsample images, and fully connected layers for classification. An example demonstrates applying convolution, activation and pooling layers to an input image to extract features.


Project - II

What is a CNN?
Source: GeeksforGeeks
A Convolutional Neural Network (CNN) is a type of Deep Learning neural
network architecture commonly used in Computer Vision. Computer vision is a field
of Artificial Intelligence that enables a computer to understand and interpret the
image or visual data.
When it comes to Machine Learning, Artificial Neural Networks perform really
well. Neural networks are used on various kinds of data, such as images, audio, and text.
Different types of neural networks are used for different purposes: for example, for
predicting a sequence of words we use a Recurrent Neural Network (more
precisely an LSTM), and for image classification we use a Convolutional Neural
Network. In this section, we build up the basic building blocks of a CNN.
In a regular Neural Network there are three types of layers:
1. Input Layers: It’s the layer in which we give input to our model. The
number of neurons in this layer is equal to the total number of features in
our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the Input layer is then fed into the hidden
layer. There can be many hidden layers depending on our model and data
size. Each hidden layer can have a different number of neurons, which is
generally greater than the number of features. The output of each layer
is computed by multiplying the output of the previous layer
with that layer's learnable weights, adding learnable
biases, and then applying an activation function, which makes the network
nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic
function like sigmoid or softmax which converts the output of each class
into the probability score of each class.
Feeding the data through the model and obtaining the output of each layer as described
above is called the feedforward pass. We then calculate the error using an error function;
common error functions are cross-entropy, squared error, etc. The error function
measures how well the network is performing. After that, we backpropagate through the
model by calculating derivatives. This step is called backpropagation, and it is
used to minimize the loss.
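To make the feedforward and backpropagation steps concrete, here is a minimal sketch in PyTorch. The layer sizes, learning rate, and random data are illustrative assumptions, not part of the original article.

import torch
import torch.nn as nn

# a tiny fully connected network: input -> hidden -> output
model = nn.Sequential(
    nn.Linear(784, 128),   # input layer for e.g. 28x28 = 784 pixels (assumed size)
    nn.ReLU(),             # activation function makes the network nonlinear
    nn.Linear(128, 10),    # output layer for 10 classes (assumed)
)
loss_fn = nn.CrossEntropyLoss()                            # error function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent

x = torch.randn(32, 784)           # a batch of 32 fake "images" (random data)
y = torch.randint(0, 10, (32,))    # fake class labels

logits = model(x)          # feedforward: pass the data through every layer
loss = loss_fn(logits, y)  # measure how well the network is performing
loss.backward()            # backpropagation: compute derivatives of the loss
optimizer.step()           # update the weights to minimize the loss
optimizer.zero_grad()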
Convolutional Neural Network
A Convolutional Neural Network (CNN) is an extended version of the artificial neural
network (ANN) that is predominantly used to extract features from grid-like
data, for example visual datasets like images or videos, where spatial data
patterns play an extensive role.
CNN architecture
Convolutional Neural Network consists of multiple layers like the input layer,
Convolutional layer, Pooling layer, and fully connected layers.

Simple CNN architecture

The Convolutional layer applies filters to the input image to extract features, the
Pooling layer downsamples the image to reduce computation, and the fully
connected layer makes the final prediction. The network learns the optimal filters
through backpropagation and gradient descent.
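As a rough illustration of this pipeline, the sketch below stacks a convolutional layer, a pooling layer, and a fully connected layer in PyTorch. The filter counts and layer sizes are assumptions chosen to match the 32 x 32 x 3 example used later, not the article's own code.

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, padding=1),  # convolutional layer: 12 learnable filters
    nn.ReLU(),                                   # activation layer
    nn.MaxPool2d(kernel_size=2, stride=2),       # pooling layer: downsample by 2
    nn.Flatten(),                                # flatten the feature maps
    nn.Linear(12 * 16 * 16, 10),                 # fully connected layer -> 10 classes (assumed)
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(cnn(x).shape)             # torch.Size([1, 10])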
How Convolutional Layers Work
Convolutional Neural Networks, or convnets, are neural networks that share their
parameters. Imagine you have an image. It can be represented as a cuboid having a
width and height (the spatial dimensions of the image) and a depth (the channels, as images
generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network,
called a filter or kernel, on it, with say, K outputs, and representing them vertically.
Now slide that neural network across the whole image; as a result, we get
another image with a different width, height, and depth. Instead of just the R, G, and B
channels, we now have more channels but a smaller width and height. This operation is
called convolution. If the patch size is the same as that of the image, it is a
regular neural network. Because of this small patch, we have fewer weights.
Image source: Deep Learning Udacity

Now let’s talk about a bit of mathematics that is involved in the whole convolution
process.
 Convolution layers consist of a set of learnable filters (or kernels) having
small widths and heights and the same depth as that of input volume (3 if
the input layer is image input).
 For example, suppose we run a convolution on an image of dimensions
34x34x3. The possible filter sizes are a×a×3, where ‘a’ can be
anything like 3, 5, or 7, but smaller than the image dimension.
 During the forward pass, we slide each filter across the whole input
volume step by step where each step is called stride (which can have a
value of 2, 3, or even 4 for high-dimensional images) and compute the dot
product between the kernel weights and patch from input volume.
 As we slide our filters we get a 2-D output for each filter, and stacking
them together gives an output volume with a depth equal to
the number of filters. The network learns all the filters. (A short sketch of
the output-size arithmetic follows this list.)
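The output-size arithmetic implied by these bullets can be written down directly. The sketch below assumes a 'valid' convolution with no padding; the 34x34x3 input and the filter sizes are the illustrative values from the bullets.

def conv_output_shape(in_size, filter_size, stride, num_filters):
    """Spatial size of a 'valid' convolution: (W - F) // S + 1; depth = number of filters."""
    out = (in_size - filter_size) // stride + 1
    return out, out, num_filters

print(conv_output_shape(34, 3, 1, 12))   # (32, 32, 12)
print(conv_output_shape(34, 5, 1, 12))   # (30, 30, 12)
print(conv_output_shape(34, 7, 2, 12))   # (14, 14, 12)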
Layers used to build ConvNets
A complete Convolutional Neural Network architecture is also known as a convnet. A
convnet is a sequence of layers, and every layer transforms one volume into another
through a differentiable function.
Types of layers:
Let’s take an example by running a convnet on an image of dimension 32 x 32 x 3.
 Input Layers: It’s the layer in which we give input to our model. In CNN,
Generally, the input will be an image or a sequence of images. This layer
holds the raw input of the image with width 32, height 32, and depth 3.
 Convolutional Layers: This is the layer used to extract
features from the input. It applies a set of learnable filters, known as
kernels, to the input images. The filters/kernels are small matrices,
usually of 2×2, 3×3, or 5×5 shape. Each kernel slides over the input image data and
computes the dot product between the kernel weights and the corresponding
input image patch. The output of this layer is referred to as feature maps.
Suppose we use a total of 12 filters for this layer; we then get an output
volume of dimension 32 x 32 x 12.
 Activation Layer: By adding an activation function to the output of the
preceding layer, activation layers add nonlinearity to the network. They
apply an element-wise activation function to the output of the convolution
layer. Some common activation functions are ReLU: max(0, x),
Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output
volume has dimensions 32 x 32 x 12.
 Pooling layer: This layer is periodically inserted in the convnet, and its
main function is to reduce the size of the volume, which makes the
computation faster, reduces memory usage, and also helps prevent overfitting. Two
common types of pooling layers are max pooling and average pooling. If
we use a max pool with 2 x 2 filters and stride 2, the resultant volume will
be of dimension 16x16x12.

Image source: cs231n.stanford.edu

 Flattening: The resulting feature maps are flattened into a one-
dimensional vector after the convolution and pooling layers so they can be
passed into a fully connected layer for classification or regression.
 Fully Connected Layers: It takes the input from the previous layer and
computes the final classification or regression output.
Image source: cs231n.stanford.edu

 Output Layer: The output from the fully connected layers is then fed into
a logistic function, such as sigmoid or softmax for classification tasks, which
converts the output for each class into a probability score for that class.
Example:
Let’s consider an image and apply the convolution layer, activation layer, and
pooling layer operations to extract its internal features.
Input image:

Steps (a hedged code sketch implementing them follows this list):
 Import the necessary libraries.
 Set the parameters.
 Define the kernel.
 Load the image and plot it.
 Reformat the image.
 Apply the convolution layer operation and plot the output image.
 Apply the activation layer operation and plot the output image.
 Apply the pooling layer operation and plot the output image.
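Below is a hedged sketch of these steps using OpenCV, NumPy, and matplotlib. The file name, kernel values, and pooling size are illustrative assumptions rather than the article's exact code.

import cv2
import numpy as np
import matplotlib.pyplot as plt

# set the parameters and define the kernel (a 3x3 edge-detection filter, assumed)
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=np.float32)

# load the image (hypothetical file name) and reformat it to grayscale float32
img = cv2.imread('input_image.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

# convolution layer operation: slide the kernel over the image
conv_out = cv2.filter2D(img, -1, kernel)

# activation layer operation: element-wise ReLU, max(0, x)
relu_out = np.maximum(conv_out, 0)

# pooling layer operation: 2x2 max pooling with stride 2
h, w = relu_out.shape
h, w = h - h % 2, w - w % 2                     # crop to an even size
pool_out = relu_out[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# plot the input and the intermediate outputs
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (title, im) in zip(axes, [('input', img), ('convolution', conv_out),
                                  ('ReLU', relu_out), ('max pooling', pool_out)]):
    ax.imshow(im, cmap='gray')
    ax.set_title(title)
    ax.axis('off')
plt.show()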
Frequently Asked Questions (FAQs)
1: What is a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN) is a type of deep learning neural network
that is well-suited for image and video analysis. CNNs use a series of convolution
and pooling layers to extract features from images and videos, and then use these
features to classify or detect objects or scenes.
2: How do CNNs work?
CNNs work by applying a series of convolution and pooling layers to an input image
or video. Convolution layers extract features from the input by sliding a small filter,
or kernel, over the image or video and computing the dot product between the filter
and the input. Pooling layers then downsample the output of the convolution layers
to reduce the dimensionality of the data and make it more computationally efficient.
3: What are some common activation functions used in CNNs?
Some common activation functions used in CNNs include:
 Rectified Linear Unit (ReLU): ReLU is a non-saturating activation
function that is computationally efficient and easy to train.
 Leaky Rectified Linear Unit (Leaky ReLU): Leaky ReLU is a variant of
ReLU that allows a small amount of negative gradient to flow through the
network. This can help to prevent the network from dying during training.
 Parametric Rectified Linear Unit (PReLU): PReLU is a generalization of
Leaky ReLU that allows the slope of the negative gradient to be learned.
4: What is the purpose of using multiple convolution layers in a CNN?
Using multiple convolution layers in a CNN allows the network to learn increasingly
complex features from the input image or video. The first convolution layers learn
simple features, such as edges and corners. The deeper convolution layers learn
more complex features, such as shapes and objects.
5: What are some common regularization techniques used in CNNs?
Regularization techniques are used to prevent CNNs from overfitting the training
data. Some common regularization techniques used in CNNs include:
 Dropout: Dropout randomly drops out neurons from the network during
training. This forces the network to learn more robust features that are not
dependent on any single neuron.
 L1 regularization: L1 regularization penalizes the absolute value of the
weights in the network. This can drive some weights to zero, making the
network sparser and more efficient.
 L2 regularization: L2 regularization penalizes the square of the weights
in the network. This discourages large weights and can help the network
generalize better. (A short sketch of dropout and L2 weight decay follows.)
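As a brief illustration, the sketch below shows how dropout and L2 regularization (weight decay) are typically added to a network in PyTorch; the layer sizes, dropout rate, and weight-decay value are assumptions.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout: randomly zeroes 50% of activations during training
    nn.Linear(256, 10),
)

# L2 regularization is commonly applied through the optimizer's weight_decay term
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)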
6: What is the difference between a convolution layer and a pooling layer?
A convolution layer extracts features from an input image or video, while a
pooling layer downsamples the output of the convolution layers. Convolution
layers use a set of filters to extract features, while pooling layers use a variety
of techniques to downsample the data, such as max pooling and average pooling.
Artificial Intelligence and Machine Learning based Image
Processing
By V Srinivas Durga Prasad, Softnautics

Image processing is the process of converting an image to a digital format and then
performing various operations on it to gather useful information. Artificial Intelligence
(AI) and Machine Learning (ML) have had a huge influence on various fields of technology
in recent years. Computer vision, the ability of computers to understand images and
videos on their own, is one of the top trends in this industry. The popularity of computer
vision is growing like never before, and its applications span industries like
automobiles, consumer electronics, retail, manufacturing and many more. Image
processing can be done in two ways: analogue image processing, in which physical
photographs, printouts, and other hard copies of images are processed, and digital image
processing, which uses computer algorithms to manipulate digital images. The input in
both cases is an image. The output of analogue image processing is always an image.
However, the output of digital image processing may be an image or information
associated with that image, such as data on features, attributes, and bounding boxes.
According to a report published by Data Bridge Market Research, the image
processing systems market is expected to grow at a CAGR of 21.8%, reaching a market
value of USD 151,632.6 million by 2029. Image processing is used in a variety of use
cases today, including visualisation, pattern recognition, segmentation, image
information extraction, classification, and many others.

Image processing working mechanism


Artificial intelligence and Machine Learning algorithms usually use a workflow to
learn from data. Consider a generic model of a working algorithm for an Image
Processing use case. To start, AI algorithms require a large amount of high-quality data
to learn and predict highly accurate results. As a result, we must ensure that the images
are well-processed, annotated, and generic for AI/ML image processing. This is where
computer vision (CV) comes in; it is a field concerned with machines understanding
image data. We can use CV to process, load, transform, and manipulate images to
create an ideal dataset for the AI algorithm.

Let’s understand the workflow of a basic image processing system


An Overview of Image Processing System
Acquisition of image

The initial level begins with image pre-processing which uses a sensor to capture the
image and transform it into a usable format.

Enhancement of image

Image enhancement is the technique of bringing out and emphasising specific interesting
characteristics which are hidden in an image.

Restoration of image

Image restoration is the process of enhancing an image's look. Picture restoration, as


opposed to image augmentation, is carried out utilising specific mathematical or
probabilistic models.

Colour image processing

A variety of digital colour modelling approaches such as HSI (Hue-Saturation-Intensity),
CMY (Cyan-Magenta-Yellow) and RGB (Red-Green-Blue) are used in colour image
processing.

Compression and decompression of image

This enables adjustments to image resolution and size, whether for image reduction or
restoration depending on the situation, without lowering image quality below a desirable
level. Lossy and lossless compression techniques are the two main types of image file
compression which are being employed in this stage.

Morphological processing

Digital images are processed depending on their shapes using an image processing
technique known as morphological operations. The operations depend on the pixel values
rather than their numerical values, and well suited for the processing of binary images.
It aids in removing imperfections for structure of the image.
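For illustration, here is a short OpenCV sketch of a morphological opening and closing on a thresholded (binary) image; the file name and kernel size are assumptions.

import cv2
import numpy as np

img = cv2.imread('document.png', cv2.IMREAD_GRAYSCALE)          # hypothetical file
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

kernel = np.ones((3, 3), np.uint8)
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # erosion then dilation: removes small specks
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # dilation then erosion: fills small holes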

Segmentation, representation and description

The segmentation process divides a picture into segments, and each segment is
represented and described in such a way that it can be processed further by a computer.
The image's quality and regional characteristics are covered by representation. The
description's job is to extract quantitative data that helps distinguish one class of items
from another.

Recognition of image

A label is given to an object through recognition based on its description. Some of the
often-employed algorithms in the process of recognising images include the Scale-
invariant Feature Transform (SIFT), the Speeded Up Robust Features (SURF), and the
PCA (Principal Component Analysis).

Frameworks for AI image processing


 OpenCV

OpenCV is a well-known computer vision library that provides numerous algorithms and
utilities to support them. The modules for object detection, machine learning,
and image processing are only a few of the many that it includes. With the help of this
library, you can perform image processing tasks like data extraction, restoration, and
compression.

 TensorFlow

TensorFlow, created by Google, is one of the most well-known end-to-end machine
learning frameworks for tackling the challenges of building and training a
neural network to automatically locate and categorise images at a level approaching human
perception. It offers features such as execution on multiple parallel processors, cross-
platform support, GPU configuration, and support for a wide range of neural network algorithms.
 PyTorch

Intended to shorten the time it takes to get from a research prototype to commercial
development, PyTorch includes features like a tool and library ecosystem, support for popular
cloud platforms, a simple transition from development to production, distributed
training, etc.

 Caffe

Caffe is a deep learning framework intended for image classification and segmentation. It
has features like simple CPU and GPU switching, optimised model definition and
configuration, and computation using blobs.

Applications
 Machine vision

The ability of a computer to comprehend the world is known as machine vision. Digital
signal processing and analogue-to-digital conversion are combined with one or more
video cameras. The image data is transmitted to a robot controller or computer. This
technology aids companies in improving automated processes through automated
analysis. For instance, specialised machine vision image processing methods can
frequently sort parts more efficiently when tactile methods are insufficient for robotic
systems to sort through various shapes and sizes of parts. These methods use very
specific algorithms that consider the parameters of the colours or greyscale values in the
image to accurately define outlines or sizing for an object.

 Pattern recognition

The technique of identifying patterns with the aid of a machine learning system is called
pattern recognition. The classification of data generally takes place based on previously
acquired knowledge or statistical data extrapolated from patterns and/or their
representation. Image processing is used in pattern recognition to identify the items in
an image, and machine learning is then used to train the system to recognise changes in
patterns. Pattern recognition is utilised in computer assisted diagnosis, handwriting
recognition, image identification, character recognition etc.

 Digital video processing


A video is nothing more than just a series of images that move quickly. The number of
frames or photos per minute and the calibre of each frame employed determine the
video's quality. Noise reduction, detail improvement, motion detection, frame rate
conversion, aspect ratio conversion, colour space conversion, etc. are all aspects of video
processing. Televisions, VCRs, DVD players, video codecs, and other devices all
use video processing techniques.
 Transmission and encoding

Today, thanks to technological advancements, we can instantly view live CCTV footage
or video feeds from anywhere in the world. This indicates that image transmission and
encoding have both advanced significantly. Progressive image transmission is a
technique of encoding and decoding digital information representing an image in a way
that the image's main features, like outlines, can be presented at low resolution initially
and then refined to greater resolutions. An image is encoded by an electronic analogue
to multiple scans of the exact image at different resolutions in progressive transmission.
Progressive image decoding results in a preliminary approximate reconstruction of the
image, followed by successively better images whose adherence is gradually built up
from succeeding scan results at the receiver side. Additionally, image compression
reduces the amount of data needed to describe a digital image by eliminating extra data,
ensuring that the image processing is finished and that it is suitable for transmission.

 Image sharpening and restoration

Here, the terms "image sharpening" and "restoration" refer to the processes used to
enhance or edit photographs taken with a modern camera to produce desired results.
Zooming, blurring, sharpening, converting from grayscale to colour, identifying edges
and vice versa, image retrieval, and image recognition are included. Recovering lost
resolution and reducing noise are the goals of picture restoration techniques. Either the
frequency domain or the image domain is used for image processing techniques.
Deconvolution, which is carried out in the frequency domain, is the easiest and most
used technique for image restoration.

Image processing can be employed to enhance an image's quality, remove unwanted
artefacts from an image, or even create new images completely from scratch. Nowadays,
image processing is one of the fastest-growing technologies, and it has a huge potential
for future wide adoption in areas such as video and 3D graphics, statistical image
processing, recognising and tracking people and things, diagnosing medical conditions,
PCB inspection, robotic guidance and control, and automatic driving in all modes of
transportation.

At Softnautics, we help industries design vision-based AI solutions such as image
classification & tagging, visual content analysis, object tracking, identification, anomaly
detection, face detection and pattern recognition. Our team of experts has experience
in developing vision solutions based on Optical Character Recognition, NLP, Text
Analytics, Cognitive Computing, etc. involving various FPGA platforms.

Author: V Srinivas Durga Prasad


Srinivas is a Marketing professional at Softnautics working on techno-commercial write-
ups, marketing research and trend analysis. He is a marketing enthusiast with 7+ years
of experience belonging to diversified industries. He loves to travel and is fond of
adventures.

SIFT
SIFT Algorithm: How to Use SIFT for Image Matching in Python

By Aishwarya Singh, 20 Feb 2024

Introduction

Humans identify objects, people, and images through memory and understanding. The more

number of times you see something, the easier it is for you to recollect it. Also,

every time an image pops up in your mind, it relates that item or image to a

bunch of related images or things. What if I told you we could teach a machine to

do the same using a technique called the SIFT algorithm?

While humans can easily identify objects in images despite variations in angle or

scale, machines struggle with this task. However, through machine learning, we

can train machines to identify images at an almost human level, making

computer vision an exciting field to work in. In this tutorial, we will discuss SIFT –

an image-matching algorithm in data science that uses machine learning to

identify key features in images and match these features to a new image of the

same object.
Learning Objectives

 A beginner-friendly introduction to the powerful SIFT (Scale Invariant Feature

Transform) technique.

 Learn how to perform Feature Matching using the scale invariant feature

transform algorithm.

 Try hands-on coding of the SIFT(scale invariant feature transform) algorithm in

Python.

Table of contents
What Is SIFT Algorithm?

The SIFT (Scale-Invariant Feature Transform) algorithm is a computer vision

technique used for feature detection and description. It detects distinctive key

points or features in an image that are robust to changes in scale, rotation, and

affine transformations. SIFT(scale invariant feature transform) works by

identifying key points based on their local intensity extrema and computing

descriptors that capture the local image information around those key points.

These descriptors can then be used for tasks like image matching, object

recognition, and image retrieval.

Take a look at the below collection of images and think of the common element

between them:
The resplendent Eiffel Tower, of course! The keen-eyed among you will also

have noticed that each image has a different background, is captured from

different angles, and also has different objects in the foreground (in some cases).

I’m sure all of this took you a fraction of a second to figure out. It doesn’t matter if

the image is rotated at a weird angle or zoomed in to show only half of the

Tower. This is primarily because you have seen the images of the Eiffel Tower

multiple times, and your memory easily recalls its features. We naturally

understand that the scale or angle of the image may change, but the object

remains the same.


But machines have an almighty struggle with the same idea. It’s a challenge for

them to identify the object in an image if we change certain things (like the angle

or the scale). Here’s the good news – machines are super flexible, and we can

teach them to identify images at an almost human level.

This is one of the most exciting aspects of working in computer vision!

SIFT, or Scale-Invariant Feature Transform, is a feature

detection algorithm in computer vision.

SIFT algorithm helps locate the local features in an image, commonly known as

the ‘keypoints‘ of the image. These keypoints are scale & rotation invariants that

can be used for various computer vision applications, like image matching, object

detection, scene detection, etc.

We can also use the keypoints generated using SIFT as features

for the image during model training. The major advantage of SIFT features

over edge features or HOG features is that they are not affected by the size

or orientation of the image.

For example, here is another image of the Eiffel Tower along with its smaller

version. The keypoints of the object in the first image are matched with the

keypoints found in the second image. The same goes for two images when the

object in the other image is slightly rotated. Amazing, right?


Let’s understand how these keypoints are identified and what the techniques

used to ensure the scale and rotation invariance are. Broadly speaking, the

entire process can be divided into 4 parts:

 Constructing a Scale Space: To make sure that features are scale-independent

 Keypoint Localisation: Identifying the suitable features or keypoints

 Orientation Assignment: Ensure the keypoints are rotation invariant

 Keypoint Descriptor: Assign a unique fingerprint to each keypoint

Finally, we can use these keypoints for feature matching!

This article is based on the original paper by David G. Lowe. Here is the

link: Distinctive Image Features from Scale-Invariant Keypoints .

Constructing the Scale Space

We need to identify the most distinct features in a given input image while

ignoring any noise. Additionally, we need to ensure that the features are not
scale-dependent. These are critical concepts, so let’s talk about them one by

one.

Gaussian Blur

We use the Gaussian Blurring technique to reduce the noise in an image.

For every pixel in an image, the Gaussian Blur calculates a value based on its

neighboring pixels with a certain sigma value. Below is an example of an image

before and after applying the Gaussian Blur. As you can see, the texture and

minor details are removed from the image, and only the relevant information, like

the shape and edges, remain:

Gaussian Blur helped in image processing and successfully removed the noise

from the images, highlighting the important features of the image.

Now, we need to ensure that these features are scale-independent. This means

we will be searching for these features on multiple scales by creating a ‘scale

space’.
Scale space is a collection of images having different scales, generated from a

single image.

Hence, these blur images are created for multiple scales. To create a new set of

images of different scales, we will take the original image and reduce the scale

by half. For each new image, we will create blur versions as we saw above.

Here is an example to understand it in a better manner. We have the original

image of size (275, 183) and a scaled image of dimension (138, 92). For both

images, two blur images are created:

You might be thinking – how many times do we need to scale the image, and

how many subsequent blur images need to be created for each scaled

image? The ideal number of octaves should be four, and for each octave, the

number of blur images should be five.
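A rough sketch of this scale-space construction is shown below, using OpenCV's GaussianBlur. The sigma value and its progression are assumptions for illustration; the 4 octaves and 5 blur levels follow the numbers quoted above, and eiffel_2.jpeg is the image used later in this tutorial.

import cv2

def build_scale_space(image, num_octaves=4, blurs_per_octave=5, sigma=1.6):
    octaves = []
    current = image
    for _ in range(num_octaves):
        blurred = []
        s = sigma
        for _ in range(blurs_per_octave):
            blurred.append(cv2.GaussianBlur(current, (0, 0), s))  # kernel size derived from sigma
            s *= 2 ** 0.5                                         # increase blur between levels (assumed step)
        octaves.append(blurred)
        # halve the image size for the next octave
        current = cv2.resize(current, (current.shape[1] // 2, current.shape[0] // 2))
    return octaves

img = cv2.imread('eiffel_2.jpeg', cv2.IMREAD_GRAYSCALE)
scale_space = build_scale_space(img)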


Difference of Gaussian

So far, we have created images of multiple scales (often represented by σ) and

used Gaussian blur for each of them to reduce the noise in the image. Next, we

will try to enhance the features using a technique called the Difference of

Gaussians or DoG.

Difference of Gaussian is a feature enhancement algorithm that involves the

subtraction of one blurred version of an original image from another, less blurred

version of the original.

DoG creates another set of images, for each octave, by subtracting every image

from the previous image in the same scale. Here is a visual explanation of how

DoG is implemented:
Note: The image is taken from the original paper. The octaves are now

represented in a vertical form for a clearer view.

Let us create the DoG for the images in scale space. Take a look at the below

diagram. On the left, we have 5 images, all from the first octave (thus having the

same scale). Each subsequent image is created by applying the Gaussian blur

over the previous image.

On the right, we have four images generated by subtracting the consecutive

Gaussians. The results are jaw-dropping!


We have enhanced features for each of these images. Note that here I am

implementing it only for the first octave, but the same process happens for all the

octaves.

Now that we have a new set of images, we are going to use this to find the

important keypoints.
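Continuing the scale-space sketch above, the Difference of Gaussians for one octave can be computed by subtracting each blurred image from the next, more blurred one (done in float to avoid clipping).

import numpy as np

def difference_of_gaussians(octave_images):
    imgs = [im.astype(np.float32) for im in octave_images]
    return [imgs[i + 1] - imgs[i] for i in range(len(imgs) - 1)]

dog_first_octave = difference_of_gaussians(scale_space[0])  # 5 blur levels -> 4 DoG images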

Keypoint Localization

Once the images have been created, the next step is to find the important

keypoints from the image that can be used for feature matching. The idea is to

find the local maxima and minima for the images. This part is divided into two

steps:
1. Find the local maxima and minima

2. Remove low contrast keypoints (keypoint selection)

Local Maxima and Local Minima

To locate the local maxima and minima, we go through every pixel in the image

and compare it with its neighboring pixels.

When I say ‘neighboring’, this includes not only the eight surrounding pixels in that

image (the one in which the pixel lies) but also the nine pixels each in the previous and next

images in the octave.

This means that every pixel value is compared with 26 other pixel values to find

whether it is the local maxima/minima called extrema. For example, in the below

diagram, we have three images from the first octave. The pixel marked x is

compared with the neighboring pixels (in green) and is selected as a keypoint or

interest point if it is the highest or lowest among the neighbors:

We now have potential keypoints that represent the images and are scale-

invariant. We will apply the last check over the selected keypoints to ensure that

these are the most accurate keypoints to represent the image.

Keypoint Selection
Kudos! So far, we have successfully generated scale-invariant keypoints. But

some of these keypoints may not be robust to noise. This is why we need to

perform a final check to make sure that we have the most accurate keypoints to

represent the image features.

Hence, we will eliminate the keypoints that have low contrast or lie very

close to the edge.

To deal with the low contrast keypoints, a second-order Taylor expansion is

computed for each keypoint. If the resulting value is less than 0.03 (in

magnitude), we reject the keypoint.

So what do we do about the remaining keypoints? Well, we perform a check to

identify the poorly located keypoints. These are the keypoints that are close to

the edge and have a high edge response but may not be robust to a small

amount of noise. A second-order Hessian matrix is used to identify such

keypoints. You can go through the math behind this here.

Now that we have performed both the contrast test and the edge test to reject the

unstable keypoints, we will now assign an orientation value to each keypoint to

make it rotation invariant.

Orientation Assignment

At this stage, we have a set of stable keypoints for the images. We will now

assign an orientation to each of these keypoints so that they are invariant to

rotation. We can again divide this step into two smaller steps:

1. Calculate the magnitude and orientation


2. Create a histogram for magnitude and orientation

Calculate Magnitude and Orientation

Consider the sample image shown below:

Let’s say we want to find the magnitude and orientation for the pixel value in red.

For this, we will calculate the gradients in the x and y directions by taking the

difference between 55 & 46 and 56 & 42. This comes out to be Gx = 9 and Gy =

14, respectively.

Once we have the gradients, we can find the magnitude and orientation using the

following formulas:

Magnitude = √[(Gx)² + (Gy)²] = √(9² + 14²) ≈ 16.64

Φ = atan(Gy / Gx) = atan(1.55) ≈ 57.17°

The magnitude represents the intensity of the pixel and the orientation gives the

direction for the same.

We can now create a histogram given that we have these magnitude and

orientation values for the pixels.
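The same numbers can be reproduced with a couple of NumPy calls; the values below are the gradients from the worked example above.

import numpy as np

gx, gy = 9.0, 14.0                             # gradients from the example pixel
magnitude = np.hypot(gx, gy)                   # sqrt(gx^2 + gy^2) ~ 16.64
orientation = np.degrees(np.arctan2(gy, gx))   # ~ 57 degrees
print(magnitude, orientation)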

Creating a Histogram for Magnitude and Orientation


On the x-axis, we will have bins for angle values, like 0-9, 10 – 19, 20-29, and up

to 360. Since our angle value is 57, it will fall in the 6th bin. The 6th bin value will

be in proportion to the magnitude of the pixel, i.e. 16.64. We will do this for all

the pixels around the keypoint.

This is how we get the below histogram:

You can refer to this article for a much more detailed explanation for calculating

the gradient, magnitude, orientation, and plotting histogram – A Valuable

Introduction to the Histogram of Oriented Gradients .

This histogram would peak at some point. The bin at which we see the peak

will be the orientation of the keypoint. Additionally, if there is another

significant peak (within 80–100% of the highest peak), then another keypoint is generated

with the same magnitude and scale as the keypoint used to generate the
histogram. And the angle or orientation will be equal to the new bin that has the

peak.

Effectively at this point, we can say that there can be a small increase in the

number of keypoints.
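A hedged sketch of this 36-bin, magnitude-weighted orientation histogram is shown below; the 16x16 window of random gradients stands in for the real neighborhood around a keypoint.

import numpy as np

def orientation_histogram(magnitudes, orientations, num_bins=36):
    """Histogram of orientations (degrees, 0-360), weighted by gradient magnitude."""
    hist, _ = np.histogram(orientations, bins=num_bins, range=(0, 360),
                           weights=magnitudes)
    return hist

mag = np.random.rand(16, 16)          # stand-in gradient magnitudes
ang = np.random.rand(16, 16) * 360    # stand-in gradient orientations
hist = orientation_histogram(mag.ravel(), ang.ravel())
dominant_orientation = hist.argmax() * (360 / 36)   # bin with the peak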

Keypoint Descriptor

This is the final step for SIFT(scale invariant feature transform) computer vision.

So far, we have stable keypoints that are scale-invariant and rotation-invariant. In

this section, we will use the neighboring pixels, their orientations, and their

magnitude to generate a unique fingerprint for this keypoint called a ‘descriptor’.

Additionally, since we use the surrounding pixels, the descriptors will be partially

invariant to the illumination or brightness of the images.

We will first take a 16×16 neighborhood around the keypoint. This 16×16 block is

further divided into sixteen 4×4 sub-blocks, and for each of these sub-blocks, we

generate a histogram using magnitude and orientation.


At this stage, the bin size is increased, and we take only 8 bins (not 36). Each of

these arrows represents the 8 bins, and the length of the arrows defines the

magnitude. So, we will have a total of 128 bin values for every keypoint.

Here is an example using pyplot in matplotlib:

import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read the image and convert it to grayscale
img1 = cv2.imread('eiffel_2.jpeg')
gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)

# detect keypoints and compute descriptors
# (cv2.xfeatures2d.SIFT_create() needs opencv-contrib; on OpenCV >= 4.4,
# cv2.SIFT_create() can be used instead)
sift = cv2.xfeatures2d.SIFT_create()
keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)

# draw the detected keypoints on the grayscale image
img_1 = cv2.drawKeypoints(gray1, keypoints_1, img1)
plt.imshow(img_1)

Feature Matching

We will now use the SIFT computer vision features for feature matching. For this

purpose, I have downloaded two images of the Eiffel Tower taken from different

positions. You can try it with any two images that you want.

Here are the two images that I have used:


import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

figure, ax = plt.subplots(1, 2, figsize=(16, 8))

ax[0].imshow(img1, cmap='gray')
ax[1].imshow(img2, cmap='gray')

Now, for both of these images, we are going to generate the SIFT features. First,

we have to construct a SIFT object. We first create a SIFT computer vision object

using sift_create and then use the function detectAndCompute to get the

keypoints. It will return two values – the keypoints and the sift computer vision

descriptors.

Let’s determine the keypoints and print the total number of keypoints found in

each image:

import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

#sift
sift = cv2.xfeatures2d.SIFT_create()

keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

len(keypoints_1), len(keypoints_2)
# output: (283, 540)

Next, let’s try and match the features from image 1 with features from image 2.

We will be using the function match() from the BFmatcher (brute force match)

module. Also, we will draw lines between the features that match both images.

This can be done using the drawMatches function in OpenCV python.

import cv2
import matplotlib.pyplot as plt
%matplotlib inline

# read images
img1 = cv2.imread('eiffel_2.jpeg')
img2 = cv2.imread('eiffel_1.jpg')

img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

#sift
sift = cv2.xfeatures2d.SIFT_create()

keypoints_1, descriptors_1 = sift.detectAndCompute(img1, None)
keypoints_2, descriptors_2 = sift.detectAndCompute(img2, None)

#feature matching: brute-force matcher with L1 distance and cross-checking
bf = cv2.BFMatcher(cv2.NORM_L1, crossCheck=True)

matches = bf.match(descriptors_1, descriptors_2)
matches = sorted(matches, key=lambda x: x.distance)

# draw the 50 best matches between the two images
img3 = cv2.drawMatches(img1, keypoints_1, img2, keypoints_2, matches[:50], img2, flags=2)

plt.imshow(img3), plt.show()
I have plotted only 50 matches here for clarity’s sake. You can increase the

number according to what you prefer. To find out how many keypoints are

matched, we can print the length of the variable matches. In this case, the

answer would be 190.

Conclusion

In this article, we discussed the SIFT feature-matching algorithm in detail. Here

is a site that provides excellent visualization for each step of SIFT. You can add

your own image, and it will create the keypoints for that image as well. Check it

out here. Another popular feature-matching algorithm is SURF (Speeded Up

Robust Feature), which is simply a faster version of SIFT. I would encourage you

to go ahead and explore it as well.

And if you’re new to the world of computer vision and image data, I recommend

checking out the below course:

 Computer Vision using Deep Learning 2.0

Key Takeaways
 SIFT (Scale-Invariant Feature Transform) is a powerful technique for image

matching that can identify and match features in images that are invariant to

scaling, rotation, and affine distortion.

 It is widely used in computer vision applications, including image matching,

object recognition, and 3D reconstruction.

 The SIFT technique involves generating a scale space of images with different

scales and then using the Difference of Gaussian (DoG) method to identify

keypoints in the images.

 It also involves computing descriptors for each keypoint, which can be used for

feature matching and object recognition.

 It can be implemented using Python and the OpenCV library, which provides a

set of functions for detecting keypoints, computing descriptors, and matching

features.

Speeded up robust features





In computer vision, speeded up robust features (SURF) is a patented local feature


detector and descriptor. It can be used for tasks such as object recognition, image
registration, classification, or 3D reconstruction. It is partly inspired by the scale-
invariant feature transform (SIFT) descriptor. The standard version of SURF is
several times faster than SIFT and claimed by its authors to be more robust against
different image transformations than SIFT.
To detect interest points, SURF uses an integer approximation of the determinant of
Hessian blob detector, which can be computed with 3 integer operations using a
precomputed integral image. Its feature descriptor is based on the sum of the Haar
wavelet response around the point of interest. These can also be computed with the
aid of the integral image.
SURF descriptors have been used to locate and recognize objects, people or faces,
to reconstruct 3D scenes, to track objects and to extract points of interest.
SURF was first published by Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, and
presented at the 2006 European Conference on Computer Vision. An application of
the algorithm is patented in the United States.[1] An "upright" version of SURF (called
U-SURF) is not invariant to image rotation and therefore faster to compute and better
suited for application where the camera remains more or less horizontal.
The image is transformed into coordinates, using the multi-resolution pyramid
technique, to copy the original image with Pyramidal Gaussian or Laplacian
Pyramid shape to obtain an image with the same size but with reduced bandwidth.
This achieves a special blurring effect on the original image, called Scale-Space and
ensures that the points of interest are scale invariant.
Algorithm and features
The SURF algorithm is based on the same principles and steps as SIFT; but details
in each step are different. The algorithm has three main parts: interest point
detection, local neighborhood description, and matching.
Detection
SURF uses square-shaped filters as an approximation of Gaussian smoothing. (The
SIFT approach uses cascaded filters to detect scale-invariant characteristic points,
where the difference of Gaussians (DoG) is calculated on rescaled images
progressively.) Filtering the image with a square is much faster if the integral
image is used:

The sum of the original image within a rectangle can be evaluated quickly using
the integral image, requiring evaluations at the rectangle's four corners.
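A small NumPy sketch of this trick (not part of the Wikipedia text): once the integral image is precomputed, the sum over any rectangle needs only four lookups.

import numpy as np

img = np.random.randint(0, 256, (100, 100)).astype(np.int64)   # stand-in image

# integral image, padded with a zero row/column for easy indexing
ii = np.zeros((101, 101), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from the four corners of the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

assert rect_sum(ii, 10, 20, 30, 40) == img[10:30, 20:40].sum()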
SURF uses a blob detector based on the Hessian matrix to find points of interest.
The determinant of the Hessian matrix is used as a measure of local change
around the point and points are chosen where this determinant is maximal. In
contrast to the Hessian-Laplacian detector by Mikolajczyk and Schmid, SURF
also uses the determinant of the Hessian for selecting the scale, as is also done
by Lindeberg. Given a point p=(x, y) in an image I, the Hessian matrix H(p, σ) at
point p and scale σ, is:

    H(p, σ) = | Lxx(p, σ)   Lxy(p, σ) |
              | Lxy(p, σ)   Lyy(p, σ) |

where Lxx(p, σ) is the convolution of the second-order derivative of the
Gaussian, ∂²g(σ)/∂x², with the image I at the point p, and similarly for
Lxy(p, σ) and Lyy(p, σ).


The box filter of size 9×9 is an approximation of a Gaussian with σ=1.2 and
represents the lowest level (highest spatial resolution) for blob-response
maps.
Scale-space representation and location of points of interest
Interest points can be found at different scales, partly because the search for
correspondences often requires comparison images where they are seen at
different scales. In other feature detection algorithms, the scale space is
usually realized as an image pyramid. Images are repeatedly smoothed with
a Gaussian filter, then they are subsampled to get the next higher level of the
pyramid. Therefore, several floors or stairs with various measures of the
masks are calculated:

The scale space is divided into a number of octaves, where an octave
refers to a series of filter response maps covering a doubling of scale. In
SURF, the lowest level of the scale space is obtained from the output of
the 9×9 filters.
Hence, unlike previous methods, scale spaces in SURF are implemented
by applying box filters of different sizes. Accordingly, the scale space is
analyzed by up-scaling the filter size rather than iteratively reducing the
image size. The output of the above 9×9 filter is considered as the initial
scale layer at scale s =1.2 (corresponding to Gaussian derivatives
with σ = 1.2). The following layers are obtained by filtering the image with
gradually bigger masks, taking into account the discrete nature of integral
images and the specific filter structure. This results in filters of size 9×9,
15×15, 21×21, 27×27,.... Non-maximum suppression in a 3×3×3
neighborhood is applied to localize interest points in the image and over
scales. The maxima of the determinant of the Hessian matrix are then
interpolated in scale and image space with the method proposed by
Brown, et al. Scale space interpolation is especially important in this case,
as the difference in scale between the first layers of every octave is
relatively large.
Descriptor
The goal of a descriptor is to provide a unique and robust description of
an image feature, e.g., by describing the intensity distribution of the pixels
within the neighbourhood of the point of interest. Most descriptors are
thus computed in a local manner, hence a description is obtained for
every point of interest identified previously.
The dimensionality of the descriptor has direct impact on both its
computational complexity and point-matching robustness/accuracy. A
short descriptor may be more robust against appearance variations, but
may not offer sufficient discrimination and thus give too many false
positives.
The first step consists of fixing a reproducible orientation based on
information from a circular region around the interest point. Then we
construct a square region aligned to the selected orientation, and extract
the SURF descriptor from it.
Orientation assignment
In order to achieve rotational invariance, the orientation of the point of
interest needs to be found. The Haar wavelet responses in both the x- and y-
directions within a circular neighbourhood of radius 6s around the
point of interest are computed, where s is the scale at which the
point of interest was detected. The obtained responses are weighted by a
point of interest was detected. The obtained responses are weighted by a
Gaussian function centered at the point of interest, then plotted as points
in a two-dimensional space, with the horizontal response in
the abscissa and the vertical response in the ordinate. The dominant
orientation is estimated by calculating the sum of all responses within a
sliding orientation window of size π/3. The horizontal and vertical
responses within the window are summed. The two summed responses
then yield a local orientation vector. The longest such vector overall
defines the orientation of the point of interest. The size of the sliding
window is a parameter that has to be chosen carefully to achieve a
desired balance between robustness and angular resolution.
Descriptor based on the sum of Haar wavelet responses
To describe the region around the point, a square region is extracted,
centered on the interest point and oriented along the orientation as
selected above. The size of this window is 20s.
The interest region is split into smaller 4x4 square sub-regions, and for
each one, the Haar wavelet responses are extracted at 5x5 regularly
spaced sample points. The responses are weighted with a Gaussian (to
offer more robustness for deformations, noise and translation).
Matching
By comparing the descriptors obtained from different images, matching
pairs can be found.

Gradient Location and Orientation Histogram (GLOH)

Exploring Gradient Location Orientation Histogram (GLOH) for Image Recognition and Object Detection

By Vincent Chung, Apr 9, 2023

Image recognition and object detection are fundamental tasks in


computer vision and have numerous applications, ranging from
surveillance and security to robotics and autonomous vehicles. One
key aspect of these tasks is the extraction of informative and
discriminative features from images to enable accurate recognition
and detection. One such feature descriptor that has gained
popularity in recent years is the Gradient Location Orientation
Histogram (GLOH).

GLOH is an extension of the widely used SIFT (Scale-Invariant


Feature Transform) descriptor, which has been proven to be robust
to changes in scale, rotation, and illumination. However, SIFT
suffers from some limitations, such as being computationally
expensive and not being able to handle images with repetitive
patterns or cluttered backgrounds effectively. GLOH overcomes
these limitations by incorporating additional information about the
gradient orientation and location of image keypoints, making it
more robust and efficient for image recognition and object detection
tasks.
Key Features of GLOH:

Gradient Orientation: GLOH takes into account the gradient


orientation of keypoints, which provides rich information about the
local structure and texture of an image. By considering the gradient
orientation, GLOH can capture important details and edges in an
image, making it more discriminative for recognition and detection
tasks.

Location Information: GLOH also includes the location information


of keypoints, which allows it to capture the spatial distribution of
features in an image. This is particularly useful for tasks where the
relative positions of objects or regions in an image are important,
such as object detection and tracking.

Histogram Representation: GLOH represents the gradient


orientation and location information using histograms, which
provide a compact and efficient way to encode the local features. The
histograms are computed over different spatial scales and
orientations, which makes GLOH robust to scale and rotation
changes in the image.

Advantages of GLOH:

Robust to Scale, Rotation, and Illumination Changes: Like SIFT,


GLOH is invariant to changes in scale, rotation, and illumination,
which makes it suitable for various imaging conditions and
environments. This allows GLOH to be used in a wide range of
image recognition and object detection tasks.

Efficient and Scalable: GLOH is computationally efficient compared


to some other feature descriptors, such as SIFT. It can be computed
quickly even for large images or large datasets, making it scalable for
real-world applications.

Discriminative and Informative: GLOH captures rich information


about the local structure, texture, and spatial distribution of features
in an image, making it highly discriminative and informative for
recognition and detection tasks. This allows GLOH to achieve high
accuracy in challenging image recognition and object detection
scenarios.

Applications of GLOH:

Object Detection: GLOH can be used for object detection tasks,


where the goal is to localize and identify objects of interest in an
image. The robustness of GLOH to scale, rotation, and illumination
changes makes it suitable for detecting objects under different
imaging conditions and environments.

Image Recognition: GLOH can be used for image recognition tasks,


such as image classification and scene recognition, where the goal is
to identify the content or category of an image. The discriminative
power of GLOH allows it to capture important visual cues and
features for accurate image recognition.
Image Matching and Retrieval: GLOH features can also be used for image
matching and retrieval tasks, allowing for quick and accurate retrieval of
similar images from a large database.

Robotics and Autonomous Vehicles: GLOH can be used in robotics


and autonomous vehicle applications for tasks such as object
detection, tracking, and navigation. The robustness of GLOH to
scale, rotation, and illumination changes makes it suitable for real-
world environments where lighting conditions and object
orientations may vary.

Implementation with OpenCV: To implement GLOH in Python, we


can use the popular computer vision library OpenCV. OpenCV
provides various functions to compute image gradients, histograms,
and keypoints, which can be used to implement GLOH. Here’s a
sample code snippet that demonstrates how to compute GLOH
features using OpenCV:

import cv2

# Load an image (file name is illustrative)
image = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

# Compute gradient magnitude and orientation using Sobel operators
grad_x = cv2.Sobel(image, cv2.CV_32F, 1, 0, ksize=3)
grad_y = cv2.Sobel(image, cv2.CV_32F, 0, 1, ksize=3)
mag, angle = cv2.cartToPolar(grad_x, grad_y, angleInDegrees=True)

# Compute keypoints using an appropriate keypoint detector (e.g. SIFT or ORB).
# cv2.xfeatures2d.SIFT_create() needs opencv-contrib; on OpenCV >= 4.4,
# cv2.SIFT_create() can be used instead.
keypoints = cv2.xfeatures2d.SIFT_create().detect(image, None)

# Compute GLOH-style features (36-bin orientation histograms) for each keypoint
gloh_features = []
for kp in keypoints:
    x, y = int(kp.pt[0]), int(kp.pt[1])
    scale = max(int(kp.size / 2), 1)
    # skip keypoints whose window falls outside the image
    if y - scale < 0 or x - scale < 0 or y + scale > angle.shape[0] or x + scale > angle.shape[1]:
        continue
    patch = angle[y - scale:y + scale, x - scale:x + scale]
    histogram = cv2.calcHist([patch], [0], None, [36], [0, 360])
    gloh_features.append(histogram)

# Concatenate the per-keypoint histograms into a single normalized feature vector
gloh_features = cv2.normalize(cv2.hconcat(gloh_features), None)

# Use the GLOH features for image recognition or object detection tasks
# ...

Conclusion: Gradient Location Orientation Histogram (GLOH) is a


powerful feature descriptor for image recognition and object
detection tasks. It captures rich information about the gradient
orientation and location of keypoints, making it robust, efficient,
and discriminative. With the availability of libraries like OpenCV,
implementing GLOH in Python is straightforward, making it a
valuable tool for various computer vision applications.

Local energy-based shape histogram (LESH)

Blob detection

Feature detection (computer vision)
