Computer Vision Part 2
Computer vision applications are built on a set of tasks performed on an input
image to produce an output that can be used for prediction or data analysis.
The tasks on which computer vision applications are based are:
1) Single Objects
i) This means giving an image that contains a single object as input to the computer vision application.
ii) It can further be divided into 2 categories
a) Classification
Classification is the process of finding out the class/category of the input
image.
The input image is processed using a machine learning algorithm and
classified into predefined categories.
The most popular architecture used for image classification is the
Convolutional Neural Network (CNN).
b) Classification + Localization
Localization means finding where the object is located in the image.
The combined task of classification and localization means processing the
input image to identify its category along with the location of the object in
the image.
2) Multiple Objects
i) This means giving an image that contains multiple objects as input to the computer vision application.
ii) It can further be divided into 2 categories
a) Object Detection
Object detection is the process of identifying or detecting the instances of
real-world objects like cars, bicycles, buses, animals, humans or anything
on which the detection model has been trained.
Object detection draws a bounding box around each object in the image.
Object detection algorithms extract the features of the object, and then a
machine learning algorithm recognizes the instances of an object category
by matching them with the sample images already fed into the system.
b) Image Segmentation
It is the computer vision task of identifying and separating individual
objects within an image, which includes detecting the boundaries of each
object and assigning a unique label to it.
A segmentation model takes an image as input and outputs a collection of
regions (segments), as in the sketch after this list.
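A minimal sketch of the segmentation idea, using simple thresholding plus connected-component labeling rather than a trained model; the image values are arbitrary and the use of scipy.ndimage.label is an illustrative choice:

# Toy segmentation sketch: thresholding + connected components.
# Not a learned segmentation model; it only illustrates turning an
# image into a collection of labeled regions (segments).
import numpy as np
from scipy.ndimage import label

# Hypothetical 6x6 grayscale image: two bright blobs on a dark background.
img = np.array([
    [0,   0,   0,   0,   0,   0],
    [0, 200, 210,   0,   0,   0],
    [0, 205, 220,   0,   0,   0],
    [0,   0,   0,   0, 180,   0],
    [0,   0,   0, 190, 185,   0],
    [0,   0,   0,   0,   0,   0],
], dtype=np.uint8)

foreground = img > 128                       # separate objects from background
segments, num_segments = label(foreground)   # unique label per connected region

print(num_segments)   # -> 2 (two separate objects found)
print(segments)       # 0 = background, 1 and 2 = object labels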
What is Resolution?
i) Resolution refers to the dimensions that measure how many pixels are on a
screen.
ii) Screen resolution is expressed as the number of pixels displayed vertically
by the number of pixels displayed horizontally.
iii) For example, a Full HD screen displays 1080p, which means 1080 pixels
tall by 1920 pixels wide.
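A quick illustration of resolution as array dimensions, assuming a NumPy representation of a Full HD frame:

# Resolution as array dimensions: a Full HD frame is 1080 pixels tall
# and 1920 pixels wide; with 3 color channels it is a 1080x1920x3 array.
import numpy as np

height, width = 1080, 1920
frame = np.zeros((height, width, 3), dtype=np.uint8)  # an all-black Full HD frame

print(frame.shape)      # (1080, 1920, 3)
print(height * width)   # 2073600 pixels on screen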
Basics of Image
1) Greyscale Image
i) A greyscale image is one in which each pixel has a single value, i.e. it
carries only intensity information.
ii) It is also known as a black and white image.
iii) These images contain shades of only 2 colors, black and white, varying
from black at the weakest intensity to white at the strongest.
iv) A grayscale image stores each pixel in 1 byte and consists of a single
plane: a 2D array of pixels.
v) In grayscale images, the shade range starts at 0 and ends at 255, i.e. it
starts with pure black and ends with pure white (see the sketch after this list).
2) RGB Images
i) All colored images around us are made of the 3 primary colors Red, Green
and Blue.
ii) All colors are made by mixing the 3 basic colors of RGB in varying
intensities.
iii) Every colored image, when split, is stored in the form of 3 different
channels: the R channel, the G channel and the B channel.
iv) Each channel has pixel values varying from 0-255.
v) In a colored image, a single pixel contains red, green and blue values as a
triplet (see the sketch after this list).
Note: An image is made up of pixels and these pixels are arranged in a 2D matrix to form a digital
image.
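A minimal sketch of a grayscale image as a single 2D plane of 1-byte pixels; the pixel values are arbitrary examples:

# Grayscale image sketch: one 2D plane where each pixel is a single
# byte (uint8), ranging from 0 (pure black) to 255 (pure white).
import numpy as np

gray = np.array([
    [  0,  64, 128],
    [ 64, 128, 192],
    [128, 192, 255],
], dtype=np.uint8)

print(gray.shape)              # (3, 3): a single 2D plane, no channel axis
print(gray.dtype.itemsize)     # 1 -> each pixel occupies exactly 1 byte
print(gray.min(), gray.max())  # 0 255: pure black to pure white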
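And a matching sketch of an RGB image, where each pixel is a red, green, blue triplet and the three channels can be split apart; again the values are illustrative:

# RGB image sketch: every pixel is an (R, G, B) triplet, each value 0-255.
# Splitting the last axis gives the three separate channels.
import numpy as np

rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = (255,   0,   0)   # pure red pixel
rgb[0, 1] = (  0, 255,   0)   # pure green pixel
rgb[1, 0] = (  0,   0, 255)   # pure blue pixel
rgb[1, 1] = (255, 255,   0)   # yellow = red + green at full intensity

r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]  # the three channels
print(r)  # the R channel as its own 2D plane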
Kernel
1) A kernel, also known as a convolution matrix or mask, helps in image
processing by creating a wide range of effects like sharpening, blurring,
masking, etc.
2) The kernel is slid across the image and multiplied with the input image
matrix to generate an output image with the desired effect.
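A minimal sketch of sliding a kernel across a grayscale image, assuming a "valid" convolution with no padding and a common sharpen kernel; the helper name convolve2d is hypothetical:

# Kernel sketch: slide a 3x3 kernel across the image and, at each
# position, multiply the window by the kernel element-wise and sum.
# (The kernel is not flipped here, as is common in CNN usage.)
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

image = np.random.randint(0, 256, size=(5, 5)).astype(float)
print(convolve2d(image, sharpen).shape)  # (3, 3): output of the slide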
Neural Networks
1) Neural networks are a series of algorithms used to recognize hidden patterns in raw data,
cluster and classify it, and continuously learn and improve.
2) Their main advantage is that data features can be extracted automatically by the
machine without input from the developer.
3) Neural networks are primarily used for solving problems with large datasets like images.
4) A neural network is divided into multiple layers and each layer is further divided into
several blocks called nodes.
5) First we have the input layer, which receives the input in several different formats provided
by the programmer and feeds it to the neural network; no processing occurs in the input
layer. Between the input and output layers, hidden layers transform the values they receive
from the previous layer. The output layer predicts our final output. The output at each node
is called its activation or node value.
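A minimal sketch of these layers as a NumPy forward pass, with random weights standing in for learned ones; the layer sizes are arbitrary assumptions:

# Input layer -> hidden layer -> output layer, as plain matrix math.
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(4)               # input layer: 4 raw values, no processing
W1 = rng.random((4, 3))         # weights from input layer to 3 hidden nodes
W2 = rng.random((3, 2))         # weights from hidden layer to 2 output nodes

hidden = np.maximum(x @ W1, 0)  # each hidden node's activation (node value)
output = hidden @ W2            # output layer: the network's prediction

print(output)                   # activations of the 2 output nodes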
Layers of a Convolutional Neural Network (CNN)
1) Convolution Layer
i) Convolution is the first layer of a CNN and is also known as the Feature Extractor Layer.
ii) The main purpose of this layer is to extract high-level features from the input
image by applying filters, performing operations such as edge detection, blurring
and sharpening.
iii) This layer carries out the convolution process, handling an image with several
types of kernels to provide features to the whole system.
iv) Each convolution kernel is used to generate a feature map based on the input provided.
v) Feature maps have multiple uses like:
a) The output of the filter applied to the previous layer is captured by the feature
map.
b) It helps in reducing the size of the image so that it can be processed easily.
c) It helps in focusing on the important features of the images like eyes, nose etc.
so that it can be processed efficiently.
2) Rectified Linear Unit (ReLU)
i) This layer is the next after the convolution layer.
ii) It takes the feature maps of the convolution layer and generates the activation map
by discarding all the negative numbers in the feature maps. All positive
numbers pass through to the system as-is, but all negative numbers become zero,
which makes the output non-linear while containing only positive values (see the
ReLU sketch after this list).
3) Pooling Layer
i) This layer reduces the dimensions of the input image while still retaining the
important features.
ii) This helps make the representation more resistant to small transformations,
distortions and translations.
iii) All this is done to reduce the number of parameters and computation in the
network thus making it more manageable and improving the efficiency of the whole
system.
iv) There are two types of pooling
a) Max Pooling: Max Pooling is the most commonly used method that selects the
maximum value of the current image view and helps preserve the maximum
detected features.
b) Average Pooling: Average Pooling finds the average value of the current
image view and thus downsamples the feature map (both types are shown in
the pooling sketch after this list).
4) Fully Connected Layer
i) This is the last and final layer of the convolutional neural network.
ii) After the features of the input image are extracted by the convolution layers and
downsampled by the pooling layers, their output is a 3-dimensional matrix which is
flattened into a vector of values.
iii) The values of this single vector each represent a specific feature of a specific label
and are passed on to the fully connected layers to predict the final outputs of the network.
iv) This helps in classifying the image into a specific label based on the probability of the
input belonging to a specific class (see the end-to-end sketch after this list).
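A minimal sketch of the ReLU step described above, with an illustrative 2x2 feature map:

# ReLU: positive values in the feature map pass through unchanged,
# negative values become zero, producing the activation map.
import numpy as np

feature_map = np.array([[ 3., -1.],
                        [-2.,  5.]])

activation_map = np.maximum(feature_map, 0)
print(activation_map)
# [[3. 0.]
#  [0. 5.]]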
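A minimal sketch of max and average pooling with a 2x2 window and stride 2; the activation values are arbitrary:

# Pooling: a 4x4 activation map downsampled to 2x2. Max pooling keeps
# the strongest response in each window; average pooling keeps the mean.
import numpy as np

a = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 2.],
              [5., 2., 9., 7.],
              [1., 3., 8., 6.]])

windows = a.reshape(2, 2, 2, 2).swapaxes(1, 2)  # four 2x2 windows
print(windows.max(axis=(2, 3)))   # max pooling     -> [[6. 2.] [5. 9.]]
print(windows.mean(axis=(2, 3)))  # average pooling -> [[3.5 1.25] [2.75 7.5]]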
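Finally, an end-to-end sketch of the four layers chained together, assuming PyTorch as the framework (the notes do not name one); the channel counts, image size and 10-class output are illustrative assumptions:

# Convolution -> ReLU -> pooling -> flatten -> fully connected.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # convolution: feature extractor
    nn.ReLU(),                                  # discard negative values
    nn.MaxPool2d(kernel_size=2),                # downsample, keep strong features
    nn.Flatten(),                               # 3D feature maps -> 1D vector
    nn.Linear(8 * 16 * 16, 10),                 # fully connected: 10 class scores
)

x = torch.randn(1, 3, 32, 32)  # one 32x32 RGB input image
logits = model(x)              # forward pass through all layers
probs = logits.softmax(dim=1)  # probability of each of the 10 classes
print(probs.shape)             # torch.Size([1, 10])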