
Computer vision

NN architecture
Artificial Neural Networks (ANNs) are built from layers of neurons: an input layer that receives the input data, hidden layers that process it, and an output layer that turns the last hidden layer's output into the final result. Neurons are linked by weighted connections, and activation functions introduce non-linearities that let the network capture complex patterns in the data. Training is an iterative process: in forward propagation, input data travels through the layers as each neuron calculates a weighted sum, applies an activation function, and passes the information layer by layer towards the output. In backpropagation, the model computes the gradient of the error with respect to the weights and updates them to minimize the error, so that after training the model makes accurate predictions on new data.

What is a gradient?
The gradient is a vector of partial derivatives of a function with respect to each of its variables. During training it indicates how the loss would change if the parameters were adjusted, and the model is trained by adjusting its parameters to minimize a predefined loss function.

What is gradient descent?

Gradient Descent is an optimization algorithm commonly used to find the minimum of a loss function, which measures the difference between the predicted values and the actual values.

Gradient vs gradient descent

Gradient descent is a method used to find the minimum of a loss function that measures the difference between the predicted values and the actual values. It moves in the direction of the steepest slope along which the loss decreases. The "gradient" refers to the vector of partial derivatives of the function with respect to its parameters, and this vector guides the algorithm towards the steepest decrease in the function.

Procedure:

1. Start with initial values for the model parameters.

2. Calculate the gradient of the loss function with respect to each parameter. The gradient indicates the direction in which the function increases most steeply.

3. Adjust the model parameters in the opposite direction of the gradient, moving towards the minimum of the function.

4. Repeat steps 2 and 3 until a stopping criterion is met, such as a predefined number of iterations or until the change in the loss function becomes sufficiently small.
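A minimal NumPy sketch of this procedure, fitting a least-squares linear model; the data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                            # step 1: initial parameter values
lr = 0.1                                   # learning rate (step size)
for step in range(200):                    # step 4: repeat until the criterion is met
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)   # step 2: gradient of the MSE loss w.r.t. w
    w -= lr * grad                         # step 3: move against the gradient
print(w, np.mean((X @ w - y) ** 2))        # w approaches true_w; the loss becomes small
```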

What is batch size?

Batch size is the number of data samples used in each iteration when training the model. It influences the speed and stability of the training process and helps manage memory constraints. Common batch sizes include 32, 64, and 128; smaller sizes provide more frequent updates to the model and potentially faster convergence, while larger sizes may use computational resources more efficiently.

What is an epoch?

An epoch is one complete pass over the entire training dataset during the training process. Over each epoch the model updates its parameters based on the error calculated on the training data, improving its ability to generalize and make accurate predictions on unseen data.

What is resizing?
Resizing adjusts image dimensions so that all images share a uniform size, keeping the model's inputs consistent and improving computational efficiency. Without resizing, models may face input-size inconsistencies and difficulties in feature extraction.

What is normalization?
Normalization scales the pixel values of images into a standardized range, often between 0 and 1, by dividing each pixel value by the maximum possible value (255 for an 8-bit image). This improves model convergence and numerical stability during training. Without normalization, large or uneven input ranges can hurt overall performance and stall the learning process.
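A one-line sketch of this scaling on a hypothetical batch of 8-bit images (the shapes are illustrative):

```python
import numpy as np

# Hypothetical batch of 8-bit RGB images: 32 images of 224x224 pixels.
images = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)

# Divide by the maximum possible value (255) to map [0, 255] into [0, 1].
normalized = images.astype(np.float32) / 255.0
```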

What is batch normalization?

Batch normalization normalizes the input of each layer over a mini-batch by subtracting the batch mean and dividing by the batch standard deviation, which helps mitigate internal covariate shift and accelerates training.

Mini-batch: A mini-batch is a subset of the training dataset used to compute the gradient of the loss function when training the model.
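A minimal NumPy sketch of the normalization step described above; gamma, beta, and eps are the usual learnable scale, learnable shift, and numerical-stability constant:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: mini-batch of activations, shape (batch_size, features)
    mean = x.mean(axis=0)                    # batch mean
    var = x.var(axis=0)                      # batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # subtract mean, divide by std
    return gamma * x_hat + beta              # learnable scale and shift
```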
Internal Batch Normalization:

Applied within a neural network layer, normalizing the inputs during training to mitigate
internal covariate shift.

Commonly used in feedforward neural networks and convolutional neural networks (CNNs)

External Batch Normalization:

Normalization applied externally to the network, typically at the input or output of the entire network. Useful in transfer learning.

What is data augmentation?

Data augmentation is a technique for artificially enlarging the training set by creating modified copies of existing images through random transformations such as rotation and flipping. It enhances model generalization by exposing the model to different variations, preventing overfitting and increasing the model's performance and robustness.

Without data augmentation, the model may overfit and perform poorly on new data.
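A minimal sketch of the rotation and flipping described above, using Keras preprocessing layers (available in recent TensorFlow versions); the factors are illustrative:

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random horizontal flips
    tf.keras.layers.RandomRotation(0.1),        # rotate up to +/-10% of a full turn
])

# images: a float batch of shape (batch, height, width, channels)
# augmented = augment(images, training=True)    # applied only during training
```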

Transfer learning
Transfer learning is a technique in which a pre-trained model is adapted for a new task. It leverages knowledge (features, weights, etc.) gained from a source task, typically trained on a large dataset, to enhance computational efficiency and improve generalization on the new task.
Idea: traditional learning trains a separate model from scratch for each task, whereas transfer learning reuses knowledge from a source task for the target task.
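A minimal Keras sketch of this idea: freeze a pre-trained backbone and train only a new classification head. MobileNetV2 and the 10-class head are illustrative choices:

```python
import tensorflow as tf

# Backbone pre-trained on ImageNet, without its classification top.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the source-task knowledge (features, weights)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # hypothetical 10-class task
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```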

Image Classification:
 Definition: Classifies an entire image into predefined classes.
 Example: Identifying objects in an image, such as classifying a photo as
containing a cat or a dog.
 Output: Single label or probability distribution over classes.

Object Detection:

 Definition: Identifies and localizes multiple objects within an image.
 Example: Detecting and drawing bounding boxes around people, cars, and other objects in a scene.
 Output: Multiple bounding boxes with associated class labels.

Image Segmentation:

 Definition: Divides an image into segments or regions based on pixel-level understanding.
 Example: Assigning each pixel in an image to a specific class, providing detailed object boundaries.
 Output: Pixel-wise labeled image, highlighting different objects or regions.
Forward Propagation:

 Definition: Forward propagation is the process where input data is passed through the network to generate predictions.
 Process: Neurons calculate weighted sums, apply activation functions, and pass information layer by layer towards the output.
 Output: The final layer produces the model's prediction or output.

Back Propagation:

Backpropagation, during the training process, uses the gradient of the error with respect to the parameters to update them and bring the error to a minimum.

convolutional layer
A convolutional layer is a building block of Convolutional Neural Networks (CNNs),
designed for processing and analyzing structured images. Convolutional layers are
particularly effective in capturing local patterns and hierarchies of features in the
input data.

Dense Layer (Fully Connected Layer)


A dense layer, also known as a fully connected layer, is a layer where each neuron is
connected to every neuron in the previous layer.

It's commonly used in the final layers of neural networks for tasks like classification
and regression.

2.Dropout:

•Dropout is a regularization technique used to prevent overfitting. During training, it randomly "drops out" (deactivates) a fraction of neurons, pushing the network to learn more robust features.

•It helps prevent the network from relying too heavily on any one feature, thereby improving generalization.

3.Sequential Model:

•A sequential model is a type of neural network architecture that consists of a linear stack of layers. Data flows sequentially through these layers.

•It's suitable for building straightforward feedforward neural networks where each layer has one input and one output.

4.AveragePooling2D:

•Average pooling is a type of pooling layer used in convolutional neural networks.


•It replaces a region of the input with the average value of the data in that region. It
helps reduce spatial dimensions while retaining important features.

5.Type of Pooling (e.g., Max-Pooling, Average Pooling):

•Pooling layers are used in convolutional neural networks to reduce spatial dimensions and control overfitting.

•Max-pooling selects the maximum value in a region, while average pooling computes the average value.

•Pooling helps retain essential information while reducing calculation complexity.

6.Padding:

•Padding is the process of adding extra (typically zero) values around the input data
before applying convolution operations.

•Padding helps control the spatial dimensions of feature maps after convolution and
prevents information loss from the image boundaries.

7.Flattening:

•Flattening is the process of converting a multi-dimensional data structure (e.g., a feature map) into a one-dimensional vector.

•It's often used to connect convolutional or pooling layers to fully connected layers.
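A small Sequential sketch tying together the layers from items 2–7 above; the layer sizes and class count are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same",
                           input_shape=(64, 64, 3)),   # zero-padded convolution
    tf.keras.layers.AveragePooling2D(pool_size=2),     # average over 2x2 regions
    tf.keras.layers.Flatten(),                         # feature maps -> 1-D vector
    tf.keras.layers.Dense(64, activation="relu"),      # fully connected layer
    tf.keras.layers.Dropout(0.5),                      # drop 50% of units in training
    tf.keras.layers.Dense(10, activation="softmax"),   # hypothetical 10 classes
])
```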

8.Losses (e.g., Mean Squared Error, Categorical Cross-Entropy):

•Loss functions measure the error between predicted and actual values and guide the
optimization process during training.

•Mean Squared Error (MSE) is used for regression tasks, while Categorical Cross-
Entropy is common for classification tasks.

•The choice of loss function depends on the specific problem being addressed.
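A quick NumPy sketch of both losses on made-up values:

```python
import numpy as np

# Mean Squared Error for a regression output.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
mse = np.mean((y_true - y_pred) ** 2)      # = 0.02

# Categorical Cross-Entropy for one 3-class sample.
p_true = np.array([0.0, 1.0, 0.0])         # one-hot ground truth
p_pred = np.array([0.2, 0.7, 0.1])         # predicted probabilities
cce = -np.sum(p_true * np.log(p_pred))     # = -log(0.7), about 0.357
```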

Activation Function (AF)
Activation functions are mathematical operations applied to the output of each neuron in a neural network. They introduce non-linearity, enabling the network to learn complex patterns and relationships in the data. Without activation functions, neural networks would reduce to linear transformations, with little ability to capture non-linear structures. Activation functions also enable learning during the backpropagation process, allowing the model to adjust weights and optimize for improved performance.

Sigmoid Activation Function: The sigmoid activation function squashes input values into the range between 0 and 1, making it suitable for binary classification tasks. It maps inputs onto a sigmoid curve, producing outputs that represent probabilities. While commonly used in binary classification output layers, the sigmoid function may suffer from the vanishing gradient problem when applied to hidden layers, so it is rarely used there, where vanishing gradients can severely hinder training.

If the input to the sigmoid function is too small or too large, the gradient becomes tiny and approaches zero. The weights then cannot be properly updated during backpropagation, information stops flowing to the later layers, and the network fails to train.

ReLU Activation Function: The Rectified Linear Unit (ReLU) activation function outputs zero for negative inputs and leaves positive values unchanged, promoting sparsity and accelerating training. It is commonly used in hidden layers to address vanishing gradient issues.

Tanh Activation Function: The hyperbolic tangent (tanh) activation function squashes input values into the range between -1 and 1 by mapping them onto a tanh curve. Tanh is suitable where outputs need to be symmetric around zero, offering advantages in modeling symmetric patterns and relationships.

Softmax Activation Function: The softmax activation function transforms inputs into a probability distribution and is commonly used in the output layers of neural networks for multi-class classification tasks. It computes class probabilities using exponentials and normalization, so that the probabilities across all classes sum to one. Softmax is used where multi-class classification requires outputs that represent class probabilities.

Leaky ReLU Activation Function: The Leaky ReLU activation function extends ReLU
by allowing small negative values, mitigating the "dying ReLU" problem. It introduces
a small negative slope for negative inputs, preventing complete inactivation. Leaky
ReLU is often used in hidden layers of deep networks to address the issue of neurons
becoming inactive during training, thereby enhancing the model's capacity to
capture diverse features and patterns.
When to use each:

 Sigmoid: Binary classification problems where the output needs to represent probabilities.
 ReLU: Hidden layers in deep networks for simplicity and avoiding vanishing gradient problems.
 Tanh: Situations requiring symmetric relationships and bounded output.
 Softmax: Multi-class classification problems to obtain class probabilities.
 Leaky ReLU: Used in hidden layers to mitigate the "dying ReLU" issue.
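NumPy sketches of the five functions above:

```python
import numpy as np

def sigmoid(x):                  # squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                     # zero for negatives, identity for positives
    return np.maximum(0.0, x)

def tanh(x):                     # squashes inputs into (-1, 1), symmetric about 0
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):   # small negative slope avoids "dying ReLU"
    return np.where(x > 0, x, alpha * x)

def softmax(x):                  # scores -> probabilities that sum to 1
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()
```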

Gradient Descent Optimizer


Gradient Descent Optimizer: The gradient descent optimizer is a fundamental
optimization algorithm used in training neural networks. It iteratively adjusts model
parameters in the direction of steepest descent of the loss function. The process
involves computing gradients, determining the step size (learning rate), and updating
weights. While simple, gradient descent may face challenges like convergence speed
and sensitivity to the learning rate.

Adam Optimizer: The Adam optimizer, short for Adaptive Moment Estimation, is a
popular optimization algorithm that combines ideas from both momentum and
RMSprop. It adapts the learning rates for each parameter individually, providing
robustness to different types of gradients. Adam includes mechanisms to counteract
biases introduced in gradient estimation and is widely used in various deep learning
tasks due to its effectiveness and efficiency.

Adagrad Optimizer: The Adagrad optimizer adjusts the learning rates of model
parameters based on the historical gradient information. It accumulates squared
gradients in the denominator, allowing for larger updates for infrequent parameters.
While Adagrad is effective for sparse data, it can suffer from diminishing learning
rates over time, leading to slow convergence in certain scenarios.

RMSprop Optimizer: Root Mean Square Propagation (RMSprop) is an optimizer designed to address the diminishing learning rates problem of Adagrad. It normalizes gradients using the moving average of squared gradients, preventing the learning rates from becoming too small. RMSprop is beneficial for non-stationary and noisy environments, offering improved convergence in various settings.

Adamax Optimizer: The Adamax optimizer is a variant of Adam that uses the infinity
norm (maximum absolute value) of the past gradients. It has been introduced to
address potential issues in Adam where the L2 norm could become very small.
Adamax offers improved stability and is less sensitive to large gradients, making it
suitable for specific deep learning applications.
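In Keras these optimizers can be instantiated directly; a brief sketch with illustrative learning rates:

```python
import tensorflow as tf

sgd     = tf.keras.optimizers.SGD(learning_rate=0.01)      # plain gradient descent
adam    = tf.keras.optimizers.Adam(learning_rate=0.001)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adamax  = tf.keras.optimizers.Adamax(learning_rate=0.002)

# model.compile(optimizer=adam, loss="categorical_crossentropy",
#               metrics=["accuracy"])
```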

Local vs global minimum:
A local minimum of a function is a point where the function value is smaller than at nearby points,
but possibly greater than at a distant point.

A global minimum is a point where the function value is smaller than at all other feasible points.

Vanishing Gradient Problem:

When the gradient of the loss function with respect to the weights becomes too small and approaches zero during backpropagation, the weight updates become negligible. The network then cannot learn efficiently, which leads to slow convergence and can stall the training process.

Exploding Gradient Problem: The exploding gradient problem occurs when gradients become excessively large during backpropagation. This can lead to fluctuating weights and cause the model parameters to diverge rather than converge to optimal values.

Dying ReLU Problem: The dying ReLU problem refers to a situation where neurons
using the Rectified Linear Unit (ReLU) activation function always output zero for any
input, rendering them inactive. This can happen when the weights associated with a
ReLU neuron become negative and consistently output zero during forward
propagation. In such cases, the neuron fails to update its weights during
backpropagation, impeding the learning process. Techniques like Leaky ReLU, which
allows small negative values, are employed to mitigate the dying ReLU problem and
maintain the adaptability of the network during training.

Overfitting-prevention techniques

1. **Data Augmentation:**

- **Definition:** Generate additional training examples by applying random transformations (e.g., rotations, flips, zooms) to existing images.

- **Purpose:** Increases the diversity of the training set, helping the model become more robust and preventing it from memorizing specific instances.
2. **Dropout:**

- **Definition:** During training, randomly deactivate a fraction of neurons in the network to prevent reliance on specific neurons and enhance generalization.

- **Purpose:** Reduces the risk of overfitting by preventing the network from becoming overly dependent on individual neurons.

3. **Weight Regularization (L1 and L2):**

- **Definition:** Introduce penalty terms in the loss function that discourage large
weights (L2 regularization) or non-zero weights (L1 regularization).

- **Purpose:** Discourages the model from fitting noise in the training data and
encourages a simpler, more generalized model.

4. **Early Stopping:**

- **Definition:** Monitor the model's performance on a validation set during training and stop training when the performance stops improving.

- **Purpose:** Prevents the model from continuing to learn the training data too well, ensuring it does not overfit by halting training at an optimal point.

5. **Use of Pre-trained Models:**

- **Definition:** Start with a pre-trained model on a large dataset and fine-tune it for the specific computer vision task.

- **Purpose:** Leverages knowledge learned from a diverse dataset, facilitating better generalization and preventing overfitting on smaller, task-specific datasets.

6. **Batch Normalization:**

- **Definition:** Normalize input batches to each layer during training to stabilize and accelerate the learning process.

- **Purpose:** Mitigates internal covariate shift and helps prevent overfitting by improving the stability of the model's training.
7. **Cross-Validation:**

- **Definition:** Split the dataset into multiple folds, train the model on different subsets, and evaluate its performance on the remaining unseen data.

- **Purpose:** Provides a more robust assessment of the model's generalization performance, reducing the risk of overfitting to a specific subset of the data.

These techniques can be used individually or in combination to effectively prevent overfitting in computer vision models, promoting better generalization and performance on new, unseen data.
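For instance, weight regularization (technique 3) and early stopping (technique 4) look like this in Keras; the penalty strength and patience are illustrative:

```python
import tensorflow as tf

# L2 weight regularization: penalize large weights in a layer.
layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4))

# Early stopping: halt when validation loss stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5,        # wait 5 epochs without improvement
    restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```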

Pre-defined algorithms

**VGG (Visual Geometry Group):**

VGG is a convolutional neural network architecture known for its simplicity and
uniform structure. It consists of several convolutional layers followed by max-pooling
layers, and it achieved high performance in image classification tasks. VGG networks
come in different versions, such as VGG16 and VGG19, where the numbers indicate
the layers' depths.

**ResNet (Residual Network):**

ResNet introduced the concept of residual learning, using skip connections to address the vanishing gradient problem in very deep networks. By allowing the direct flow of information from one layer to another, ResNet architectures, including ResNet50 and ResNet101, facilitated the training of exceptionally deep neural networks, leading to improved performance in image recognition tasks.

**Inception (GoogLeNet):**

Inception, also known as GoogLeNet, introduced the concept of inception modules, which use multiple filter sizes within the same layer. This architecture allows the network to capture features at different scales efficiently. Inception models are recognized for their computational efficiency and were the winners of the ImageNet Large Scale Visual Recognition Challenge in 2014.
**MobileNet:**

MobileNet is designed for mobile and edge devices, aiming for lightweight and
efficient neural network architectures. It utilizes depthwise separable convolutions to
reduce the number of parameters and computations while maintaining competitive
accuracy. MobileNet models are suitable for real-time applications with limited
computational resources.

**ResNeXt (Residual Next):**

ResNeXt extends the ResNet architecture by introducing a "cardinality" parameter, allowing for increased model complexity and parallelization. By using a combination of wider and deeper paths in residual blocks, ResNeXt achieves improved accuracy and efficiency. It is particularly effective in tasks demanding high precision and generalization.

CNN
A Convolutional Neural Network (CNN) is an architecture designed for tasks like image recognition. It contains convolutional layers that automatically learn hierarchical features from input data, followed by pooling layers that reduce spatial dimensions. Activation functions like ReLU introduce non-linearity, aiding complex pattern recognition. Fully connected layers capture global patterns, while normalization and regularization layers enhance stability. CNNs use filters to extract local features; striding and padding control spatial dimensions. Popular architectures like VGG, ResNet, and Inception leverage variations of these components, making CNNs highly effective for computer vision tasks.

Stride:
The "stride" is the step size that the convolutional filter takes when sliding over the input data (image or feature map) during the convolution operation. The stride determines how far the filter shifts at each step.

Filter
A filter is a small matrix used in the convolution operation. It detects patterns or features within input data, such as an image.

The filter is moved across the input data (e.g., an image) in a systematic way, and at each position it computes the dot product between its values and the values of the input data.
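A minimal single-channel convolution sketch showing the dot product at each position and how the stride controls the step size; the edge-detecting filter is an illustrative example:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1   # output height (no padding)
    ow = (image.shape[1] - kw) // stride + 1   # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(region * kernel)  # dot product of filter and region
    return out

vertical_edges = np.array([[1, 0, -1],
                           [1, 0, -1],
                           [1, 0, -1]])          # a simple vertical-edge detector
```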
GAN
Generative Adversarial Networks (GANs) have a generator and a discriminator network. The generator creates synthetic data, while the discriminator tries to tell the difference between real and generated samples. The two networks are trained simultaneously through adversarial training.

The process begins with the initialization of the generator and discriminator networks,
assigning them random weights. The generator then takes random noise as input and
produces synthetic data, such as images. Subsequently, the discriminator evaluates both real
data from the training set and the generated fake data, assigning probabilities to determine
their originality. The adversarial loss is calculated based on the generator's objective to create
data that matches the real data without difference and the discriminator's goal to accurately
classify real and fake samples. Gradients are then backpropagated through both networks,
updating their weights to minimize their respective losses. This iterative training process
repeats, enabling the generator to generate increasingly realistic data while the discriminator
enhances its ability to differentiate between real and fake samples. Ideally, the GAN
converges to a point where the generator produces data that the discriminator finds challenging to distinguish from real data, resulting in the generation of realistic synthetic data. The trained GAN can be
employed to generate new data with characteristics similar to the training set, showcasing its
ability to capture and replicate underlying patterns. However, achieving this equilibrium can
be challenging, and GANs may encounter issues like mode collapse or instability during
training.
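A compact TensorFlow sketch of one training step following the process above; the architectures, latent size, and learning rates are illustrative stand-ins, not a specific published GAN:

```python
import tensorflow as tf

latent_dim = 64
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(28 * 28, activation="sigmoid"),   # fake "image" pixels
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),         # probability of "real"
])
bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):
    noise = tf.random.normal((tf.shape(real_images)[0], latent_dim))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_pred = discriminator(real_images, training=True)
        fake_pred = discriminator(fake_images, training=True)
        # Discriminator: label real samples 1 and fake samples 0.
        d_loss = (bce(tf.ones_like(real_pred), real_pred) +
                  bce(tf.zeros_like(fake_pred), fake_pred))
        # Generator: try to make the discriminator output 1 for fakes.
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```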

R-CNN

Region-Based CNNs (R-CNN) revolutionized object detection by integrating Convolutional Neural Networks (CNNs) with a region proposal mechanism. Unlike traditional methods
examining the entire image, R-CNN strategically processes potential object-containing
regions. The initial step involves generating region proposals through methods like selective
search, identifying candidate bounding boxes. Subsequently, each proposal undergoes
feature extraction by passing through a pre-trained CNN, converting variable-sized regions
into fixed-sized feature vectors. The extracted features are then employed for object
classification and refinement of bounding box coordinates, typically involving additional
layers. The output of the R-CNN includes predicted class labels and refined bounding box
coordinates for each region proposal, allowing the algorithm to detect and classify objects
accurately within the image. This two-stage process ensures effective object localization and
identification, making R-CNN a seminal approach in the field of computer vision.

The loss calculation in Region-Based CNNs (R-CNN) plays a crucial role in training
the model to accurately localize and classify objects within images. The primary
components of the loss function include classification loss and bounding box
regression loss. The classification loss measures the disparity between predicted class
labels and actual labels for each region proposal. This ensures that the algorithm
correctly identifies the object categories. Simultaneously, the bounding box
regression loss evaluates the accuracy of predicted bounding box coordinates in
relation to the ground truth. These two components are typically combined, forming
a composite loss that the algorithm strives to minimize during training.

Faster_R-CNN

Faster R-CNN builds upon the foundations laid by R-CNN, introducing a key innovation to
enhance computational efficiency. This object detection algorithm integrates a Region
Proposal Network (RPN) directly into the Convolutional Neural Network (CNN) architecture,
eliminating the need for a separate region proposal mechanism like selective search. The
RPN efficiently generates region proposals on the CNN's feature map, evaluating potential
bounding boxes and assigning objectness scores. These proposals, along with their
associated scores, undergo feature extraction in the shared CNN, ensuring a consistent and
streamlined process. The features extracted are then employed for both object classification
and refinement of bounding box coordinates, following the principles of the original R-CNN.
The output of Faster R-CNN includes predicted class labels and refined bounding box
coordinates for each region proposal, providing accurate localization and classification of
objects within the image. This architectural refinement significantly improves processing
speed, making Faster R-CNN particularly suited for real-time applications without
compromising on detection precision.

Faced Problems

In object detection, the R-CNN, Fast R-CNN, and Faster R-CNN models face problems that include:

Training the data is unwieldy and takes too long.

Training happens in multiple phases (e.g., training the region proposal network vs. the classifier).

The network is too slow at inference time (i.e., when dealing with non-training data).

SSD
SSD utilizes a VGG, Inception, or ResNet architecture, without its fully connected layers, as its backbone.

SSD (Single Shot MultiBox Detector) contains a backbone model and an SSD head. The backbone acts as a feature extractor. The SSD head consists of additional convolutional layers added to the backbone to identify bounding boxes and object classes at various locations within the image.

Workflow in SSD:

The image is fed into the SSD network.

FEATURE EXTRACTOR

The feature extractor in SSD is a stack of convolutional layers followed by pooling layers. This arrangement extracts a set of multi-scale feature maps from the input image in a hierarchical fashion: lower layers focus on extracting edges and textures, the foundational elements of the image, while higher layers capture more complex features and objects. The pooling layers, whether max pooling or average pooling, play a crucial role: they downsample the spatial dimensions of the feature maps, reducing the resolution while retaining the most important information.

Anchor Boxes (Prior Boxes):

SSD uses a set of anchor boxes of different aspect ratios and scales at each spatial location in the
feature maps.

These anchor boxes act as reference boxes that are placed throughout the image. They cover a range
of sizes and aspect ratios to handle variations in object appearance.

During training, the model predicts adjustments (offsets) to these anchor boxes, allowing them to
adapt to the specific locations and sizes of objects in the image.

BOUNDING BOX:

The model predicts bounding box adjustments for each anchor box at each spatial
location. These adjustments consist of offsets for the box's center coordinates, width,
and height.
The predicted adjustments are applied to the corresponding anchor boxes to obtain
the final bounding box predictions.

Bounding box adjustments allow the model to accurately localize objects, and class
scores are predicted for each adjusted bounding box to determine the likelihood of
an object belonging to a specific class.
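A sketch of applying one common offset parameterization (shifting the center and scaling the size, as used in the R-CNN family and SSD) to an anchor box; the exact parameterization varies between detectors:

```python
import numpy as np

def decode_box(anchor, offsets):
    # anchor: (cx, cy, w, h); offsets: predicted (tx, ty, tw, th)
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return np.array([cx + tx * w,      # shift the center by a fraction of the size
                     cy + ty * h,
                     w * np.exp(tw),   # scale width and height multiplicatively
                     h * np.exp(th)])
```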

Prediction heads

SSD has prediction heads attached to different layers of the network, each
responsible for predicting:

Bounding Box Offsets: Adjustments to the default anchor boxes to accurately fit the
target objects.

Class Scores: Confidence scores for each class, indicating the likelihood of an object
belonging to a particular category.

NMS

The predicted bounding boxes and class scores are used in post-processing steps, such as
Non-Maximum Suppression (NMS), to filter out redundant and overlapping detections.
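A minimal plain-Python NMS sketch: repeatedly keep the highest-scoring box and discard the remaining boxes that overlap it too much. Boxes are (x1, y1, x2, y2) corners; the 0.5 threshold is illustrative:

```python
def box_iou(a, b):
    # Overlap area divided by union area of two corner-format boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)           # highest-scoring remaining box
        keep.append(best)
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```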

Metrics:

Intersection over Union (IoU):

Definition: IoU measures the overlap between the predicted bounding box and the
ground truth bounding box.

Use: Commonly used for evaluating the accuracy of object localization. IoU is
calculated as the intersection area divided by the union area of the two bounding
boxes.
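As a worked example: boxes (0, 0, 4, 4) and (2, 2, 6, 6) intersect in a 2x2 region, so IoU = 4 / (16 + 16 - 4) = 4/28 ≈ 0.143 (the box_iou helper in the NMS sketch above computes exactly this).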

Precision, Recall, and F1 Score:

Precision: Measures the accuracy of positive predictions, i.e., the ratio of correctly
predicted positive instances to the total predicted positives.

Recall (Sensitivity): Measures the ability of the model to capture all the relevant
instances, i.e., the ratio of correctly predicted positive instances to the total actual
positives.

F1 Score: The harmonic mean of precision and recall, providing a balance between
the two metrics.
Use: Evaluation of both localization and classification performance.
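A small sketch computing all three from detection counts (the counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # correct positives / predicted positives
    recall = tp / (tp + fn)      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
# -> precision = 0.8, recall ~ 0.667, F1 ~ 0.727
print(precision_recall_f1(8, 2, 4))
```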

Average Precision (AP) and Mean Average Precision (mAP):

Average Precision: Computes the average precision-recall curve and calculates the
area under the curve (AUC).

Mean Average Precision: Averages the AP scores across different object classes.

Use: Provides a comprehensive measure of the model's performance across different precision-recall trade-offs.

YOLO
YOLO, which stands for "You Only Look Once," is a real-time object detection system that can
detect multiple objects in an image in a single forward pass of the neural network. YOLO was
introduced to address the trade-off between accuracy and speed in object detection tasks.

Input Image:

Suppose we have a 448x448 RGB image as our input.

Convolutional Layers:

The image is passed through a series of convolutional layers. These layers learn hierarchical
features from the input image. Each layer captures different levels of abstraction, from edges
to more complex patterns. The majority of the convolutional layers use Leaky ReLU because it prevents dead neurons by allowing a small negative slope for negative inputs.

The final layer, responsible for predicting class probabilities and bounding boxes, often uses
a linear activation function.

A linear activation function is used in the final layer to produce unbounded real values for
bounding box coordinates and confidence scores.

Grid Division:

The feature map resulting from the convolutional layers is divided into a grid. Each cell in the
grid is responsible for predicting bounding boxes and class probabilities for objects within its
region.

Bounding Box and Class Prediction:

A bounding box describes the spatial location of an object (the ground-truth object location). Each bounding box is represented by (x, y, w, h, confidence), where (x, y) are the coordinates of the box's center, (w, h) are the width and height, and confidence is the confidence score.

Anchor Boxes (Optional):


Anchor boxes are a set of predefined bounding boxes of a certain height and width. These
boxes are defined to capture the scale and aspect ratio of specific object classes you want to
detect. The model predicts adjustments (offsets) to these anchor boxes, allowing the model
to adapt the predefined shapes to the actual shapes of objects.

Output Tensor:

The final output is a tensor of shape (S, S, B * (5 + C)), where S is the grid size, B is the
number of bounding boxes per grid cell, and C is the number of classes.

The tensor contains all the bounding box and class probability predictions for the entire grid.
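As a worked example, using YOLOv3-style settings as an assumption: a 13 × 13 grid (S = 13) with B = 3 boxes per cell and C = 80 classes gives an output tensor of shape 13 × 13 × 3 × (5 + 80) = 13 × 13 × 255.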

Non-Maximum Suppression:

Post-processing involves applying non-maximum suppression to filter out redundant or low-confidence bounding boxes.

Non-maximum suppression ensures that only the most confident and non-overlapping bounding boxes are retained.

Bounding boxes vs anchor boxes:
Bounding boxes are provided as annotations in object detection datasets, representing ground-truth object locations. Anchor boxes are used as templates for prediction during model training and inference. The model predicts adjustments to anchor boxes to match objects' actual locations.

YOLO Architecture Overview:

 YOLO is a convolutional neural network (CNN) designed for real-time object


detection.
 It consists of a total of 24 convolutional layers followed by 2 fully connected
layers.
 The architecture is divided into two parts:
 The first 20 convolutional layers followed by an average pooling layer
and a fully connected layer are pre-trained on the ImageNet dataset, a
1000-class classification dataset.
 The last 4 convolutional layers followed by 2 fully connected layers are
added for object detection.

**Non-Maximum Suppression (NMS) Purpose:**


NMS is a post-processing step in object detection that removes redundant bounding boxes
to retain the most confident and accurate detections. It ensures that only the most suitable
bounding box for each object remains, eliminating overlaps and enhancing the precision of
the final detection results. NMS is crucial for refining the output and improving the reliability
of the detected object locations.
Localization Loss (Location or Smooth L1 Loss):
The localization loss measures the difference between the predicted bounding box coordinates and the ground truth bounding
box coordinates. The Smooth L1 loss function is commonly used for this purpose. It is less sensitive to outliers.

Confidence Loss (Softmax Loss):

The confidence loss measures the difference between predicted class scores and the ground truth class scores. The Softmax loss function is typically used for this purpose. It encourages the correct class to have a high score and suppresses the scores for incorrect classes.
