Computer Vision NN Architecture
NN architecture
Artificial Neural Networks (ANNs) are built from layers of neurons. An input layer receives
the data, one or more hidden layers process it, and the output of the last hidden layer
becomes the input to the output layer, which produces the final result. Neurons are joined
by weighted connections, and activation functions introduce non-linearities that let the
network capture complex data patterns. In forward propagation, input data travels through
the layers: each neuron calculates a weighted sum of its inputs, applies an activation
function, and passes the result on, layer by layer, until the output is produced. During
training, backpropagation computes the gradient of the error with respect to the model
weights, and the weights are updated to minimize that error. This process is repeated many
times, after which the trained model can make accurate predictions on new data.
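The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration: the layer sizes and the hand-picked weights are assumptions, not values from these notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One layer: a weighted sum per neuron, followed by the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def forward(x):
    # Hypothetical weights, chosen by hand for illustration only.
    hidden = dense(x, [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], relu)
    return dense(hidden, [[1.0, -1.0]], [0.0], sigmoid)

prediction = forward([1.0, 2.0])  # a single value between 0 and 1
```

In a real network the weights would be learned by backpropagation rather than written by hand.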
What is Gradient:
The gradient is a vector of partial derivatives of a function with respect to each of its
variables. A model is trained by adjusting its parameters to minimize a predefined loss
function, and the gradient of that loss indicates how the loss would change if the
parameters were adjusted.
Procedure:
1. Calculate the gradient of the loss function with respect to each parameter. The gradient
points in the direction in which the function increases most steeply.
2. Adjust the model parameters in the opposite direction of the gradient, so that they move
towards a minimum of the function.
3. Repeat steps 1 and 2 until a stopping criterion is met, such as a predefined number of
iterations or when the change in the loss function becomes sufficiently small.
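A minimal sketch of this loop, assuming a one-parameter loss (w - 3)^2 whose minimum is at w = 3. The loss, the learning rate, and the tolerance are illustrative choices, not part of the notes.

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, max_iters=1000, tol=1e-12):
    for step in range(max_iters):
        # Move against the gradient.
        new_w = w - learning_rate * grad(w)
        # Stop when the loss barely changes any more.
        if abs(loss(new_w) - loss(w)) < tol:
            return new_w, step + 1
        w = new_w
    return w, max_iters

w_final, steps = gradient_descent(w=0.0)
```

Both stopping criteria from the procedure appear here: the iteration cap (`max_iters`) and the small change in loss (`tol`).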
Batch size is the number of data samples used in each iteration of model training. It
influences the speed and stability of the training process and helps manage memory
constraints. Common values for batch size include 32, 64, and 128, with smaller sizes
providing more frequent updates to the model and potentially faster convergence, while
larger sizes may utilize computational resources more efficiently.
What is epoch
An epoch is one complete pass through the entire training dataset during training. Over the
course of an epoch, the model updates its parameters based on the error calculated on the
training data. Training for several epochs helps the model generalize and make accurate
predictions on unseen data.
What is Resize
Resizing adjusts image dimensions so that all images have a uniform size, which helps model
compatibility and computational efficiency. Without resizing, models may face input size
inconsistencies and difficulties in feature extraction.
What is Normalization:
Normalization scales the pixel values of images into a standardized range, often between 0
and 1, by dividing each pixel value by the maximum possible value (255 for an 8-bit image).
This improves model convergence and numerical stability during training. Without
normalization, overall performance may suffer and the learning process may stall.
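As a small illustration of the scaling described above, assuming an 8-bit grayscale image stored as a nested list (the sample pixel values are made up):

```python
def normalize_pixels(image, max_value=255):
    """Scale every pixel into the range [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]

# Hypothetical 2x3 grayscale image with 8-bit pixel values.
image = [[0, 128, 255], [64, 32, 16]]
normalized = normalize_pixels(image)
```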
Mini Batch: A mini-batch is a subset of the training dataset that is used to compute the
gradient of the loss function when training the model.
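The relationship between batch size, mini-batches, and epochs can be sketched as follows. The dataset of 100 samples and the helper names are assumptions made for the example.

```python
import math

def iterate_minibatches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def count_updates(num_samples, batch_size, epochs):
    """One parameter update per mini-batch, repeated for every epoch."""
    return math.ceil(num_samples / batch_size) * epochs

dataset = list(range(100))  # stand-in for 100 training samples
batches = list(iterate_minibatches(dataset, batch_size=32))
```

With 100 samples and a batch size of 32, one epoch consists of four mini-batches, the last holding only the four remaining samples.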
Internal Batch Normalization:
Applied within a neural network layer, normalizing the inputs during training to mitigate
internal covariate shift.
Commonly used in feedforward neural networks and convolutional neural networks (CNNs)
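A minimal sketch of the normalization step inside a batch normalization layer, for one feature across a batch. The scale `gamma`, shift `beta`, and `eps` follow the usual formulation; the running statistics that are used at inference time are left out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch to zero mean and unit variance,
    then apply the scale (gamma) and shift (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

normalized_batch = batch_norm([1.0, 2.0, 3.0, 4.0])
```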
External Normalization:
Normalization applied externally to the network, typically at the input or output of the
entire network. Useful in transfer learning.
Without data augmentation, the model may overfit and perform poorly on new data.
Transfer learning
Transfer learning is a technique in which a pre-trained model is adapted for a new task. The
model is typically pre-trained on a large dataset, and reusing it enhances computational
efficiency and improves generalization by leveraging knowledge (features, weights, etc.)
gained from the source task.
Idea
TRADITIONAL VS TRANSFER LEARNING
Image Classification:
Definition: Classifies an entire image into predefined classes.
Example: Identifying objects in an image, such as classifying a photo as
containing a cat or a dog.
Output: Single label or probability distribution over classes.
Object Detection:
Image Segmentation:
Back Propagation:
During training, backpropagation computes the gradient of the error with respect to each
parameter, and the parameters are updated to bring the error to a minimum.
convolutional layer
A convolutional layer is a building block of Convolutional Neural Networks (CNNs),
designed for processing and analyzing structured images. Convolutional layers are
particularly effective in capturing local patterns and hierarchies of features in the
input data.
1.Dense:
•It's commonly used in the final layers of neural networks for tasks like classification
and regression.
2.Dropout:
3.Sequential Model:
•It's suitable for building straightforward feedforward neural networks where each
layer has one input and one output.
4.AveragePooling2D:
6.Padding:
•Padding is the process of adding extra (typically zero) values around the input data
before applying convolution operations.
•Padding helps control the spatial dimensions of feature maps after convolution and
prevents information loss from the image boundaries.
7.Flattening:
•It's often used to connect convolutional or pooling layers to fully connected layers.
•Loss functions measure the error between predicted and actual values and guide the
optimization process during training.
•Mean Squared Error (MSE) is used for regression tasks, while Categorical Cross-
Entropy is common for classification tasks.
•The choice of loss function depends on the specific problem being addressed.
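The two loss functions named above can be written out as a short sketch. The `eps` guard against log(0) is an implementation detail added here, not something the notes specify.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, used for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
cce = categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
```

The cross-entropy shrinks towards zero as the predicted probability of the true class approaches 1.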
Activation Function (AF)
Activation functions are mathematical operations applied to the output of each neuron in a
neural network. They introduce non-linearity, which lets the network learn complex patterns
and relationships in the data. Without activation functions, a neural network would reduce
to a linear transformation, with little ability to capture non-linear structure. Activation
functions also take part in backpropagation: gradients flow through them, allowing the
model to adjust weights and optimize for improved performance.
Sigmoid Activation Function: The sigmoid activation function maps input values to the
range between 0 and 1, which makes it suitable for binary classification tasks. It
processes inputs by plotting them onto a sigmoid curve, producing outputs that can be
interpreted as probabilities.
Although commonly used in binary classification output layers, the sigmoid function may
suffer from the vanishing gradient problem when applied to hidden layers. Therefore, it is
not frequently used in hidden layers, where the vanishing gradient issue can hinder
effective training.
When the input to the sigmoid function is strongly negative or strongly positive, the
gradient becomes very small and approaches zero. The weights then cannot be updated
properly during backpropagation, the error signal is not passed on to earlier layers, and
the network fails to learn.
ReLU Activation Function: The Rectified Linear Unit (ReLU) activation function outputs
values in the range from 0 to infinity. It promotes sparsity and accelerates training in
hidden layers by setting negative inputs to zero and leaving positive values unchanged. It
is commonly used in hidden layers to address vanishing gradient issues.
Tanh Activation Function: The hyperbolic tangent (tanh) activation function maps input
values to the range (-1, 1) using a tanh curve. Tanh is suitable where outputs need to be
symmetric around zero, offering advantages in modeling symmetric patterns.
Leaky ReLU Activation Function: The Leaky ReLU activation function extends ReLU
by allowing small negative values, mitigating the "dying ReLU" problem. It introduces
a small negative slope for negative inputs, preventing complete inactivation. Leaky
ReLU is often used in hidden layers of deep networks to address the issue of neurons
becoming inactive during training, thereby enhancing the model's capacity to
capture diverse features and patterns.
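The four activation functions above, together with the sigmoid gradient that causes the vanishing gradient problem, can be sketched as follows. The Leaky ReLU slope of 0.01 is a common default and an assumption here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s); at most 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negative inputs instead of a hard zero.
    return x if x > 0 else alpha * x
```

Evaluating `sigmoid_grad(10.0)` gives a value below 1e-4, which shows how little gradient survives once a sigmoid saturates.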
Used to Situation:
Adam Optimizer: The Adam optimizer, short for Adaptive Moment Estimation, is a
popular optimization algorithm that combines ideas from both momentum and
RMSprop. It adapts the learning rates for each parameter individually, providing
robustness to different types of gradients. Adam includes mechanisms to counteract
biases introduced in gradient estimation and is widely used in various deep learning
tasks due to its effectiveness and efficiency.
Adagrad Optimizer: The Adagrad optimizer adjusts the learning rates of model
parameters based on the historical gradient information. It accumulates squared
gradients in the denominator, allowing for larger updates for infrequent parameters.
While Adagrad is effective for sparse data, it can suffer from diminishing learning
rates over time, leading to slow convergence in certain scenarios.
Adamax Optimizer: The Adamax optimizer is a variant of Adam that uses the infinity
norm (maximum absolute value) of the past gradients. It has been introduced to
address potential issues in Adam where the L2 norm could become very small.
Adamax offers improved stability and is less sensitive to large gradients, making it
suitable for specific deep learning applications.
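To make the Adam description more concrete, here is a minimal single-parameter sketch of its update rule (momentum term, RMSprop-style squared-gradient term, and bias correction). The hyperparameter values are the commonly quoted defaults except for the learning rate, and the quadratic test function is an assumption for illustration.

```python
import math

def adam_minimize(grad_fn, w, steps=500, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # running mean of gradients (momentum idea)
    v = 0.0  # running mean of squared gradients (RMSprop idea)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction counteracts the zero initialization of m and v.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical objective (w - 3)^2 with gradient 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
```

Because the update is divided by the root of the squared-gradient average, the first step has a size of roughly `lr` regardless of how large the gradient is.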
global minimum:
A local minimum of a function is a point where the function value is smaller than at nearby points,
but possibly greater than at a distant point.
A global minimum is a point where the function value is smaller than at all other feasible points.
Dying ReLU Problem: The dying ReLU problem refers to a situation where neurons
using the Rectified Linear Unit (ReLU) activation function output zero for every
input, rendering them inactive. This can happen when the weights and bias of a
ReLU neuron shift so that its weighted sum is negative for all inputs, and the neuron
consistently outputs zero during forward propagation. In such cases, the gradient
through the neuron is zero, so it fails to update its weights during
backpropagation, impeding the learning process. Techniques like Leaky ReLU, which
allows small negative values, are employed to mitigate the dying ReLU problem and
maintain the adaptability of the network during training.
Overfitting techniques
1. **Data Augmentation:**
- **Purpose:** Increases the diversity of the training set, helping the model become
more robust and preventing it from memorizing specific instances.
2. **Dropout:**
3. **Regularization (L1/L2):**
- **Definition:** Introduce penalty terms in the loss function that discourage large
weights (L2 regularization) or non-zero weights (L1 regularization).
- **Purpose:** Discourages the model from fitting noise in the training data and
encourages a simpler, more generalized model.
4. **Early Stopping:**
- **Purpose:** Prevents the model from continuing to learn the training data too
well, ensuring it does not overfit by halting training at an optimal point.
6. **Batch Normalization:**
7. **Cross-Validation:**
- **Definition:** Split the dataset into multiple folds, train the model on different
subsets, and evaluate its performance on the remaining unseen data.
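Early stopping from the list above can be sketched as a patience rule on the validation loss. The `patience` parameter and the sample loss values are assumptions for illustration; real training loops usually also save the best weights.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving: halt training
    return best_epoch

# Hypothetical validation losses, one per epoch.
kept = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5], patience=2)
```

With a patience of 2, training halts after the two non-improving epochs that follow epoch 2, so the later drop to 0.5 is never seen. A larger patience would trade extra training time for the chance to find it.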
Pre_defined_algorithms
VGG is a convolutional neural network architecture known for its simplicity and
uniform structure. It consists of several convolutional layers followed by max-pooling
layers, and it achieved high performance in image classification tasks. VGG networks
come in different versions, such as VGG16 and VGG19, where the numbers indicate
the number of weight layers, i.e. the depth of the network.
**Inception (GoogLeNet):**
**MobileNet:**
MobileNet is designed for mobile and edge devices, aiming for lightweight and
efficient neural network architectures. It utilizes depthwise separable convolutions to
reduce the number of parameters and computations while maintaining competitive
accuracy. MobileNet models are suitable for real-time applications with limited
computational resources.
CNN
A Convolutional Neural Network (CNN) is an architecture designed for tasks like image
recognition. It contains convolutional layers that automatically learn hierarchical
features from input data, followed by pooling layers to reduce spatial dimensions.
Activation functions like ReLU introduce non-linearity, aiding in complex pattern
recognition. Fully connected layers capture global patterns, while normalization and
regularization layers enhance stability. CNNs use filters to extract local features.
Striding and padding control spatial dimensions. Popular architectures like VGG,
ResNet, and Inception leverage variations of these components, making CNNs highly
effective for computer vision tasks.
Stride:
The stride is the step size that the convolutional filter takes when sliding over the input
data (image or feature map) during the convolution operation. It determines how far the
filter shifts at each step.
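How stride and padding together control the spatial size of a feature map can be shown with the standard output-size formula, floor((n + 2p - k) / s) + 1. The function name is made up for this sketch.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_padding = conv_output_size(32, 3)                     # 30
same_size = conv_output_size(32, 3, padding=1)           # 32
halved = conv_output_size(32, 3, stride=2, padding=1)    # 16
```

A padding of 1 with a 3x3 filter keeps the size unchanged, while a stride of 2 roughly halves it.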
Filter
A filter is a small matrix used for the convolution operation. It is used to detect
patterns and features within input data, such as an image.
The filter is moved across the input data (e.g., an image) in a systematic way, and at
each position, it computes the dot product between its values and the values of the
input data it covers.
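The sliding dot product can be sketched directly in plain Python. The sketch assumes a square filter, no padding, and a single channel; the 4x4 image and the edge filter are made-up examples.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image and take a dot product at each position."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Hypothetical image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1]] * 4
vertical_edge_filter = [[-1, 1], [-1, 1]]
feature_map = convolve2d(image, vertical_edge_filter)
```

The feature map responds only where the filter straddles the edge, which is the sense in which a filter "detects" a pattern.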
GAN
Generative Adversarial Networks (GANs) have a generator and a discriminator network.
The generator creates synthetic data, while the discriminator learns to tell the difference
between real and generated samples. The two networks are trained simultaneously through
adversarial training.
The process begins with the initialization of the generator and discriminator networks,
assigning them random weights. The generator then takes random noise as input and
produces synthetic data, such as images. Subsequently, the discriminator evaluates both real
data from the training set and the generated fake data, assigning probabilities to determine
their originality. The adversarial loss is calculated based on the generator's objective to create
data that matches the real data without difference and the discriminator's goal to accurately
classify real and fake samples. Gradients are then backpropagated through both networks,
updating their weights to minimize their respective losses. This iterative training process
repeats, enabling the generator to generate increasingly realistic data while the discriminator
enhances its ability to differentiate between real and fake samples. Ideally, the GAN
converges to a point where the generator produces data that is challenging for the
discriminator to distinguish from real data, resulting in the generation of realistic
synthetic data. The trained GAN can be
employed to generate new data with characteristics similar to the training set, showcasing its
ability to capture and replicate underlying patterns. However, achieving this equilibrium can
be challenging, and GANs may encounter issues like mode collapse or instability during
training.
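The opposing objectives can be illustrated with the usual binary cross-entropy form of the adversarial loss for a single real and a single generated sample. The non-saturating generator loss used here is one common choice and an assumption, since the notes do not fix a specific formula.

```python
import math

def discriminator_loss(d_real, d_fake):
    """d_real: probability assigned to a real sample being real.
    d_fake: probability assigned to a generated sample being real."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """The generator wants the discriminator to call its sample real."""
    return -math.log(d_fake)

confident_d = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled_d = discriminator_loss(d_real=0.5, d_fake=0.5)
```

When the discriminator outputs 0.5 for everything it can no longer tell the samples apart, which corresponds to the equilibrium described above.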
R-CNN
The loss calculation in Region-Based CNNs (R-CNN) plays a crucial role in training
the model to accurately localize and classify objects within images. The primary
components of the loss function include classification loss and bounding box
regression loss. The classification loss measures the disparity between predicted class
labels and actual labels for each region proposal. This ensures that the algorithm
correctly identifies the object categories. Simultaneously, the bounding box
regression loss evaluates the accuracy of predicted bounding box coordinates in
relation to the ground truth. These two components are typically combined, forming
a composite loss that the algorithm strives to minimize during training.
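A hedged sketch of such a composite loss for one region proposal follows. It uses cross-entropy for classification and a smooth L1 term for the box coordinates, which is the form popularized by Fast R-CNN. The weighting factor `box_weight` and the function names are assumptions.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def detection_loss(class_probs, true_class, pred_box, true_box, box_weight=1.0):
    # Classification loss: negative log-probability of the true class.
    classification = -math.log(class_probs[true_class])
    # Regression loss: summed over the four box coordinates.
    regression = sum(smooth_l1(p - t) for p, t in zip(pred_box, true_box))
    return classification + box_weight * regression

perfect_box = detection_loss([0.2, 0.8], 1, [0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0])
```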
Faster_R-CNN
Faster R-CNN builds upon the foundations laid by R-CNN, introducing a key innovation to
enhance computational efficiency. This object detection algorithm integrates a Region
Proposal Network (RPN) directly into the Convolutional Neural Network (CNN) architecture,
eliminating the need for a separate region proposal mechanism like selective search. The
RPN efficiently generates region proposals on the CNN's feature map, evaluating potential
bounding boxes and assigning objectness scores. These proposals, along with their
associated scores, undergo feature extraction in the shared CNN, ensuring a consistent and
streamlined process. The features extracted are then employed for both object classification
and refinement of bounding box coordinates, following the principles of the original R-CNN.
The output of Faster R-CNN includes predicted class labels and refined bounding box
coordinates for each region proposal, providing accurate localization and classification of
objects within the image. This architectural refinement significantly improves processing
speed, bringing Faster R-CNN much closer to real-time operation without
compromising on detection precision.
Faced Problems
In object detection, the R-CNN, Fast R-CNN, and Faster R-CNN models face problems that
include:
The network is too slow at inference time (i.e., when dealing with non-training data).
SSD…
SSD uses a VGG, Inception, or ResNet architecture without its fully connected layers.
SSD (Single Shot MultiBox Detector) contains a backbone model and an SSD head. The
backbone acts as a feature extractor. The SSD head consists of additional convolutional
layers added to the backbone to identify bounding boxes and object classes at various
locations within the image.
Workflow in SSD:
FEATURE EXTRACTOR
The feature extractor in SSD has a stack of convolutional layers followed by pooling layers.
This arrangement aims to extract a set of multi-scale feature maps in a hierarchical fashion
from the input image. Here, lower layers focus on extracting edges and textures, the
foundational elements of the image, while higher layers capture more complex
features of objects in the image. The pooling layers, whether max pooling or average
pooling, play a crucial role. They downsample the spatial dimensions of the feature maps,
effectively reducing the resolution while retaining the most important information.
SSD uses a set of anchor boxes of different aspect ratios and scales at each spatial location in the
feature maps.
These anchor boxes act as reference boxes that are placed throughout the image. They cover a range
of sizes and aspect ratios to handle variations in object appearance.
During training, the model predicts adjustments (offsets) to these anchor boxes, allowing them to
adapt to the specific locations and sizes of objects in the image.
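Anchor placement can be sketched for a single square feature map. Boxes are in normalized (cx, cy, w, h) form, and the width and height follow the common convention w = scale * sqrt(ratio), h = scale / sqrt(ratio). The grid size, scale, and ratios below are illustrative assumptions.

```python
import math

def make_anchors(grid_size, scale, aspect_ratios):
    """One anchor per aspect ratio, centered on every feature-map cell."""
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(grid_size=4, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5])
```

A 4x4 feature map with three aspect ratios yields 48 anchors. Repeating this over several feature maps with different scales gives the multi-scale coverage described above.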
BOUNDING BOX:
The model predicts bounding box adjustments for each anchor box at each spatial
location. These adjustments consist of offsets for the box's center coordinates, width,
and height.
The predicted adjustments are applied to the corresponding anchor boxes to obtain
the final bounding box predictions.
Bounding box adjustments allow the model to accurately localize objects, and class
scores are predicted for each adjusted bounding box to determine the likelihood of
an object belonging to a specific class.
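Applying predicted offsets to an anchor can be sketched with a widely used parameterization: center shifts are scaled by the anchor size, and width and height are adjusted through an exponential. This is an assumption about the encoding; actual SSD implementations typically also scale the offsets by fixed variance factors, which are omitted here.

```python
import math

def decode_box(anchor, offsets):
    """anchor: (cx, cy, w, h); offsets: (t_x, t_y, t_w, t_h) predicted by the model."""
    a_cx, a_cy, a_w, a_h = anchor
    t_x, t_y, t_w, t_h = offsets
    cx = a_cx + t_x * a_w      # shift the center relative to the anchor width
    cy = a_cy + t_y * a_h      # shift the center relative to the anchor height
    w = a_w * math.exp(t_w)    # scale the width
    h = a_h * math.exp(t_h)    # scale the height
    return (cx, cy, w, h)

unchanged = decode_box((0.5, 0.5, 0.2, 0.4), (0.0, 0.0, 0.0, 0.0))
```

Zero offsets return the anchor itself, so the model only has to learn corrections relative to each reference box.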
Predicting HEAD
SSD has prediction heads attached to different layers of the network, each
responsible for predicting:
Bounding Box Offsets: Adjustments to the default anchor boxes to accurately fit the
target objects.
Class Scores: Confidence scores for each class, indicating the likelihood of an object
belonging to a particular category.
NMS
The predicted bounding boxes and class scores are used in post-processing steps, such as
Non-Maximum Suppression (NMS), to filter out redundant and overlapping detections.
Metrics:
IoU (Intersection over Union): Measures the overlap between the predicted bounding box and
the ground truth bounding box.
Use: Commonly used for evaluating the accuracy of object localization. IoU is
calculated as the intersection area divided by the union area of the two bounding
boxes.
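The IoU calculation can be sketched as follows, assuming boxes are given in corner format (x_min, y_min, x_max, y_max):

```python
def iou(box_a, box_b):
    """Intersection area divided by union area of two corner-format boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1 / 7
```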
Precision: Measures the accuracy of positive predictions, i.e., the ratio of correctly
predicted positive instances to the total predicted positives.
Recall (Sensitivity): Measures the ability of the model to capture all the relevant
instances, i.e., the ratio of correctly predicted positive instances to the total actual
positives.
F1 Score: The harmonic mean of precision and recall, providing a balance between
the two metrics.
Use: Evaluation of both localization and classification performance.
Average Precision (AP): Summarizes the precision-recall curve as a single value by
calculating the area under the curve (AUC).
Mean Average Precision (mAP): Averages the AP scores across different object classes.
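Precision, recall, and F1 follow directly from counts of true positives, false positives, and false negatives. The counts in the example are invented for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """tp: correct detections, fp: spurious detections, fn: missed objects."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical detector: 8 correct detections, 2 false alarms, 4 missed objects.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

In object detection, a prediction usually counts as a true positive only if its IoU with a ground truth box exceeds a chosen threshold, often 0.5.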
YOLO
YOLO, which stands for "You Only Look Once," is a real-time object detection system that can
detect multiple objects in an image in a single forward pass of the neural network. YOLO was
introduced to address the trade-off between accuracy and speed in object detection tasks.
Input Image:
Convolutional Layers:
The image is passed through a series of convolutional layers. These layers learn hierarchical
features from the input image. Each layer captures different levels of abstraction, from edges
to more complex patterns. The majority of the convolutional layers use Leaky ReLU because its
small negative slope prevents dead neurons.
The final layer, responsible for predicting class probabilities and bounding boxes, often uses
a linear activation function.
A linear activation function is used in the final layer to produce unbounded real values for
bounding box coordinates and confidence scores.
Grid Division:
The feature map resulting from the convolutional layers is divided into a grid. Each cell in the
grid is responsible for predicting bounding boxes and class probabilities for objects within its
region.
A bounding box describes the spatial location of an object, such as a ground-truth object location.
Each bounding box is represented by (x, y, w, h, confidence), where (x, y) are the coordinates
of the box's center, (w, h) are the width and height, and confidence is the confidence score.
Output Tensor:
The final output is a tensor of shape (S, S, B * (5 + C)), where S is the grid size, B is the
number of bounding boxes per grid cell, and C is the number of classes.
The tensor contains all the bounding box and class probability predictions for the entire grid.
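The shape formula can be checked with a small helper. The values S = 13, B = 5, C = 20 are assumed for the example. Note that some YOLO versions share one set of class probabilities per grid cell, which changes the last dimension; the sketch follows the formula as given here.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Each box carries (x, y, w, h, confidence) plus one score per class."""
    return (grid_size, grid_size, boxes_per_cell * (5 + num_classes))

shape = yolo_output_shape(grid_size=13, boxes_per_cell=5, num_classes=20)
# shape == (13, 13, 125)
```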
Non-Maximum Suppression:
Non-maximum suppression ensures that only the most confident and non-overlapping
bounding boxes are retained.
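A minimal sketch of greedy non-maximum suppression, assuming corner-format boxes (x_min, y_min, x_max, y_max) and an IoU threshold of 0.5. Both the threshold and the sample boxes are illustrative.

```python
def iou(a, b):
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that are kept, most confident first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In practice this is run per class, so that overlapping objects of different classes do not suppress each other.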
Def/a/b:
Bounding boxes are provided as annotations in object detection datasets, representing
ground-truth object locations. Anchor boxes are used as templates for prediction during
model training and inference. The model predicts adjustments to anchor boxes to match
objects' actual locations.