Ch#5: Deep learning for computer vision
Dr. Noman Islam
Introduction
• Convolutional neural networks, also known as convnets, are a type of deep-learning model almost universally used in computer vision applications.
• Image classification is the task of assigning a single label to a whole image.
• For example, an image classification task could label an image as a dog or a cat, given that every image shows either a dog or a cat.
Creating a ConvNet in Keras
Summary of the model
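• A minimal sketch of such a convnet in Keras, using the MNIST layer sizes discussed in the following slides (the exact architecture is illustrative); model.summary() then prints each layer's output shape and parameter count:

from keras import layers, models

# Stack of alternating Conv2D and MaxPooling2D layers for 28x28 grayscale inputs (MNIST)
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

# Flatten the final 3D feature map and add a densely connected classifier on top
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Print the layer-by-layer summary of output shapes and parameter counts
model.summary()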
Convolution Operation
• The fundamental difference between a
densely connected layer and a convolution
layer is this:
– Dense layers learn global patterns in their input feature space (for example, for an MNIST digit, patterns involving all pixels),
– whereas convolution layers learn local patterns: in the case of images, patterns found in small 2D windows of the inputs.
The patterns are translation invariant
• After learning a certain pattern in the lower-right
corner of a picture, a convnet can recognize it
anywhere: for example, in the upper-left corner.
• A densely connected network would have to learn
the pattern anew if it appeared at a new location.
• This makes convnets data efficient when processing
images (because the visual world is fundamentally
translation invariant): they need fewer training
samples to learn representations that have
generalization power.
Learn spatial hierarchies of patterns
• A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layers, and so on.
• This allows convnets to efficiently learn
increasingly complex and abstract visual
concepts (because the visual world is
fundamentally spatially hierarchical).
Feature maps
• Convolutions operate over 3D tensors, called feature
maps, with two spatial axes (height and width) as well as a
depth axis (also called the channels axis).
• The convolution operation extracts patches from its input
feature map and applies the same transformation to all of
these patches, producing an output feature map.
• This output feature map is still a 3D tensor: it has a width
and a height.
• Its depth can be arbitrary, because the output depth is a
parameter of the layer
Feature map cont…
• Filters encode specific aspects of the input data: at a high
level, a single filter could encode the concept “presence
of a face in the input,” for instance.
• In the MNIST example, the first convolution layer takes a
feature map of size (28, 28, 1) and outputs a feature map
of size (26, 26, 32): it computes 32 filters over its input.
• Each of these 32 output channels contains a 26 × 26 grid
of values, which is a response map of the filter over the
input, indicating the response of that filter pattern at
different locations in the input
• Convolutions are defined by two key
parameters:
– Size of the patches extracted from the inputs—
These are typically 3 × 3 or 5 × 5. In the example,
they were 3 × 3, which is a common choice.
– Depth of the output feature map—The number of
filters computed by the convolution. The example
started with a depth of 32 and ended with a depth
of 64.
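• In Keras, these two parameters are the first two arguments of the Conv2D layer (a minimal sketch; the activation is illustrative):

from keras import layers

# First argument = depth of the output feature map (number of filters),
# second argument = size of the patches extracted from the input
layers.Conv2D(32, (3, 3), activation='relu')
layers.Conv2D(64, (5, 5), activation='relu')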
Convolution
• A convolution works by sliding these windows of
size 3 × 3 or 5 × 5 over the 3D input feature map,
stopping at every possible location, and extracting
the 3D patch of surrounding features (shape
(window_height, window_width, input_depth)).
• Each such 3D patch is then transformed (via a
tensor product with the same learned weight
matrix, called the convolution kernel) into a 1D
vector of shape (output_depth,).
• All of these vectors are then spatially
reassembled into a 3D output map of shape
(height, width, output_depth).
• Every spatial location in the output feature
map corresponds to the same location in the
input feature map (for example, the lower-
right corner of the output contains information
about the lower-right corner of the input).
• Note that the output width and height may
differ from the input width and height. They
may differ for two reasons:
– Border effects, which can be countered by
padding the input feature map
– The use of strides
Padding
• In Conv2D layers, padding is configurable via
the padding argument, which takes two
values: "valid", which means no padding (only
valid window locations will be used); and
"same", which means “pad in such a way as to
have an output with the same width and
height as the input.”
• The padding argument defaults to "valid".
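• For example (a sketch; the shapes assume a 26 × 26 input feature map):

from keras import layers

# "valid" (the default): only full 3x3 windows are used, so a 26x26 input
# shrinks to 24x24 at the output
layers.Conv2D(64, (3, 3), padding='valid')

# "same": enough rows and columns of zeros are added around the input so
# that the output keeps the 26x26 spatial size of the input
layers.Conv2D(64, (3, 3), padding='same')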
Max pooling
• Max pooling consists of extracting windows
from the input feature maps and outputting
the max value of each channel.
• It’s conceptually similar to convolution, except
that instead of transforming local patches via a
learned linear transformation (the convolution
kernel), they’re transformed via a hardcoded
max tensor operation.
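• Max pooling is usually done with 2 × 2 windows and stride 2, which halves the spatial dimensions of the feature map (a minimal sketch):

from keras import layers

# 2x2 windows with stride 2: a 26x26x32 feature map becomes 13x13x32;
# the channel depth is unchanged, only the max value of each window is kept
layers.MaxPooling2D(pool_size=(2, 2))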
What is the problem without max pooling?
• Without downsampling, the model isn't conducive to learning a spatial hierarchy of features. The 3 × 3 windows in the third layer will only contain information coming from 7 × 7 windows in the initial input.
• The high-level patterns learned by the convnet will
still be very small with regard to the initial input,
which may not be enough to learn to classify digits
(try recognizing a digit by only looking at it through
windows that are 7 × 7 pixels!).
• We need the features from the last convolution
layer to contain information about the totality
of the input.
• The final feature map has 22 × 22 × 64 = 30,976
total coefficients per sample. This is huge. If you
were to flatten it to stick a Dense layer of size
512 on top, that layer would have 15.8 million
parameters. This is far too large for such a small
model and would result in intense overfitting.
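• A minimal sketch of this problematic no-pooling model, assuming the same MNIST-style layer sizes as before, to make the arithmetic concrete:

from keras import layers, models

# Three 3x3 convolutions with no downsampling: 28 -> 26 -> 24 -> 22
model_no_pool = models.Sequential()
model_no_pool.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model_no_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_no_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_no_pool.add(layers.Flatten())                       # 22 * 22 * 64 = 30,976 coefficients
model_no_pool.add(layers.Dense(512, activation='relu'))   # 30,976 * 512 weights, roughly 15.8 million

model_no_pool.summary()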
Why Max pooling?
• The reason to use downsampling:
– is to reduce the number of feature-map
coefficients to process,
– as well as to induce spatial-filter hierarchies by
making successive convolution layers look at
increasingly large windows
Training a convnet from scratch on a small
dataset
• Having to train an image-classification model
using very little data is a common situation,
which you’ll likely encounter in practice if you
ever do computer vision in a professional
context.
• We’ll focus on classifying images as dogs or
cats, in a dataset containing 4,000 pictures of
cats and dogs (2,000 cats, 2,000 dogs).
• You’ll start by naively training a small convnet
on the 2,000 training samples, without any
regularization, to set a baseline for what can
be achieved
• Then we’ll introduce data augmentation, a
powerful technique for mitigating overfitting
in computer vision.
• We’ll review two more essential techniques
for applying deep learning to small datasets:
– feature extraction with a pretrained network
(which will get you to an accuracy of 90% to 96%)
– and fine-tuning a pretrained network (this will get
you to a final accuracy of 97%)
• Download the dataset from www.kaggle.com/c/dogs-vs-cats/data
Image Data Generator
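• A sketch of reading the pictures from class-labelled directories with ImageDataGenerator (the directory paths are placeholders for wherever the dogs-vs-cats images were unpacked):

from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1]
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    'cats_and_dogs_small/train',   # placeholder target directory
    target_size=(150, 150),        # resize all images to 150x150
    batch_size=20,
    class_mode='binary')           # binary labels for dog vs. cat

validation_generator = test_datagen.flow_from_directory(
    'cats_and_dogs_small/validation',
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')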
Data Augmentation
• Overfitting is caused by having too few
samples to learn from, rendering you unable to
train a model that can generalize to new data
• Data augmentation takes the approach of
generating more training data from existing
training samples, by augmenting the samples
via a number of random transformations that
yield believable-looking images
Data Augmentation
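• A sketch of an augmentation configuration via ImageDataGenerator (the specific ranges are illustrative):

from keras.preprocessing.image import ImageDataGenerator

# Randomly transform the training images so the model never sees
# exactly the same picture twice
datagen = ImageDataGenerator(
    rotation_range=40,        # random rotations up to 40 degrees
    width_shift_range=0.2,    # random horizontal shifts (fraction of width)
    height_shift_range=0.2,   # random vertical shifts (fraction of height)
    shear_range=0.2,          # random shearing transformations
    zoom_range=0.2,           # random zooming inside pictures
    horizontal_flip=True,     # randomly flip half the images horizontally
    fill_mode='nearest')      # strategy for filling newly created pixels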
Adding a dropout layer to fight overfitting
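• A sketch of where the Dropout layer fits in the small dogs-vs-cats convnet, just before the densely connected classifier (layer sizes are illustrative):

from keras import layers, models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))   # randomly zero 50% of the inputs during training
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))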
Using a pre-trained ConvNet
• A pretrained network is a saved network that was
previously trained on a large dataset, typically on a large-
scale image-classification task.
• If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer-vision problems, even though these new problems may involve completely different classes than those of the original task.
Using a pre-trained model
• There are two ways to use a pretrained
network:
– feature extraction
– and fine-tuning
Feature Extraction
• Feature extraction consists of using the
representations learned by a previous network
to extract interesting features from new
samples.
• These features are then run through a new
classifier, which is trained from scratch
Feature Extraction
• Feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of its output.
• Representations learned by the convolutional base are
likely to be more generic and therefore more reusable:
the feature maps of a convnet are presence maps of
generic concepts over a picture, which is likely to be
useful regardless of the computer-vision problem at hand.
• But the representations learned by the classifier will
necessarily be specific to the set of classes on which the
model was trained—they will only contain information
about the presence probability of this or that class in the
entire picture
• Layers that come earlier in the model extract
local, highly generic feature maps (such as
visual edges, colors, and textures), whereas
layers that are higher up extract more-abstract
concepts (such as “cat ear” or “dog eye”)
• The VGG16 model, among others, comes prepackaged with
Keras.
• You can import it from the keras.applications module.
• Here’s the list of image-classification models (all pretrained on
the ImageNet dataset) that are available as part of
keras.applications:
– Xception
– Inception V3
– ResNet50
– VGG16
– VGG19
– MobileNet
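• For example, the VGG16 convolutional base can be instantiated with its ImageNet weights and without its densely connected classifier (the input shape matches the 150 × 150 dogs-vs-cats images):

from keras.applications import VGG16

# Convolutional base only: include_top=False drops the ImageNet classifier
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))

conv_base.summary()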
Feature Extraction without Data
Augmentation
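• A sketch of the fast variant: run each image through conv_base exactly once with predict, store the outputs as NumPy arrays, and train a small densely connected classifier on those arrays (the directory path and sample count are placeholders):

import numpy as np
from keras.applications import VGG16
from keras.preprocessing.image import ImageDataGenerator

# Frozen VGG16 convolutional base, as instantiated above
conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))

datagen = ImageDataGenerator(rescale=1./255)
batch_size = 20

def extract_features(directory, sample_count):
    # For 150x150 inputs the VGG16 base outputs 4x4x512 feature maps
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count,))
    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')
    i = 0
    for inputs_batch, labels_batch in generator:
        # Run each batch through the pretrained base exactly once
        features_batch = conv_base.predict(inputs_batch)
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            break  # the generator loops forever, so stop explicitly
    return features, labels

# 'cats_and_dogs_small/train' is a placeholder path; 2,000 training samples
train_features, train_labels = extract_features('cats_and_dogs_small/train', 2000)

• The extracted features are then flattened (4 × 4 × 512 = 8,192 values per sample) and fed to a standalone densely connected classifier. This is fast, but it cannot use data augmentation, because each image is seen only once.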
Feature Extraction with Data
Augmentation
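• A sketch of the slower variant: extend conv_base with Dense layers and train the whole model end to end on augmented inputs, with the base frozen so its pretrained weights are not destroyed:

from keras import layers, models
from keras.applications import VGG16

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
conv_base.trainable = False   # freeze the base so its pretrained weights stay fixed

model = models.Sequential()
model.add(conv_base)          # the whole convolutional base behaves like a single layer
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

• Because this model sees the raw images, it can be fed by the augmented ImageDataGenerator from the data-augmentation slide; it is much more expensive than the first variant, since every image passes through the base on every epoch.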
Fine Tuning
• Another widely used technique for model
reuse, complementary to feature extraction, is
fine-tuning
• Fine-tuning consists of unfreezing a few of the
top layers of a frozen model base used for
feature extraction, and jointly training both
the newly added part of the model and these
top layers.
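• A sketch of unfreezing only the top convolutional block of VGG16 (block5) while keeping all earlier layers frozen:

from keras.applications import VGG16

conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))

# Keep everything frozen except the last convolutional block (block5)
conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    layer.trainable = set_trainable   # False up to block4, True from block5_conv1 onward

• These unfrozen layers are then trained jointly with the new classifier, typically with a very small learning rate so the updates do not destroy the pretrained representations.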
