Deep Learning Lab Manual
(R-20 Syllabus)
IV B.Tech.(IT) I SEM
L T P C
0 0 3 1.5
DEEP LEARNING LAB
COURSE OBJECTIVES:
1. To focus on gathering and pre-processing tabular, visual, textual and audio data
for building deep learning models using standard Python libraries.
2. To train, improve, and deploy deep learning models on different devices.
3. To analyze the performance of different deep learning models in terms of speed,
accuracy, and size trade-offs.
List of Programs:
Week 1:
Basic image processing operations: Histogram equalization, thresholding, edge
detection, data augmentation, morphological operations
Week 2:
Implement SVM/Softmax classifier for the CIFAR-10 dataset: (i) using KNN, (ii)
using a 3-layer neural network
Week 3:
Study the effect of batch normalization and dropout in neural network
classifier.
Week 4:
Familiarization of image labelling tools for object detection, segmentation
Week 5:
Image segmentation using Mask RCNN, UNet, SegNet.
Week 6:
Object detection with single-stage and two-stage detectors (Yolo, SSD, FRCNN,
etc.)
Week 7:
Image Captioning with Vanilla RNNs
Week 8:
Image Captioning with LSTMs
Week 9:
Network Visualization: Saliency maps, Class Visualization
Week 10:
Generative Adversarial Networks
Week 11:
Chatbot using bi-directional LSTMs
Week 12:
Familiarization of cloud based computing like Google colab
COURSE OUTCOMES: After completion of the course, the student will be
able to
1. Understand the concepts of Object-Oriented Programming
2. Implement all operations on different linear data structures
3. Develop all operations on different non-linear data structures
4. Apply various searching techniques in real time scenarios
5. Apply various sorting techniques in real time scenarios
Text Book:
Deep Learning from Scratch: Building with Python from First Principles, Seth Weidman,
O'Reilly Media, September 2019.
Reference Books:
1. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Donald E. Knuth.
2. Introduction to Algorithms, Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest,
Clifford Stein, The MIT Press.
3. Open Data Structures: An Introduction (Open Paths to Enriched Learning), Thirty First
Edition, Pat Morin, UBC Press.
Week 1: Basic image processing operations: Histogram equalization, thresholding, edge
detection, data augmentation, morphological operations.
Histogram Equalization:
OpenCV has a function to do this, cv2.equalizeHist(). Its input is a grayscale image and its
output is the histogram-equalized image.
# import OpenCV
import cv2
# import NumPy
import numpy as np
# read the input image in grayscale (the file name is an assumption)
img = cv2.imread('input.jpg', 0)
# apply histogram equalization and show the original and equalized images side by side
equ = cv2.equalizeHist(img)
cv2.imshow('Histogram Equalization', np.hstack((img, equ)))
cv2.waitKey(0)
Output:
Thresholding:
Thresholding is a technique in OpenCV in which pixel values are assigned in relation to a
provided threshold value. In thresholding, each pixel value is compared with the threshold value. If
the pixel value is smaller than the threshold, it is set to 0; otherwise, it is set to a maximum value
(generally 255). Thresholding is a very popular segmentation technique, used for separating an
object considered as foreground from its background. A threshold is a value with two regions on
either side of it, i.e. below the threshold and above the threshold.
If f (x, y) < T
then f (x, y) = 0
else
f (x, y) = 255
where,
f (x, y) = Coordinate Pixel Value
T = Threshold Value.
cv2.THRESH_BINARY: If pixel intensity is greater than the set threshold, value set to 255, else
set to 0 (black).
cv2.THRESH_BINARY_INV: Inverted or Opposite case of cv2.THRESH_BINARY.
cv2.THRESH_TRUNC: If the pixel intensity value is greater than the threshold, it is truncated to the
threshold, i.e. the pixel value is set to be the same as the threshold. All other values remain the
same.
cv2.THRESH_TOZERO: Pixel intensity is set to 0 for all pixels with intensity less than the
threshold value.
cv2.THRESH_TOZERO_INV: Inverted or opposite case of cv2.THRESH_TOZERO.
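These flags are passed to cv2.threshold(). A short illustrative snippet (the input file name and the threshold value of 127 are assumptions):
import cv2
# read the image in grayscale (hypothetical file name)
img = cv2.imread('input.jpg', 0)
# apply each thresholding mode at T = 127 with a maximum value of 255
ret, thresh_binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
ret, thresh_binary_inv = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY_INV)
ret, thresh_trunc = cv2.threshold(img, 127, 255, cv2.THRESH_TRUNC)
ret, thresh_tozero = cv2.threshold(img, 127, 255, cv2.THRESH_TOZERO)
ret, thresh_tozero_inv = cv2.threshold(img, 127, 255, cv2.THRESH_TOZERO_INV)
# display one of the results
cv2.imshow('Binary Threshold', thresh_binary)
cv2.waitKey(0)
cv2.destroyAllWindows()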
Output:
Edge Detection
Edge detection is an image-processing technique that uses mathematical methods to find
edges in a digital image. It works internally by running a filter/kernel over the image to detect
discontinuities between image regions, such as sharp changes in the brightness/intensity values of
pixels. Two common forms of edge detection are gradient-based methods such as the Sobel
operator and the Canny edge detector.
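A minimal sketch of both forms in OpenCV (the image path and threshold values are assumptions):
import cv2
# read the image in grayscale (hypothetical file name)
img = cv2.imread('input.jpg', 0)
# gradient-based edges with the Sobel operator (x and y directions)
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=5)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=5)
# Canny edge detector with lower/upper hysteresis thresholds of 100 and 200
edges = cv2.Canny(img, 100, 200)
cv2.imshow('Canny edges', edges)
cv2.waitKey(0)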
Output:
Sample Image:
Morphological Operations:
Morphological operations are used to extract image components that are useful in the
representation and description of region shape.
They need two data sources: one is the input image, and the second is called the structuring
element. Morphological operators take an input image and a structuring element as input, and
these are then combined using set operators.
# organizing imports
import cv2
import numpy as np
# capture a frame from the webcam
screenRead = cv2.VideoCapture(0)
_, image = screenRead.read()
# structuring element used by the morphological operation
kernel = np.ones((5, 5), np.uint8)
# opening = erosion followed by dilation
opening = cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
# Close the window / Release webcam
screenRead.release()
Output:
Data Augmentation:
Data augmentation is the process of increasing the amount and diversity of data. We do not collect
new data; rather, we transform the data that is already present.
It helps us increase the size of the dataset and introduce variability, for example through:
Rotation
Shearing
Zooming
Cropping
Flipping
Changing the brightness level
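A hedged Keras sketch that applies several of these transformations on the fly (the file name and parameter values are assumptions, not part of the original program):
import numpy as np
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
# configure the augmentations listed above
datagen = ImageDataGenerator(
    rotation_range=40,            # rotation
    shear_range=0.2,              # shearing
    zoom_range=0.2,               # zooming
    width_shift_range=0.2,        # cropping-like shifts
    height_shift_range=0.2,
    horizontal_flip=True,         # flipping
    brightness_range=(0.5, 1.5),  # changing the brightness level
)
# load one image (hypothetical file) and generate a few augmented copies
img = img_to_array(load_img('sample.jpg'))
img = np.expand_dims(img, axis=0)
for i, batch in enumerate(datagen.flow(img, batch_size=1)):
    augmented = batch[0].astype('uint8')
    if i >= 4:   # stop after five augmented images
        break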
Output:
Week 2: Implement SVM/Softmax classifier for CIFAR-10 dataset: (i) using KNN, (ii) using
3-layer neural network.
softmax.py
import numpy as np
class Softmax(object):
    """ Softmax classifier """

    def __init__(self, inputDim, outputDim):
        # initialize the weights with small random values
        # (assumed; mirrors the Svm class constructed in runsvmsoftmax.py)
        self.W = 0.01 * np.random.randn(inputDim, outputDim)

    def calLoss(self, x, y, reg):
        """ Softmax loss function (same interface as Svm.calLoss below) """
        loss = 0.0
        dW = np.zeros_like(self.W)
        #############################################################################
        # TODO: 20 points                                                           #
        # - Compute the softmax loss and store to loss variable.                    #
        # - Compute gradient and store to dW variable.                              #
        # - Use L2 regularization                                                   #
        # Bonus:                                                                    #
        # - +2 points if done without loop                                          #
        #############################################################################
        # Calculating loss for softmax
        # calculate the score matrix
        N = x.shape[0]
        s = x.dot(self.W)
        # subtract the row-wise maximum from s for numerical stability
        s_ = s - np.max(s, axis=1, keepdims=True)
        exp_s_ = np.exp(s_)
        # calculating the normalization base
        sum_f = np.sum(exp_s_, axis=1, keepdims=True)
        # probability of each class, obtained by dividing by the base
        p = exp_s_ / sum_f
        p_yi = p[np.arange(N), y]
        # the loss is the negative log probability of the correct class
        loss_i = -np.log(p_yi)
        loss = np.sum(loss_i) / N
        # add L2 regularization
        loss += reg * np.sum(self.W * self.W)
        # gradient of the loss with respect to the scores, then the weights
        ds = p.copy()
        ds[np.arange(N), y] += -1
        dW = (x.T).dot(ds) / N
        dW = dW + (2 * reg * self.W)
        return loss, dW

    # (the train() and predict() methods, which return lossHistory and yPred
    #  respectively, follow the same structure as in the Svm class below)
runsvmsoftmax.py
import os
import time
import numpy as np
##############################################################################
# SVM CLASSIFIER #
##############################################################################
from svm import Svm
numClasses = np.max(yTrain) + 1
# (loading CIFAR-10 and splitting it into xTrain, yTrain, xVal, yVal is assumed to
#  have been done above)
classifier = Svm(xTrain.shape[1], numClasses)
print ('Start training Svm classifier')
# Training classifier
startTime = time.time()
classifier.train(xTrain, yTrain, lr=1e-7, reg=5e4, iter=1500 ,verbose=True)
print ('Training time: {0}'.format(time.time() - startTime))
# Tuneup hyper parameters (regularization strength, learning rate) by using validation data set,
# and random search technique to find the best set of parameters.
learn_rates = [0.5e-7, 1e-7, 2e-7, 6e-7]
reg_strengths = [500,5000,18000]
bestParameters = [0, 0]
bestAcc = -1
bestModel = None
print ('\nFinding best model for Svm classifier')
###############################################################################
# TODO: 5 points                                                              #
# Tuneup hyper parameters by using validation set.                            #
# - Store the best parameters in bestParameters                               #
# - Store the best model in bestModel                                         #
# - Store the best accuracy in bestAcc                                        #
###############################################################################
for rs in reg_strengths:
    for lr in learn_rates:
        #print(str(lr)+" "+str(rs))
        classifier = Svm(xTrain.shape[1], numClasses)
        classifier.train(xTrain, yTrain, lr, rs, iter=1500, verbose=False)
        valAcc = classifier.calAccuracy(xVal, yVal)
        if valAcc > bestAcc:
            bestAcc = valAcc
            bestModel = classifier
            bestParameters = [lr, rs]
        pass
##############################################################################
# SOFTMAX CLASSIFIER #
##############################################################################
from softmax import Softmax
numClasses = np.max(yTrain) + 1
print ('Start training Softmax classifier')
classifier = Softmax(xTrain.shape[1], numClasses)
# Training classifier
startTime = time.time()
classifier.train(xTrain, yTrain, lr=1e-7, reg=5e4, iter=1500 ,verbose=True)
print ('Training time: {0}'.format(time.time() - startTime))
# Tuneup hyper parameters (regularization strength, learning rate) by using validation data set,
# and random search technique to find the best set of parameters.
learn_rates = [0.5e-7,4e-7, 8e-7]
reg_strengths = [ 500,1500,7500,12000]
bestParameters = [0, 0]
bestAcc = -1
bestModel = None
print ('\nFinding best model for Softmax classifier')
###############################################################################
# TODO: 5 points                                                              #
# Tuneup hyper parameters by using validation set.                            #
# - Store the best parameters in bestParameters                               #
# - Store the best model in bestModel                                         #
# - Store the best accuracy in bestAcc                                        #
###############################################################################
for rs in reg_strengths:
    for lr in learn_rates:
        classifier = Softmax(xTrain.shape[1], numClasses)
        classifier.train(xTrain, yTrain, lr, rs, iter=1500, verbose=False)
        valAcc = classifier.calAccuracy(xVal, yVal)
        if valAcc > bestAcc:
            bestAcc = valAcc
            bestModel = classifier
            bestParameters = [lr, rs]
svm.py
import numpy as np
class Svm (object):
""" Svm classifier """
def calLoss (self, x, y, reg):
"""
Svm loss function
D: Input dimension.
C: Number of Classes.
N: Number of example.
Inputs:
- x: A numpy array of shape (batchSize, D).
- y: A numpy array of shape (N,) where value < C.
- reg: (float) regularization strength.
Returns a tuple of:
- loss as single float.
- gradient with respect to weights self.W (dW) with the same shape of self.W.
"""
loss = 0.0
dW = np.zeros_like(self.W)
#############################################################################
# TODO: 20 points #
# - Compute the svm loss and store to loss variable. #
# - Compute gradient and store to dW variable. #
# - Use L2 regularization #
# Bonus: #
# - +2 points if done without loop #
#############################################################################
#Calculating score matrix
s = x.dot(self.W)
#Score with yi
s_yi = s[np.arange(x.shape[0]),y]
#finding the delta
delta = s- s_yi[:,np.newaxis]+1
#loss for samples
loss_i = np.maximum(0,delta)
loss_i[np.arange(x.shape[0]),y]=0
loss = np.sum(loss_i)/x.shape[0]
#Loss with regularization
loss += reg*np.sum(self.W*self.W)
#Calculating ds
ds = np.zeros_like(delta)
ds[delta > 0] = 1
ds[np.arange(x.shape[0]),y] = 0
ds[np.arange(x.shape[0]),y] = -np.sum(ds, axis=1)
dW = (1/x.shape[0]) * (x.T).dot(ds)
dW = dW + (2* reg* self.W)
return loss, dW
def train (self, x, y, lr=1e-3, reg=1e-5, iter=100, batchSize=200, verbose=False):
"""
Train this Svm classifier using stochastic gradient descent.
D: Input dimension.
C: Number of Classes.
N: Number of example.
Inputs:
- x: training data of shape (N, D)
- y: output data of shape (N, ) where value < C
- lr: (float) learning rate for optimization.
- reg: (float) regularization strength.
- iter: (integer) total number of iterations.
- batchSize: (integer) number of example in each batch running.
- verbose: (boolean) Print log of loss and training accuracy.
Outputs:
A list containing the value of the loss at each training iteration.
"""
# (the body of the SGD training loop is not included in this manual: sample a random
#  batch, call self.calLoss, update self.W by -lr * dW, and append the loss to lossHistory)
return lossHistory
# predict() returns yPred and calAccuracy() returns acc; their bodies are likewise
# not included here.
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt
# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
%load_ext autoreload
%autoreload 2
import inspect
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import Terminal256Formatter

def pretty_print(func):
    source_code = inspect.getsourcelines(func)[0]
    for line in source_code:
        print(highlight(line.strip('\n'), PythonLexer(), Terminal256Formatter()), end='')
    print('')
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# Cleaning up variables to prevent loading data multiple times (which may cause memory issues)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except:
    pass
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
We would now like to classify the test data with the kNN classifier. Recall that we can break down
this process into two steps:
1. First we must compute the distances between all test examples and all train examples.
2. Given these distances, for each test example we find the k nearest examples and have them
vote for the label
# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.
# Print out implementation
pretty_print(classifier.compute_distances_two_loops)
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training and test data.
    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.
    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i, j] = np.linalg.norm(X[i] - self.X_train[j])
    return dists
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()
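The second step, voting among the k nearest neighbours, is not listed above; a minimal sketch of such a helper (the name predict_labels and its signature are assumptions, not necessarily what k_nearest_neighbor.py uses):
import numpy as np

def predict_labels(dists, y_train, k=1):
    # dists: (num_test, num_train) distance matrix, y_train: (num_train,) labels
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test, dtype=y_train.dtype)
    for i in range(num_test):
        # indices of the k closest training points for test example i
        closest_idx = np.argsort(dists[i])[:k]
        closest_y = y_train[closest_idx]
        # majority vote; ties broken in favour of the smallest label
        y_pred[i] = np.bincount(closest_y).argmax()
    return y_pred

# example: predict the 500 test labels using 5 nearest neighbours
y_test_pred = predict_labels(dists, y_train, k=5)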
Week 3: Study the effect of batch normalization and dropout in neural network classifier.
Batch Normalization:
A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its
own mean and standard deviation, and then also putting the data on a new scale with two trainable
rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.
Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also
help prediction performance). Models with batchnorm tend to need fewer epochs to complete
training. Moreover, batchnorm can also fix various problems that can cause the training to get
"stuck".
layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),
Drop Out:
Overfitting is caused by the network learning spurious patterns in the training data. To recognize
these spurious patterns, a network will often rely on very specific combinations of weights, a kind
of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the
conspiracy falls apart.
This is the idea behind dropout. To break up these conspiracies, we randomly drop out some
fraction of a layer's input units at every step of training, making it much harder for the network to
learn those spurious patterns in the training data. Instead, it has to search for broad, general
patterns, whose weight patterns tend to be more robust.
Adding Dropout:
In Keras, the Dropout layer's rate argument defines what fraction of the input units to shut off. Put
the Dropout layer just before the layer you want the dropout applied to:
keras.Sequential([
# ...
layers.Dropout(rate=0.3), # apply 30% dropout to the next layer
layers.Dense(16),
# ...
])
model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])
model.compile(
optimizer='adam',
loss='mae',
)
history = model.fit(
X_train, y_train,
validation_data=(X_valid, y_valid),
batch_size=256,
epochs=100,
verbose=0,
)
# Show the learning curves
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
Image segmentation can be performed with different techniques, such as region-based
segmentation, edge-detection-based segmentation, and clustering-based segmentation.
Mask R-CNN:
Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is widely used for object
detection tasks. For a given image, it returns the class label and bounding box coordinates for each
object in the image.
Segmentation Mask:
Once we have the RoIs based on the IoU values, we can add a mask branch to the existing
architecture. This returns the segmentation mask for each region that contains an object. It returns a
mask of size 28 X 28 for each region which is then scaled up for inference.
Once this is done, we need to install the dependencies required by Mask R-CNN.
Step 2: Install the dependencies
Here is a list of all the dependencies for Mask R-CNN:
numpy
scipy
Pillow
cython
matplotlib
scikit-image
tensorflow>=1.3.0
keras>=2.0.8
opencv-python
h5py
imgaug
IPython
import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
config = InferenceConfig()
config.display()
Loading Weights:
Next, we will create our model and load the pretrained weights which we downloaded earlier.
Make sure that the pretrained weights are in the same folder as that of the notebook otherwise you
have to give the location of the weights file:
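The model-creation and weight-loading code itself is not listed here; a minimal sketch using the same mrcnn API that appears later in this manual (the log directory and weights file name are assumptions):
from mrcnn import model as modellib
# create the model in inference mode and load the COCO-pretrained weights
model = modellib.MaskRCNN(mode="inference", model_dir="./logs", config=config)
model.load_weights("mask_rcnn_coco.h5", by_name=True)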
Now, we will define the classes of the COCO dataset which will help us in the prediction phase:
Let’s load an image and try to see how the model performs. You can use any of your images to test
the model.
# original image
plt.figure(figsize=(12,10))
skimage.io.imshow(image)
Making Predictions
# Run detection
results = model.detect([image], verbose=1)
# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'], class_names, r['scores'])
I will first take all the masks predicted by our model and store them in the mask variable. Now,
these masks are in the boolean form (True and False) and hence we need to convert them to
numbers (1 and 0). Let’s do that first:
mask = r['masks']
mask = mask.astype(int)
mask.shape
Output:
(480,640,3)
This will give us an array of 0s and 1s, where 0 means that there is no object at that particular pixel
and 1 means that there is an object at that pixel.
To print or get each segment from the image, we will create a for loop and multiply each mask
with the original image to get each segment:
for i in range(mask.shape[2]):
    temp = skimage.io.imread('sample.jpg')
    for j in range(temp.shape[2]):
        temp[:,:,j] = temp[:,:,j] * mask[:,:,i]
    plt.figure(figsize=(8,8))
    plt.imshow(temp)
Week 5: Image segmentation using Mask RCNN, UNet, SegNet.
Mask R-CNN (Regional Convolutional Neural Network) is an Instance segmentation model. In
this tutorial, we’ll see how to implement this in python with the help of the OpenCV library. If you
are interested in learning more about the inner-workings of this model, I’ve given a few links at the
reference section down below. That would help you understand the functionality of these models in
great detail.
import cv2
import os
import numpy as np
import random
import colorsys
import argparse
import time
from mrcnn import model as modellib
from mrcnn import visualize
from samples.coco.coco import CocoConfig
import matplotlib
class MyConfig(CocoConfig):
    NAME = "my_coco_inference"
    # Set batch size to 1 since we'll be running inference on one image at a time.
    # Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
def prepare_mrcnn_model(model_path, model_name, class_names, my_config):
    classes = open(class_names).read().strip().split("\n")
    print("No. of classes", len(classes))
    hsv = [(i / len(classes), 1, 1.0) for i in range(len(classes))]
    COLORS = list(map(lambda c: colorsys.hsv_to_rgb(*c), hsv))
    random.seed(42)
    random.shuffle(COLORS)
    model = modellib.MaskRCNN(mode="inference", model_dir=model_path, config=my_config)
    model.load_weights(model_name, by_name=True)
    return COLORS, model, classes
def custom_visualize(test_image, model, colors, classes, draw_bbox, mrcnn_visualize, instance_segmentation):
    detections = model.detect([test_image], verbose=1)[0]
    if mrcnn_visualize:
        matplotlib.use('TkAgg')
        visualize.display_instances(test_image, detections['rois'], detections['masks'],
                                    detections['class_ids'], classes, detections['scores'])
        return
    if instance_segmentation:
        hsv = [(i / len(detections['rois']), 1, 1.0) for i in range(len(detections['rois']))]
        colors = list(map(lambda c: colorsys.hsv_to_rgb(*c), hsv))
        random.seed(42)
        random.shuffle(colors)
    for i in range(0, detections["rois"].shape[0]):
        classID = detections["class_ids"][i]
        mask = detections["masks"][:, :, i]
        if instance_segmentation:
            color = colors[i][::-1]
        else:
            color = colors[classID][::-1]
        # To visualize the pixel-wise mask of the object
        test_image = visualize.apply_mask(test_image, mask, color, alpha=0.5)
    test_image = cv2.cvtColor(test_image, cv2.COLOR_RGB2BGR)
    if draw_bbox:
        for i in range(0, len(detections["scores"])):
            (startY, startX, endY, endX) = detections["rois"][i]
            classID = detections["class_ids"][i]
            label = classes[classID]
            score = detections["scores"][i]
            if instance_segmentation:
                color = [int(c) for c in np.array(colors[i]) * 255]
            else:
                color = [int(c) for c in np.array(colors[classID]) * 255]
            cv2.rectangle(test_image, (startX, startY), (endX, endY), color, 2)
            text = "{}: {:.2f}".format(label, score)
            y = startY - 10 if startY - 10 > 10 else startY + 10
            cv2.putText(test_image, text, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return test_image
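A short usage sketch of the two helpers above (the file names and flag values are assumptions):
colors, model, classes = prepare_mrcnn_model(model_path="./mask_rcnn_logs",
                                             model_name="mask_rcnn_coco.h5",
                                             class_names="coco_classes.txt",
                                             my_config=MyConfig())
# read a test image and convert it to RGB for the model
image = cv2.imread("sample.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
output = custom_visualize(image, model, colors, classes,
                          draw_bbox=True, mrcnn_visualize=False, instance_segmentation=True)
cv2.imshow("Mask R-CNN output", output)
cv2.waitKey(0)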
Week 6: Object detection with single stage and two stage detectors
In this tutorial, you’ll learn how to use the YOLO object detector to detect objects in both images
and video streams using Deep Learning, OpenCV, and Python.
By applying object detection, you’ll not only be able to determine what is in an image but also
where a given object resides!
We’ll start with a brief discussion of the YOLO object detector, including how the object detector
works.
From there we’ll use OpenCV, Python, and deep learning to:
Apply the YOLO object detector to images
Apply YOLO to video streams
We’ll wrap up the tutorial by discussing some of the limitations and drawbacks of the YOLO
object detector, including some of my personal tips and suggestions.
To learn how to use YOLO for object detection with OpenCV, just keep reading!
Program:
# import the necessary packages
import numpy as np
import argparse
import time
import cv2
import os
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
help="path to input image")
ap.add_argument("-y", "--yolo", required=True,
help="base path to YOLO directory")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
help="threshold when applying non-maxima suppression")
args = vars(ap.parse_args())
# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join([args["yolo"], "coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")
# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
dtype="uint8")
# derive the paths to the YOLO weights and model configuration
weightsPath = os.path.sep.join([args["yolo"], "yolov3.weights"])
configPath = os.path.sep.join([args["yolo"], "yolov3.cfg"])
# load our YOLO object detector trained on COCO dataset (80 classes)
print("[INFO] loading YOLO from disk...")
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
# load our input image and grab its spatial dimensions
image = cv2.imread(args["image"])
(H, W) = image.shape[:2]
# determine only the *output* layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]
# construct a blob from the input image and then perform a forward
# pass of the YOLO object detector, giving us our bounding boxes and
# associated probabilities
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
swapRB=True, crop=False)
net.setInput(blob)
start = time.time()
layerOutputs = net.forward(ln)
end = time.time()
# show timing information on YOLO
print("[INFO] YOLO took {:.6f} seconds".format(end - start))
# initialize our lists of detected bounding boxes, confidences, and
# class IDs, respectively
boxes = []
confidences = []
classIDs = []
# loop over each of the layer outputs
for output in layerOutputs:
# loop over each of the detections
for detection in output:
# extract the class ID and confidence (i.e., probability) of
# the current object detection
scores = detection[5:]
classID = np.argmax(scores)
confidence = scores[classID]
# filter out weak predictions by ensuring the detected
# probability is greater than the minimum probability
3
7
if confidence > args["confidence"]:
# scale the bounding box coordinates back relative to the
# size of the image, keeping in mind that YOLO actually
# returns the center (x, y)-coordinates of the bounding
# box followed by the boxes' width and height
box = detection[0:4] * np.array([W, H, W, H])
(centerX, centerY, width, height) = box.astype("int")
# use the center (x, y)-coordinates to derive the top and
# and left corner of the bounding box
x = int(centerX - (width / 2))
y = int(centerY - (height / 2))
# update our list of bounding box coordinates, confidences,
# and class IDs
boxes.append([x, y, int(width), int(height)])
confidences.append(float(confidence))
classIDs.append(classID)
# apply non-maxima suppression to suppress weak, overlapping bounding
# boxes
idxs = cv2.dnn.NMSBoxes(boxes, confidences, args["confidence"],
args["threshold"])
# ensure at least one detection exists
if len(idxs) > 0:
    # loop over the indexes we are keeping
    for i in idxs.flatten():
        # extract the bounding box coordinates
        (x, y) = (boxes[i][0], boxes[i][1])
        (w, h) = (boxes[i][2], boxes[i][3])
        # draw a bounding box rectangle and label on the image
        color = [int(c) for c in COLORS[classIDs[i]]]
        cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
        text = "{}: {:.4f}".format(LABELS[classIDs[i]], confidences[i])
        cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
# show the output image
cv2.imshow("Image", image)
cv2.waitKey(0)
python yolo.py --image images/baggage_claim.jpg --yolo yolo-coco
[INFO] loading YOLO from disk...
[INFO] YOLO took 0.347815 seconds
python yolo.py --image images/living_room.jpg --yolo yolo-coco
[INFO] loading YOLO from disk...
[INFO] YOLO took 0.340221 seconds
The key idea behind SSD is to perform object detection in a single shot, as opposed to two-stage
methods such as R-CNN and its variants, which use region proposals followed by classification. In
SSD, a convolutional neural network (CNN) is trained to predict object class scores and bounding
box offsets, directly from the feature maps generated by the base network.
One of the important innovations in SSD is the use of multiple feature maps with different
resolutions, which allows the network to handle objects of various sizes in an effective manner.
The network generates a set of default boxes over different aspect ratios and scales for each feature
map location and predicts class scores and bounding box offsets for each of these default boxes.
SSD combines predictions from multiple feature maps to achieve a balance between accuracy and
speed. It is computationally efficient, as it eliminates the bounding box proposals and the subsequent
pixel or feature resampling stage. Additionally, SSD uses a small convolutional filter to predict
object categories and offsets in bounding box locations and applies these filters to multiple feature
maps from the later stages of the network to perform detection at multiple scales.
So:
Single Shot: means that the tasks of object localization and classification are done in a single
forward pass of the network.
MultiBox: is the name of a technique for bounding box regression developed by Szegedy et al.
(we will briefly cover it shortly).
Detector: The network is an object detector that also classifies those detected objects.
SSD Architecture :
The SSD object detection composes of 2 parts:
Feature Extractor.
Objects Detector.
Multiboxes are like anchors of Fast R-CNN. We have multiple default boxes of different sizes, and
aspect ratios across the entire image as shown below. SSD uses 8732 boxes. This helps with
finding the default box that most overlaps with the ground truth bounding box containing objects.
MultiBox's loss function combines two critical components that made their way into SSD:
Confidence Loss: this measures how confident the network is of the objectness of the computed
bounding box. Categorical cross-entropy is used to compute this loss.
Location Loss: this measures how far away the network's predicted bounding boxes are from the
ground-truth ones in the training set. An L2 norm is used here.
Total loss: L = (1/N) * (L_conf + alpha * L_loc), where N is the number of matched default boxes.
The alpha term helps us balance the contribution of the location loss.
Program:
conda create -n od python=3.9
conda activate od
pip install opencv-python numpy imutils
from imutils.video import FPS
import numpy as np
import imutils
import cv2
use_gpu = True
live_video = False
confidence_level = 0.5
fps = FPS().start()
ret = True
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
"sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))
net = cv2.dnn.readNetFromCaffe('ssd_files/MobileNetSSD_deploy.prototxt',
'ssd_files/MobileNetSSD_deploy.caffemodel')
if use_gpu:
    print("[INFO] setting preferable backend and target to CUDA...")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
print("[INFO] accessing video stream...")
if live_video:
    vs = cv2.VideoCapture(0)
else:
    vs = cv2.VideoCapture('test.mp4')
while ret:
    ret, frame = vs.read()
    if ret:
        frame = imutils.resize(frame, width=400)
        (h, w) = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(frame, 0.007843, (300, 300), 127.5)
        net.setInput(blob)
        detections = net.forward()
        for i in np.arange(0, detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence > confidence_level:
                idx = int(detections[0, 0, i, 1])
                box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
                (startX, startY, endX, endY) = box.astype("int")
                label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
                cv2.rectangle(frame, (startX, startY), (endX, endY), COLORS[idx], 2)
                y = startY - 15 if startY - 15 > 15 else startY + 15
                cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                            COLORS[idx], 2)
        cv2.imshow('Live detection', frame)
        if cv2.waitKey(1) == 27:
            break
        fps.update()
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
Week 7 & 8: Image Captioning with Vanilla RNNs / LSTMs
The main challenge of this task is to capture how objects relate to each other in the image and to
express them in a natural language (like English). Image captioning has a variety of uses, including
editing software recommendations, virtual assistants, image indexing, accessibility for visually
impaired people, social media, and other natural language processing applications.
1) Using pretrained CNN to extract image features. A pretrained VGG16 CNN will be used to
extract image features which will be concatenated with the RNN output.
2) Prepare training data. The training captions will be tokenized and embedded using the
GLOVE word embeddings. The embeddings will be fed into the RNN.
3) Model definition
4) Training the model
5) Generating novel image captions using the trained model. Test images and images from the
internet will be used as input to the trained model to generate captions. The captions will be
examined to determine the weaknesses of the model and suggest improvements.
6) Beam search. We will use beam search to generate better captions using the model.
7) Model evaluation. The model will be evaluated using the BLEU and ROUGE metrics.
base_model = VGG16(include_top=True)
base_model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
The feature extraction model will use the VGG16 input as model input. However, the second last
layer "fc2" of VGG16 will be used as the output of our extraction model. This is so because we do
not need the final softmax layer of VGG16.
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc2').output)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
=================================================================
Total params: 134,260,544
Trainable params: 134,260,544
Non-trainable params: 0
_________________________________________________________________
After the image model has been defined, we will use it to extract the features of all the images.
from os import listdir
from pickle import dump
from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg16 import preprocess_input

features = dict()
for file in listdir('Flicker8k_Dataset'):
    img_path = 'Flicker8k_Dataset/' + file
    img = load_img(img_path, target_size=(224, 224))  # size is 224,224 by default
    x = img_to_array(img)            # change to np array
    x = np.expand_dims(x, axis=0)    # expand to include batch dim at the beginning
    x = preprocess_input(x)          # make input conform to VGG16 input format
    fc2_features = model.predict(x)
    features[file.split('.')[0]] = fc2_features   # store features keyed by image id (assumed)
dump(features, open('features.pkl', 'wb'))  # cannot use JSON because ndarray is not JSON serializable
We first define a function that can load the training/test/dev ids that are stored in corresponding
files.
def load_data_set_ids(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    dataset = list()
    for image_id in text.split('\n'):
        if len(image_id) < 1:
            continue
        dataset.append(image_id)
    return set(dataset)
training_set = load_data_set_ids('Flickr_8k.trainImages.txt')
dev_set = load_data_set_ids('Flickr_8k.devImages.txt')
test_set = load_data_set_ids('Flickr_8k.testImages.txt')
After the images for each set are identified, we clean up the captions by removing punctuation,
keeping only alphabetic words of more than one character, and adding <START> and <END> tokens:
translator = str.maketrans("", "", string.punctuation) #translation table that maps all punctuation to
None
image_captions = dict()
image_captions_train = dict()
image_captions_dev = dict()
image_captions_test = dict()
image_captions_other = dict()
corpus = list() #corpus used to train tokenizer
corpus.extend(['<START>', '<END>', '<UNK>']) #add SOS and EOS to list first
max_imageCap_len = 0
# (the following runs inside a loop over the raw caption file, where each line has been
#  split into image_id and image_cap; the enclosing loop is not shown in this manual)
image_cap = image_cap.split(' ')                          # split the caption into words
image_cap = [w for w in image_cap if w.isalpha()]         # keep only words that are all letters
image_cap = [w for w in image_cap if len(w) > 1]          # drop single-character words
image_cap = '<START> ' + ' '.join(image_cap) + ' <END>'   # add sentence start/end tokens
# add to dictionary
if image_id not in image_captions:
    image_captions[image_id] = list()   # create a new list if it does not yet exist
image_captions[image_id].append(image_cap)
fid = open("image_captions.pkl","wb")
dump(image_captions, fid)
fid.close()
fid = open("image_captions_train.pkl","wb")
dump(image_captions_train, fid)
fid.close()
fid = open("image_captions_dev.pkl","wb")
dump(image_captions_dev, fid)
fid.close()
fid = open("image_captions_test.pkl","wb")
dump(image_captions_test, fid)
fid.close()
fid = open("image_captions_other.pkl","wb")
dump(image_captions_other, fid)
fid.close()
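The caption_train_tokenizer saved below is not created anywhere in the listings above; a minimal sketch of the assumed step:
from keras.preprocessing.text import Tokenizer
# fit a word-level tokenizer on the training corpus built above
caption_train_tokenizer = Tokenizer()
caption_train_tokenizer.fit_on_texts(corpus)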
fid = open("caption_train_tokenizer.pkl","wb")
dump(caption_train_tokenizer, fid)
fid.close()
fid = open("corpus.pkl","wb")
dump(corpus, fid)
fid.close()
corpus_count=Counter(corpus)
fid = open("corpus_count.pkl","wb")
dump(corpus_count, fid)
fid.close()
embeddings_index = dict()
fid = open('glove.6B.50d.txt', encoding="utf8")
for line in fid:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
fid.close()
EMBEDDING_DIM = 50
word_index = caption_train_tokenizer.word_index
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
# copy the GloVe vector for every word the tokenizer knows; words without a
# GloVe entry keep the all-zero row
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
fid = open("embedding_matrix.pkl","wb")
dump(embedding_matrix, fid)
fid.close()
When using the RNN as the language model and an affine network to generate words, we need to
feed the already-generated caption into the model and get the next word. Therefore, to generate a
caption of n words, the model needs to run n+1 times (n words plus the end token). During training,
we also need to run the model n+1 times, and generate a separate training sequence for each run.
There are 6000 images in the training data set, and 5 captions for each image. The maximum length
of a caption is 33 words. This comes to a maximum of 6000 × 5 × 33, or 990,000, training samples.
Generating this many training samples at the same time (keeping in mind we need to concatenate
the image features to each sample too) would require a memory size of at least 32 GB.
Therefore, we will generate the training data on-the-fly, just before the model requires it. That is,
we will generate the training data one batch at a time, and then input the data into the model as
needed. This is often called progressive loading.
We first define a function that takes a caption of length n and generates the n+1 training samples.
# (the enclosing data_generator function and the create_sequences helper are assumed;
#  only the loop body below appears in the original listing)
def data_generator(descriptions, photos, tokenizer, max_length, batch_size, vocab_size):
    X1, X2, Y = list(), list(), list()
    current_batch_size = 0
    while True:
        for key, desc_list in descriptions.items():
            imageFeature_id = key.split('.')[0]
            photo = photos[imageFeature_id][0]
            in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo, vocab_size)
            #in_img = np.squeeze(in_img)
            X1.extend(in_img)
            X2.extend(in_seq)
            Y.extend(out_word)
            current_batch_size += 1
            if current_batch_size == batch_size:
                current_batch_size = 0
                yield [[np.array(X1), np.array(X2)], np.array(Y)]
                X1, X2, Y = list(), list(), list()
We test our progressive-loading generator.
output:
(47, 4096)
(47, 33)
(47, 7057)
3 Model Definition
We are finally ready to define our model. We use the VGG16 model as our base model for the
CNN. We replace its last softmax layer with an affine layer with 256 outputs followed by a dropout
layer; the original layers of the VGG16 model are frozen. The image is fed into the input of the
VGG16 layers. We take the GloVe embedding and also freeze its parameters. The words are fed as
input to the embedding, and the output of the embedding is fed into an LSTM with 256 states. The
output of the LSTM (256 dimensions) and the output of the CNN (256 dimensions) are
concatenated together to form a 512-dimensional input to a dense layer. The output of the dense
layer is fed into a softmax function.
fid = open("embedding_matrix.pkl","rb")
embedding_matrix = load(fid)
fid.close()
caption_max_length = 33
vocab_size = 7506
post_rnn_model_concat = define_model_concat(vocab_size, caption_max_length,
embedding_matrix)
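define_model_concat itself is not listed in this manual; a minimal sketch that is consistent with the summary printed below (the 0.5 dropout rate and the choice of optimizer are assumptions):
from keras.models import Model
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, concatenate

def define_model_concat(vocab_size, max_length, embedding_matrix):
    # image-feature branch: 4096-d fc2 features -> dropout -> 256-d dense
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # caption branch: word indices -> frozen GloVe embedding -> dropout -> LSTM(256)
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 50, weights=[embedding_matrix],
                    mask_zero=True, trainable=False)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # concatenate the two 256-d vectors and decode to a distribution over the vocabulary
    decoder1 = concatenate([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model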
__________________________________________________________________________________________
Layer (type)                 Output Shape       Param #     Connected to
==========================================================================================
input_9 (InputLayer)         (None, 33)         0
input_8 (InputLayer)         (None, 4096)       0
embedding_4 (Embedding)      (None, 33, 50)     375300      input_9[0][0]
dropout_7 (Dropout)          (None, 4096)       0           input_8[0][0]
dropout_8 (Dropout)          (None, 33, 50)     0           embedding_4[0][0]
dense_10 (Dense)             (None, 256)        1048832     dropout_7[0][0]
lstm_4 (LSTM)                (None, 256)        314368      dropout_8[0][0]
concatenate_1 (Concatenate)  (None, 512)        0           dense_10[0][0]
                                                            lstm_4[0][0]
dense_11 (Dense)             (None, 256)        131328      concatenate_1[0][0]
dense_12 (Dense)             (None, 7506)       1929042     dense_11[0][0]
==========================================================================================
Total params: 3,798,870
Trainable params: 3,423,570
Non-trainable params: 375,300
4 Training Model
We use the progressive loading data generator to generate the training data on-the-fly. For each
batch, we generate training data from 6 images.
fid = open("features.pkl","rb")
image_features = load(fid)
fid.close()
fid = open("caption_train_tokenizer.pkl","rb")
caption_train_tokenizer = load(fid)
fid.close()
fid = open("image_captions_train.pkl","rb")
image_captions_train = load(fid)
fid.close()
fid = open("image_captions_dev.pkl","rb")
image_captions_dev = load(fid)
fid.close()
caption_max_length = 33
batch_size = 100
vocab_size = 7506
#generator = data_generator(image_captions_train, image_features, caption_train_tokenizer,
caption_max_length, batch_size, vocab_size)
#epochs = 2
#steps = len(image_captions_train)
#steps_per_epoch = np.floor(steps/batch_size)
batch_size = 6
steps = len(image_captions_train)
steps_per_epoch = np.floor(steps/batch_size)
epochs = 3
for i in range(epochs):
    # create the data generator
    generator = data_generator(image_captions_train, image_features, caption_train_tokenizer,
                               caption_max_length, batch_size, vocab_size)
    # fit for one epoch
    post_rnn_model_concat_hist = post_rnn_model_concat.fit_generator(generator, epochs=1,
                                                                     steps_per_epoch=steps, verbose=1)
    # save model
    post_rnn_model_concat.save('modelConcat_1_' + str(i) + '.h5')
Epoch 1/1
6000/6000 [==============================] - 6933s 1s/step - loss: 3.8574 - acc: 0.2588
Epoch 1/1
6000/6000 [==============================] - 6904s 1s/step - loss: 3.0718 - acc: 0.3152
Epoch 1/1
6000/6000 [==============================] - 7606s 1s/step - loss: 2.8371 - acc: 0.3410
The training is terminated after 3 epochs. The loss was around 8 at the beginning of the training
process and quickly went down to about 3 after 3 epochs. Training for more epochs would further reduce the loss.
base_model = VGG16(include_top=True)
feature_extract_pred_model = Model(inputs=base_model.input,
outputs=base_model.get_layer('fc2').output)
def extract_feature(model, file_name):
    img = load_img(file_name, target_size=(224, 224))  # size is 224,224 by default
    x = img_to_array(img)            # change to np array
    x = np.expand_dims(x, axis=0)    # expand to include batch dim at the beginning
    x = preprocess_input(x)          # make input conform to VGG16 input format
    fc2_features = model.predict(x)
    return fc2_features
# load the tokenizer
caption_train_tokenizer = load(open('caption_train_tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 33
# load the model
#pred_model = load_model('model_3_0.h5')
pred_model = load_model('modelConcat_1a_2.h5')
To generate a caption, we first initialize the caption with the <START> token. We then input the
caption into the model, which outputs the next word. The generated word is appended to the end of
the caption and fed back into the model. This iterative process stops when the <END> token is
generated or the maximum caption length is reached.
Caption generation is a challenging artificial intelligence problem where a textual description must
be generated for a given photograph. It requires both methods from computer vision to understand
the content of the image and a language model from the field of natural language processing to turn
the understanding of the image into words in the right order. Recently, deep learning methods have
achieved state-of-the-art results on examples of this problem.
Deep learning methods have demonstrated state-of-the-art results on caption generation problems.
What is most impressive about these methods is a single end-to-end model can be defined to
predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of
specifically designed models.
import string
import os
import pandas as pd
from time import time
import cv2
from keras import Input, layers
from keras import optimizers
from keras.optimizers import Adam
from keras.preprocessing import sequence
from keras.preprocessing import image
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
descriptions['1000268201_693b08cb0e']
Week 9: Network Visualization: Saliency maps, Class Visualization
The saliency map is a key theme in deep learning and computer vision. During the training of a deep
convolutional neural network, it becomes essential to know the feature map of every layer, because the
feature maps of a CNN tell us what the model has learned. If we want to know which part of an image a
model focuses on, the saliency map is the tool for the job: it highlights the specific pixels of the image
that most influence the prediction while ignoring the rest.
We begin by creating a ResNet50 with ImageNet weights. With a few simple helper functions, we load
the image from disk and prepare it for feeding to the ResNet50.
# Import necessary packages
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

def input_img(path):
    image = tf.image.decode_png(tf.io.read_file(path))
    image = tf.expand_dims(image, axis=0)
    image = tf.cast(image, tf.float32)
    image = tf.image.resize(image, [224, 224])
    return image

def normalize_image(img):
    grads_norm = img[:, :, 0] + img[:, :, 1] + img[:, :, 2]
    grads_norm = (grads_norm - tf.reduce_min(grads_norm)) / (tf.reduce_max(grads_norm) - tf.reduce_min(grads_norm))
    return grads_norm

def get_image():
    import urllib.request
    filename = 'image.jpg'
    img_url = r"https://upload.wikimedia.org/wikipedia/commons/d/d7/White_stork_%28Ciconia_ciconia%29_on_nest.jpg"
    urllib.request.urlretrieve(img_url, filename)

def plot_maps(img1, img2, vmin=0.3, vmax=0.7, mix_val=2):
    f = plt.figure(figsize=(15, 45))
    plt.subplot(1, 3, 1)
    plt.imshow(img1, vmin=vmin, vmax=vmax, cmap="ocean")
    plt.axis("off")
    plt.subplot(1, 3, 2)
    plt.imshow(img2, cmap="ocean")
    plt.axis("off")
    plt.subplot(1, 3, 3)
    plt.imshow(img1 * mix_val + img2 / mix_val, cmap="ocean")
    plt.axis("off")
Input image
ResNet50 will be loaded directly from Keras applications to produce the prediction vector.
test_model = tf.keras.applications.resnet50.ResNet50()
#test_model.summary()
get_image()
img_path = "image.jpg"
input_img = input_img(img_path)
input_img = tf.keras.applications.densenet.preprocess_input(input_img)
plt.imshow(normalize_image(input_img[0]), cmap = "ocean")
result = test_model(input_img)
max_idx = tf.argmax(result,axis = 1)
tf.keras.applications.imagenet_utils.decode_predictions(result.numpy())
A GradientTape function is available on TensorFlow 2.x that is capable of handling the backpropagation
related operations. Here, we will utilize the benefits of GradientTape to compute the saliency map of the
given image.
with tf.GradientTape() as tape:
    tape.watch(input_img)
    result = test_model(input_img)
    max_score = result[0, max_idx[0]]
grads = tape.gradient(max_score, input_img)
plot_maps(normalize_image(grads[0]), normalize_image(input_img[0]))
Week 10: Generative Adversarial Networks
To begin, you need to install torchvision in the activated gan conda environment. Then define a
transform that converts the MNIST images to tensors and normalizes them; a typical choice is:
import torch
import torchvision
transform = torchvision.transforms.Compose(
    [torchvision.transforms.ToTensor(), torchvision.transforms.Normalize((0.5,), (0.5,))]
)
train_set = torchvision.datasets.MNIST(
    root=".", train=True, download=True, transform=transform
)
The argument download=True ensures that the first time you run the above code, the MNIST dataset will be
downloaded and stored in the current directory, as indicated by the argument root.
Now that you’ve created train_set, you can create the data loader as you did before:
batch_size = 32
train_loader = torch.utils.data.DataLoader(
train_set, batch_size=batch_size, shuffle=True
)
You can use Matplotlib to plot some samples of the training data. To improve the visualization, you can use
cmap=gray_r to reverse the color map and plot the digits in black over a white background:
# (the sample-plotting snippet is not reproduced in this manual)
from torch import nn
# device selection (assumed; the tutorial defines this earlier)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # the first layers of the generator are not shown in the original listing;
        # a 100-dimensional latent vector is assumed as input, matching the training loop
        self.model = nn.Sequential(
            nn.Linear(100, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, 784),
            nn.Tanh(),
        )

    def forward(self, x):
        output = self.model(x)
        output = output.view(x.size(0), 1, 28, 28)
        return output

generator = Generator().to(device=device)
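The discriminator used by the optimizers and the training loop below is not shown in this manual; a minimal sketch of a matching discriminator for flattened 28 x 28 images (layer sizes are assumptions):
import torch
from torch import nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # classifies a flattened 28x28 image as real (1) or generated (0)
        self.model = nn.Sequential(
            nn.Linear(784, 1024),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x.view(x.size(0), 784)   # flatten the image
        return self.model(x)

discriminator = Discriminator().to(device=device)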
Training the model:
lr = 0.0001
num_epochs = 50
loss_function = nn.BCELoss()
optimizer_discriminator = torch.optim.Adam(discriminator.parameters(), lr=lr)
optimizer_generator = torch.optim.Adam(generator.parameters(), lr=lr)
for epoch in range(num_epochs):
    for n, (real_samples, mnist_labels) in enumerate(train_loader):
        # Data for training the discriminator
        real_samples = real_samples.to(device=device)
        real_samples_labels = torch.ones((batch_size, 1)).to(device=device)
        latent_space_samples = torch.randn((batch_size, 100)).to(device=device)
        generated_samples = generator(latent_space_samples)
        generated_samples_labels = torch.zeros((batch_size, 1)).to(device=device)
        all_samples = torch.cat((real_samples, generated_samples))
        all_samples_labels = torch.cat(
            (real_samples_labels, generated_samples_labels)
        )

        # Training the discriminator
        discriminator.zero_grad()
        output_discriminator = discriminator(all_samples)
        loss_discriminator = loss_function(
            output_discriminator, all_samples_labels
        )
        loss_discriminator.backward()
        optimizer_discriminator.step()

        # Data for training the generator
        latent_space_samples = torch.randn((batch_size, 100)).to(device=device)

        # Training the generator (the next two lines are not in the original listing
        # but are needed before backpropagating through the generator)
        generator.zero_grad()
        generated_samples = generator(latent_space_samples)
        output_discriminator_generated = discriminator(generated_samples)
        loss_generator = loss_function(
            output_discriminator_generated, real_samples_labels
        )
        loss_generator.backward()
        optimizer_generator.step()

        # Show loss
        if n == batch_size - 1:
            print(f"Epoch: {epoch} Loss D.: {loss_discriminator}")
            print(f"Epoch: {epoch} Loss G.: {loss_generator}")
Week 11: Chatbot using bi-directional LSTMs
In the sentence "boys go to ….." we cannot fill in the blank on its own. But when we also have the
future sentence "boys come out of school", we can easily predict the missing word; this is exactly
what we want our model to do, and a bidirectional LSTM allows the neural network to do it by
passing information through both a forward layer and a backward layer. Bi-LSTMs are usually
employed for sequence-to-sequence tasks, and this kind of network can be used in text
classification, speech recognition and forecasting models. Next, we are going to build a chatbot
model using Python.
Step 1: Import all the packages
import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing
Step 2: Download all the data from kaggle
!pip install kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d kausr25/chatterbotenglish
!unzip /content/chatterbotenglish.zip
!wget https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true -O chatbot_nlp.zip
!unzip chatbot_nlp.zip
Step 3: Preprocess the data
For encoder_input_data: Tokenize the questions and pad them to their maximum length.
For decoder_input_data: Tokenize the answers and pad them to their maximum length.
For decoder_output_data: Tokenize the answers and remove the first element from all the tokenized_answers. This is the start element which was added earlier.
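The code below assumes that questions and answers (parallel lists of strings parsed from the downloaded dataset) and a fitted tokenizer already exist. A minimal sketch of that preparation, with placeholder data, might look like this:
from tensorflow.keras import preprocessing

# Placeholder examples only: in the lab, questions/answers are parsed from the
# downloaded chatbot dataset, and each answer is wrapped with start/end markers.
questions = ["how are you", "what is your name"]
answers = ["<START> i am fine <END>", "<START> i am a chatbot <END>"]

# The Tokenizer strips punctuation and lowercases text, so <START>/<END>
# become the tokens 'start' and 'end' used during inference below.
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(questions + answers)
VOCAB_SIZE = len(tokenizer.word_index) + 1
print('Vocabulary size:', VOCAB_SIZE)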
from gensim.models import Word2Vec
import re

vocab = []
for word in tokenizer.word_index:
    vocab.append(word)

def tokenize(sentences):
    tokens_list = []
    vocabulary = []
    for sentence in sentences:
        sentence = sentence.lower()
        sentence = re.sub('[^a-zA-Z]', ' ', sentence)
        tokens = sentence.split()
        vocabulary += tokens
        tokens_list.append(tokens)
    return tokens_list, vocabulary
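For reference, the helper above (and the gensim import) can be used to train word embeddings on the corpus. A small illustrative sketch using the gensim 4 API; the resulting vectors are not wired into the Keras model defined later:
# Train a small Word2Vec model on the tokenized questions and answers.
tokens_list, vocabulary = tokenize(questions + answers)
w2v_model = Word2Vec(tokens_list, vector_size=100, min_count=1)
print('Embedding vocabulary size:', len(w2v_model.wv))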
# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences(questions)
maxlen_questions = max([len(x) for x in tokenized_questions])
padded_questions = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen=maxlen_questions, padding='post')
encoder_input_data = np.array(padded_questions)
print(encoder_input_data.shape, maxlen_questions)
# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences(answers)
maxlen_answers = max([len(x) for x in tokenized_answers])
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
decoder_input_data = np.array(padded_answers)
print(decoder_input_data.shape, maxlen_answers)
# decoder_output_data (utils provides to_categorical)
from tensorflow.keras import utils
tokenized_answers = tokenizer.texts_to_sequences(answers)
for i in range(len(tokenized_answers)):
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=maxlen_answers, padding='post')
onehot_answers = utils.to_categorical(padded_answers, VOCAB_SIZE)
decoder_output_data = np.array(onehot_answers)
print(decoder_output_data.shape)
Step 4: Defining Encoder Decoder Model
encoder_inputs = tf.keras.layers.Input(shape=(maxlen_questions,))
encoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(200, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]
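The decoder layers and the training call (Steps 5 and 6) are assumed by the inference code below but are not reproduced above. A sketch of the standard seq2seq setup, using the same 200-unit layer size as the encoder (the batch size and epoch count here are placeholders):
# Decoder: embeds the start-prefixed answers and runs an LSTM initialised
# with the encoder states; a softmax layer predicts the next word.
decoder_inputs = tf.keras.layers.Input(shape=(maxlen_answers,))
decoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(200, return_state=True, return_sequences=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(VOCAB_SIZE, activation='softmax')
output = decoder_dense(decoder_outputs)

# Steps 5-6 (training): compile the full encoder-decoder model and fit it.
model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=50, epochs=150)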
def make_inference_models():
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    decoder_state_input_h = tf.keras.layers.Input(shape=(200,))
    decoder_state_input_c = tf.keras.layers.Input(shape=(200,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    return encoder_model, decoder_model
Step 7: Talking with the Chatbot
Define a method str_to_tokens which converts string questions to integer tokens with padding.
First, we take a question as input and predict the state values using enc_model.
We set the state values in the decoder's LSTM.
Then, we generate a sequence which contains only the 'start' element.
We input this sequence into the dec_model.
We replace the 'start' element with the element which was predicted by the dec_model and update the state values.
We carry out the above steps iteratively till we hit the 'end' tag or the maximum answer length.
def str_to_tokens(sentence: str):
    words = sentence.lower().split()
    tokens_list = list()
    for word in words:
        tokens_list.append(tokenizer.word_index[word])
    return preprocessing.sequence.pad_sequences([tokens_list], maxlen=maxlen_questions, padding='post')

enc_model, dec_model = make_inference_models()

for _ in range(10):
    states_values = enc_model.predict(str_to_tokens(input('Enter question : ')))
    empty_target_seq = np.zeros((1, 1))
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition:
        dec_outputs, h, c = dec_model.predict([empty_target_seq] + states_values)
        sampled_word_index = np.argmax(dec_outputs[0, -1, :])
        sampled_word = None
        for word, index in tokenizer.word_index.items():
            if sampled_word_index == index:
                decoded_translation += ' {}'.format(word)
                sampled_word = word
        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True
        empty_target_seq = np.zeros((1, 1))
        empty_target_seq[0, 0] = sampled_word_index
        states_values = [h, c]
    print(decoded_translation)
Conversion to TFLite
We can convert our seq2seq model to a TensorFlow Lite model so that we can use it on edge devices:
!pip install tf-nightly
converter = tf.lite.TFLiteConverter.from_keras_model(enc_model)
buffer = converter.convert()
open('enc_model.tflite', 'wb').write(buffer)
converter = tf.lite.TFLiteConverter.from_keras_model(dec_model)
buffer = converter.convert()
open('dec_model.tflite', 'wb').write(buffer)
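Once converted, the .tflite files can be run on-device with the TFLite interpreter. A minimal sketch for the encoder only (illustrative; the order of the output tensors and the expected input dtype may differ for your converted model):
# Load the converted encoder and run it on one padded question.
interpreter = tf.lite.Interpreter(model_path='enc_model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

tokens = str_to_tokens('how are you').astype(np.float32)
interpreter.set_tensor(input_details[0]['index'], tokens)
interpreter.invoke()
encoder_states = [interpreter.get_tensor(d['index']) for d in output_details]
print([s.shape for s in encoder_states])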
You can create a new notebook file by clicking on NEW NOTEBOOK, but for now, close the pop-up by clicking Cancel or by clicking the shaded area outside of the pop-up. Another way of creating a new notebook is to click on the File tab (top left) -> New Notebook.
Notice that there are options for opening different files and uploading files; these will be used later.
Create a new Colab notebook following the steps provided in the above section.
Before we start getting into the coding, let’s familiarize ourselves with the user interface (UI) of
Google Colab.
What the different buttons mean:
1. Files: Here you will be able to upload datasets and other files from both your computer and Google Drive.
2. Code Snippets: Here you will find prewritten snippets of code for different functionalities, like adding new libraries or referencing one cell from another.
3. Run Cell: This is the run button. Clicking it will run any code that is in the cell beside it. You can use the shortcut Shift+Enter to run the current cell and move to the next one.
4. Table of Contents: Here you will be able to create and traverse different sections inside your notebook. Sections allow you to organize your code and improve readability.
5. Menu Bar: Like in any other application, this menu bar can be used to manipulate the entire file or add new files. Look over the different tabs and familiarize yourself with the different options. In particular, make sure you know how to upload or open a notebook and download the notebook (all of these options are under "File").
6. File Name: This is the name of your file. You can click on it to change the name. Do not edit the extension (.ipynb) while editing the file name, as this might make your file unopenable.
7. Insert Code Cell: This button will add a code cell below the cell you currently have selected.
8. Insert Text Cell: This button will add a text cell below the cell you currently have selected.
9. Cell: This is the cell, where you can write your code or add text depending on the type of cell it is.
10. Output: This is where the output of your code, including any errors, will be shown.
11. Clear Output: This button will remove the output.
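As a first exercise after inserting a code cell, you can run a quick test and mount your Google Drive so that datasets stored there become accessible (the mount point /content/drive is Colab's default):
# Run this in a Colab code cell: a quick sanity check, then mount Google Drive.
print("Hello from Colab!")

from google.colab import drive
drive.mount('/content/drive')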