
DEPARTMENT OF IT
B V Raju Institute of Technology (UGC Autonomous)
Vishnupur, Narsapur - 502 313, Dist. Medak (T.S.)

DEEP LEARNING LAB MANUAL
(R-20 Syllabus)
IV B.Tech. (IT) I SEM

L T P C
0 0 3 1.5
DEEP LEARNING LAB
COURSE OBJECTIVES:
1. To focus on gathering and pre-processing tabular, visual, textual, and audio data
for building deep learning models using standard Python libraries.
2. To train, improve, and deploy deep learning models on different devices.
3. To analyze the performance of different deep learning models in terms of speed,
accuracy, and size trade-offs.
List of Programs:
Week 1:
Basic image processing operations: Histogram equalization, thresholding, edge
detection, data augmentation, morphological operations
Week 2:
Implement SVM/Softmax classifier for CIFAR-10 dataset: (i) using KNN, (ii)
using 3 layer neural network
Week 3:
Study the effect of batch normalization and dropout in neural network
classifier.
Week 4:
Familiarization of image labelling tools for object detection, segmentation
Week 5:
Image segmentation using Mask RCNN, UNet, SegNet.
Week 6:
Object detection with single-stage and two-stage detectors (Yolo, SSD, FRCNN,
etc.)
Week 7:
Image Captioning with Vanilla RNNs
Week 8:
Image Captioning with LSTMs
Week 9:
Network Visualization: Saliency maps, Class Visualization
Week 10:
Generative Adversarial Networks
Week 11:
Chatbot using bi-directional LSTMs
Week 12:
Familiarization of cloud based computing like Google colab
COURSE OUTCOMES: After completion of the course, the student will be
able to
1. Understand the concepts of Object-Oriented Programming
2. Implement all operations on different linear data structures
3. Develop all operations on different non-linear data structures
4. Apply various searching techniques in real time scenarios
5. Apply various sorting techniques in real time scenarios

Text Book:
1. Deep Learning from Scratch: Building with Python from First Principles,
Paperback, 16 September 2019.

Reference Books:
1. The Art of Computer Programming, Volume 1: Fundamental Algorithms,
Donald E. Knuth.
2. Introduction to Algorithms, Thomas H. Cormen, Charles E. Leiserson,
Ronald L. Rivest, Clifford Stein, The MIT Press.
3. Open Data Structures: An Introduction (Open Paths to Enriched Learning),
Pat Morin, UBC Press.

Week 1: Basic image processing operations: Histogram equalization, thresholding, edge
detection, data augmentation, morphological operations.

Histogram Equalization:

OpenCV has a function to do this, cv2.equalizeHist(). Its input is a grayscale image and its output
is the histogram-equalized image.

# import Opencv
import cv2

# import Numpy
import numpy as np

# read an image as grayscale using imread
img = cv2.imread('F:\\do_nawab.png', 0)

# create a histogram-equalized image using cv2.equalizeHist()
equ = cv2.equalizeHist(img)

# stack the input and output images side-by-side
res = np.hstack((img, equ))

# show input vs output
cv2.imshow('image', res)

cv2.waitKey(0)

Output:

Thresholding:

Thresholding is a technique in OpenCV in which pixel values are assigned in relation to a
provided threshold value. In thresholding, each pixel value is compared with the threshold value. If
the pixel value is smaller than the threshold, it is set to 0; otherwise, it is set to a maximum value
(generally 255). Thresholding is a very popular segmentation technique, used for separating an
object considered as foreground from its background. A threshold is a value with two
regions on either side of it, i.e. below the threshold and above the threshold.

If f (x, y) < T
then f (x, y) = 0
else
f (x, y) = 255

where,
f (x, y) = Coordinate Pixel Value
T = Threshold Value.

In OpenCV with Python, the function cv2.threshold is used for thresholding.

The different Simple Thresholding Techniques are:

cv2.THRESH_BINARY: If the pixel intensity is greater than the set threshold, the value is set to 255,
else it is set to 0 (black).
cv2.THRESH_BINARY_INV: Inverted or opposite case of cv2.THRESH_BINARY.
cv2.THRESH_TRUNC: If the pixel intensity is greater than the threshold, it is truncated to the
threshold, i.e. set to the threshold value. All other values remain the same.
cv2.THRESH_TOZERO: The pixel intensity is set to 0 for all pixels with intensity less than the
threshold value.
cv2.THRESH_TOZERO_INV: Inverted or opposite case of cv2.THRESH_TOZERO.

# Python program to illustrate
# simple thresholding on an image
# organizing imports
import cv2
import numpy as np

# path to input image is specified and
# image is loaded with the imread command
image1 = cv2.imread('input1.jpg')

# cv2.cvtColor is applied over the
# image input with applied parameters
# to convert the image to grayscale
img = cv2.cvtColor(image1, cv2.COLOR_BGR2GRAY)

# applying different thresholding
# techniques on the input image;
# all pixel values above 120 will
# be set to 255
ret, thresh1 = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)
ret, thresh2 = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY_INV)
ret, thresh3 = cv2.threshold(img, 120, 255, cv2.THRESH_TRUNC)
ret, thresh4 = cv2.threshold(img, 120, 255, cv2.THRESH_TOZERO)
ret, thresh5 = cv2.threshold(img, 120, 255, cv2.THRESH_TOZERO_INV)

# the windows showing the output images
# with the corresponding thresholding
# techniques applied to the input image
cv2.imshow('Binary Threshold', thresh1)
cv2.imshow('Binary Threshold Inverted', thresh2)
cv2.imshow('Truncated Threshold', thresh3)
cv2.imshow('Set to 0', thresh4)
cv2.imshow('Set to 0 Inverted', thresh5)

# De-allocate any associated memory usage
if cv2.waitKey(0) & 0xff == 27:
    cv2.destroyAllWindows()

Output:
Edge Detection

Edge detection is an image processing technique that uses mathematical methods to find
edges in a digital image. Edge detection works by running a filter/kernel over the digital
image and detecting discontinuities in image regions, such as sharp changes in the brightness/intensity
values of pixels. There are two forms of edge detection:

Search-based edge detection (first-order derivative)
Zero-crossing-based edge detection (second-order derivative)

Some of the commonly known edge detection methods are:

Laplacian operator, or Laplacian-based edge detection (second-order derivative)
Canny edge detector (first-order derivative)
Prewitt operator (first-order derivative)
Sobel operator (first-order derivative)
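
For comparison with the PIL example that follows, here is a minimal OpenCV sketch of the Sobel and Canny operators; the file name 'Sample.png' and the threshold values are placeholders for illustration, not part of the original program:

import cv2
import numpy as np

# read the image directly as grayscale
img = cv2.imread('Sample.png', 0)

# Sobel: first-order derivatives along x and y, combined into a gradient magnitude
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
sobel = cv2.magnitude(sobel_x, sobel_y)

# Canny: hysteresis thresholding with example lower/upper thresholds
canny = cv2.Canny(img, 100, 200)

cv2.imshow('Sobel', np.uint8(np.clip(sobel, 0, 255)))
cv2.imshow('Canny', canny)
cv2.waitKey(0)
cv2.destroyAllWindows()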

pip install pillow

from PIL import Image, ImageFilter

# Opening the image (r prefixed to the string
# in order to deal with '\' in paths)
image = Image.open(r"Sample.png")
# Converting the image to grayscale, as edge detection
# requires input image to be of mode = Grayscale (L)
image = image.convert("L")

# Detecting edges on the image using the argument ImageFilter.FIND_EDGES
image = image.filter(ImageFilter.FIND_EDGES)

# Saving the image under the name Edge_Sample.png
image.save(r"Edge_Sample.png")

Output:
Sample Image:

Morphological Operations:

Morphological operations are used to extract image components that are useful in the
representation and description of region shape.
They need two data sources: one is the input image, and the second is called the structuring
element (or kernel). Morphological operators take an input image and a structuring element as input,
and these are then combined using set operators.

Syntax: cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)


Parameters:
-> image: Input Image array.
-> cv2.MORPH_OPEN: Applying the Morphological Opening operation.
-> kernel: Structuring element.
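
Before the webcam-based program below, here is a minimal sketch of the basic morphological operators on a static image; the file name 'input1.jpg' is a placeholder, any grayscale or binary image will do:

import cv2
import numpy as np

img = cv2.imread('input1.jpg', 0)
kernel = np.ones((5, 5), np.uint8)   # structuring element

erosion = cv2.erode(img, kernel, iterations=1)             # shrinks bright regions
dilation = cv2.dilate(img, kernel, iterations=1)           # grows bright regions
opening = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)    # erosion followed by dilation
closing = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)   # dilation followed by erosion

cv2.imshow('Erosion', erosion)
cv2.imshow('Dilation', dilation)
cv2.imshow('Opening', opening)
cv2.imshow('Closing', closing)
cv2.waitKey(0)
cv2.destroyAllWindows()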

# Python program to illustrate
# the opening morphological operation
# on an image

# organizing imports
import cv2
import numpy as np

# return video from the first webcam on your computer
screenRead = cv2.VideoCapture(0)

# loop runs if capturing has been initialized
while(1):
    # read frames from the camera
    _, image = screenRead.read()

    # convert to HSV color space (OpenCV reads colors as BGR)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

    # defining the range of masking
    blue1 = np.array([110, 50, 50])
    blue2 = np.array([130, 255, 255])

    # initializing the mask to be applied over the input image
    mask = cv2.inRange(hsv, blue1, blue2)

    # passing bitwise_and over each pixel
    res = cv2.bitwise_and(image, image, mask = mask)

    # defining the kernel, i.e. the structuring element
    kernel = np.ones((5, 5), np.uint8)

    # applying the opening operation over the mask with the structuring element
    opening = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

    # the mask and the opening operation are shown in the windows
    cv2.imshow('Mask', mask)
    cv2.imshow('Opening', opening)

    # Wait for the 'a' key to stop the program
    if cv2.waitKey(1) & 0xFF == ord('a'):
        break

# De-allocate any associated memory usage
cv2.destroyAllWindows()

# Close the window / Release webcam
screenRead.release()

Output:

Data Augmentation:

Data augmentation is the process of increasing the amount and diversity of data. We do not collect
new data, rather we transform the already present data.

1. Need for data augmentation

Data augmentation is an integral part of deep learning: deep models need large
amounts of data, and in some cases it is not feasible to collect thousands or millions of images, so
data augmentation comes to the rescue.

It helps us increase the size of the dataset and introduce variability into the dataset.

2. Operations in data augmentation

The most commonly used operations are:

 Rotation
 Shearing
 Zooming
 Cropping
 Flipping
 Changing the brightness level

Data augmentation in Keras

Keras is a high-level machine learning framework built on top of TensorFlow.
We can perform data augmentation by using the ImageDataGenerator class.
It takes in various arguments such as rotation_range, brightness_range, shear_range, zoom_range,
etc.
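
A minimal sketch of this, assuming TensorFlow 2.x and a placeholder image file 'input1.jpg'; the argument values are illustrative only:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array

# example augmentation parameters
datagen = ImageDataGenerator(
    rotation_range=30,
    brightness_range=(0.5, 1.5),
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)

# load one image and add a batch dimension: (1, height, width, channels)
img = img_to_array(load_img('input1.jpg'))
batch = np.expand_dims(img, axis=0)

# generate and save 5 augmented samples to the current directory
gen = datagen.flow(batch, batch_size=1, save_to_dir='.', save_prefix='aug', save_format='jpg')
for _ in range(5):
    next(gen)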

Data augmentation using Augmentor

# Importing the necessary library
import Augmentor

# Passing the path of the image directory
p = Augmentor.Pipeline("image_folder")

# Defining augmentation parameters and generating 5 samples
p.flip_left_right(0.5)
p.black_and_white(0.1)
p.rotate(0.3, 10, 10)
p.skew(0.4, 0.5)
p.zoom(probability = 0.2, min_factor = 1.1, max_factor = 1.5)
p.sample(5)

Output:

Week 2: Implement SVM/Softmax classifier for CIFAR-10 dataset: (i) using KNN, (ii) using
3-layer neural network.

softmax.py

import numpy as np
class Softmax (object):
""" Softmax classifier """

def __init__ (self, inputDim, outputDim):


self.W = None
#########################################################################
# TODO: 5 points #
# - Generate a random softmax weight matrix to use to compute loss. #
# with standard normal distribution and Standard deviation = 0.01. #
#########################################################################
sigma =0.01
self.W = sigma * np.random.randn(inputDim,outputDim)

def calLoss (self, x, y, reg):


"""
Softmax loss function
D: Input dimension.
C: Number of Classes.
N: Number of example.
Inputs:
- x: A numpy array of shape (batchSize, D).
- y: A numpy array of shape (N,) where value < C.
- reg: (float) regularization strength.
Returns a tuple of:
- loss as single float.
- gradient with respect to weights self.W (dW) with the same shape of self.W.
"""
loss = 0.0

dW = np.zeros_like(self.W)

#############################################################################
# TODO: 20 points #
# - Compute the softmax loss and store to loss variable. #
# - Compute gradient and store to dW variable. #
# - Use L2 regularization #
# Bonus: #
# - +2 points if done without loop

#############################################################################
#Calculating loss for softmax
#calculate the score matrix
N = x.shape[0]
s =x.dot(self.W)
# calculating s-max(s)
s_ = s-np.max(s, axis=1, keepdims= True)
exp_s_ = np.exp(s_)
# calculating base
sum_f = np.sum(exp_s_, axis=1, keepdims=True)
# calculating probability of incorrect label by dividing by base
p = exp_s_/sum_f
p_yi= p[np.arange(N),y]
# Calculating loss by applying log over the probability
loss_i = - np.log(p_yi)
#keep as column vector
#TODO: add regularization
loss = np.sum(loss_i)/N
loss += reg * np.sum(self.W*self.W)
ds = p.copy()
ds[np.arange(x.shape[0]),y] += -1
dW = (x.T).dot(ds)/N
dW = dW + (2* reg* self.W)

return loss, dW

def train (self, x, y, lr=1e-3, reg=1e-5, iter=100, batchSize=200, verbose=False):


"""
Train this Softmax classifier using stochastic gradient descent.
D: Input dimension.
C: Number of Classes.
N: Number of example.
Inputs:
- x: training data of shape (N, D)
- y: output data of shape (N, ) where value < C
- lr: (float) learning rate for optimization.
- reg: (float) regularization strength.
- iter: (integer) total number of iterations.
- batchSize: (integer) number of example in each batch running.
- verbose: (boolean) Print log of loss and training accuracy.
Outputs:
A list containing the value of the loss function at each training iteration.
"""

# Run stochastic gradient descent to optimize W.


lossHistory = []
for i in range(iter):
xBatch = None
yBatch = None
#########################################################################
# TODO: 10 points #
# - Sample batchSize from training data and save to xBatch and yBatch #
# - After sampling xBatch should have shape (D, batchSize) #
# yBatch (batchSize, ) #
# - Use that sample for gradient decent optimization. #
# - Update the weights using the gradient and the learning rate. #
# #
# Hint: #
# - Use np.random.choice #
#########################################################################
num_train = np.random.choice(x.shape[0], batchSize)
xBatch = x[num_train]
yBatch = y[num_train]
loss, dW = self.calLoss(xBatch,yBatch,reg)
self.W= self.W - lr * dW
lossHistory.append(loss)
# Print loss for every 100 iterations
if verbose and i % 100 == 0 and len(lossHistory) != 0:
print ('Loop {0} loss {1}'.format(i, lossHistory[i]))

return lossHistory

def predict (self, x,):


"""
Predict the y output.
Inputs:
- x: training data of shape (N, D)
Returns:
- yPred: output data of shape (N, ) where value < C
"""
yPred = np.zeros(x.shape[0])
###########################################################################
# TODO: 5 points #
# - Store the predict output in yPred #
###########################################################################
s =x.dot(self.W)
yPred = np.argmax(s, axis=1)

return yPred

def calAccuracy (self, x, y):


acc = 0
###########################################################################
# TODO: 5 points #
# - Calculate accuracy of the predict value and store to acc variable #
###########################################################################
yPred = self.predict(x)
acc = np.mean(y == yPred)*100
return acc

runsvmsoftmax.py

import os
import time
import numpy as np

# Library for plot the output and save to file


import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt

# Load the CIFAR10 dataset


from keras.datasets import cifar10
baseDir = os.path.dirname(os.path.abspath('__file__')) + '/'
classesName = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
(xTrain, yTrain), (xTest, yTest) = cifar10.load_data()
# use the builtin float (np.float was removed in recent NumPy versions)
xVal = xTrain[49000:, :].astype(float)
yVal = np.squeeze(yTrain[49000:, :])
xTrain = xTrain[:49000, :].astype(float)
yTrain = np.squeeze(yTrain[:49000, :])
yTest = np.squeeze(yTest)
xTest = xTest.astype(float)

# Show dimension for each variable


print ('Train image shape: {0}'.format(xTrain.shape))
print ('Train label shape: {0}'.format(yTrain.shape))
print ('Validate image shape: {0}'.format(xVal.shape))
print ('Validate label shape: {0}'.format(yVal.shape))
print ('Test image shape: {0}'.format(xTest.shape))
print ('Test label shape: {0}'.format(yTest.shape))

# Show some CIFAR10 images


plt.subplot(221)
plt.imshow(xTrain[0])
plt.axis('off')
plt.title(classesName[yTrain[0]])
plt.subplot(222)
plt.imshow(xTrain[1])
plt.axis('off')
plt.title(classesName[yTrain[1]])
plt.subplot(223)
plt.imshow(xVal[0])
plt.axis('off')
plt.title(classesName[yVal[0]])
plt.subplot(224)
plt.imshow(xTest[0])
plt.axis('off')
plt.title(classesName[yTest[0]])
plt.savefig(baseDir+'svm0.png')
plt.clf()

# Pre processing data


# Normalize the data by subtract the mean image
meanImage = np.mean(xTrain, axis=0)
xTrain -= meanImage
xVal -= meanImage
xTest -= meanImage

# Reshape data from channel to rows


xTrain = np.reshape(xTrain, (xTrain.shape[0], -1))
xVal = np.reshape(xVal, (xVal.shape[0], -1))
xTest = np.reshape(xTest, (xTest.shape[0], -1))

# Add bias dimension columns


xTrain = np.hstack([xTrain, np.ones((xTrain.shape[0], 1))])
xVal = np.hstack([xVal, np.ones((xVal.shape[0], 1))])
xTest = np.hstack([xTest, np.ones((xTest.shape[0], 1))])

print ('Train image shape after add bias column: {0}'.format(xTrain.shape))


print ('Val image shape after add bias column: {0}'.format(xVal.shape))
print ('Test image shape after add bias column: {0}'.format(xTest.shape))
print ('\n' + '#' * 94)

##############################################################################
# SVM CLASSIFIER #
##############################################################################
from svm import Svm
numClasses = np.max(yTrain) + 1

print ('Start training Svm classifier')

classifier = Svm(xTrain.shape[1], numClasses)

# Show weight for each class before training


if classifier.W is not None:
tmpW = classifier.W[:-1, :]
tmpW = tmpW.reshape(32, 32, 3, 10)
tmpWMin, tmpWMax = np.min(tmpW), np.max(tmpW)
for i in range(numClasses):
plt.subplot(2, 5, i+1)
plt.title(classesName[i])
wPlot = 255.0 * (tmpW[:, :, :, i].squeeze() - tmpWMin) / (tmpWMax - tmpWMin)
plt.imshow(wPlot.astype('uint8'))
plt.savefig(baseDir+'svm1.png')
plt.clf()

# Training classifier
startTime = time.time()
classifier.train(xTrain, yTrain, lr=1e-7, reg=5e4, iter=1500 ,verbose=True)
print ('Training time: {0}'.format(time.time() - startTime))

# Calculate accuracy (Should get around this)


# Training acc: 37.61%
# Validating acc: 37.0%
# Testing acc: 37.38%
print ('Training acc: {0}%'.format(classifier.calAccuracy(xTrain, yTrain)))
print ('Validating acc: {0}%'.format(classifier.calAccuracy(xVal, yVal)))
print ('Testing acc: {0}%'.format(classifier.calAccuracy(xTest, yTest)))

# Show some weight for each class after training


if classifier.W is not None:
tmpW = classifier.W[:-1, :]
tmpW = tmpW.reshape(32, 32, 3, 10)
tmpWMin, tmpWMax = np.min(tmpW), np.max(tmpW)
for i in range(numClasses):
plt.subplot(2, 5, i+1)
plt.title(classesName[i])
# Scale weight to 0 - 255
wPlot = 255.0 * (tmpW[:, :, :, i].squeeze() - tmpWMin) / (tmpWMax - tmpWMin)
plt.imshow(wPlot.astype('uint8'))
plt.savefig(baseDir+'svm2.png')
plt.clf()

# Tuneup hyper parameters (regularization strength, learning rate) by using validation data set,
# and random search technique to find the best set of parameters.
learn_rates = [0.5e-7, 1e-7, 2e-7, 6e-7]
reg_strengths = [500,5000,18000]
bestParameters = [0, 0]
bestAcc = -1
bestModel = None
print ('\nFinding best model for Svm classifier')

###############################################################################
#
# TODO: 5 points #
# Tuneup hyper parameters by using validation set. #
# - Store the best variables in parameters #
# - Store the best model in bestSoftmax #
# - Store the best accuracy in bestAcc #
###############################################################################
#
for rs in reg_strengths:
for lr in learn_rates:
#print(str(lr)+" "+str(rs))
classifier = Svm(xTrain.shape[1], numClasses)
classifier.train(xTrain, yTrain, lr, rs, iter=1500 ,verbose=False)
valAcc = classifier.calAccuracy(xVal, yVal)
if valAcc > bestAcc:
bestAcc = valAcc
bestModel = classifier
bestParameters = [lr,rs]

pass

print ('Best validation accuracy: {0}'.format(bestAcc))

# Predict with best model


if bestModel is not None:
print ('Best Model parameter, lr = {0}, reg = {1}'.format(bestParameters[0], bestParameters[1]))
print ('Training acc: {0}%'.format(bestModel.calAccuracy(xTrain, yTrain)))
print ('Validating acc: {0}%'.format(bestModel.calAccuracy(xVal, yVal)))
print ('Testing acc: {0}%'.format(bestModel.calAccuracy(xTest, yTest)))
print ('\n' + '#' * 94)
##############################################################################
# END OF SVM CLASSIFIER #
##############################################################################

##############################################################################
# SOFTMAX CLASSIFIER #
##############################################################################
from softmax import Softmax
numClasses = np.max(yTrain) + 1
print ('Start training Softmax classifier')
classifier = Softmax(xTrain.shape[1], numClasses)

# Show weight for each class before training


if classifier.W is not None:
tmpW = classifier.W[:-1, :]
tmpW = tmpW.reshape(32, 32, 3, 10)
tmpWMin, tmpWMax = np.min(tmpW), np.max(tmpW)
for i in range(numClasses):
plt.subplot(2, 5, i+1)
plt.title(classesName[i])
wPlot = 255.0 * (tmpW[:, :, :, i].squeeze() - tmpWMin) / (tmpWMax - tmpWMin)
plt.imshow(wPlot.astype('uint8'))
plt.savefig(baseDir+'softmax1.png')
plt.clf()

# Training classifier
startTime = time.time()
classifier.train(xTrain, yTrain, lr=1e-7, reg=5e4, iter=1500 ,verbose=True)
print ('Training time: {0}'.format(time.time() - startTime))

# Calculate accuracy (Should get around this)


# Training acc: 33.03%
# Validating acc: 33.7%
# Testing acc: 33.12%
print ('Training acc: {0}%'.format(classifier.calAccuracy(xTrain, yTrain)))
print ('Validating acc: {0}%'.format(classifier.calAccuracy(xVal, yVal)))
print ('Testing acc: {0}%'.format(classifier.calAccuracy(xTest, yTest)))

# Show some weight for each class after training


if classifier.W is not None:
tmpW = classifier.W[:-1, :]
tmpW = tmpW.reshape(32, 32, 3, 10)
tmpWMin, tmpWMax = np.min(tmpW), np.max(tmpW)
for i in range(numClasses):
plt.subplot(2, 5, i+1)
plt.title(classesName[i])
# Range 0 - 255
wPlot = 255.0 * (tmpW[:, :, :, i].squeeze() - tmpWMin) / (tmpWMax - tmpWMin)
plt.imshow(wPlot.astype('uint8'))
plt.savefig(baseDir+'softmax2.png')
plt.clf()

# Tuneup hyper parameters (regularization strength, learning rate) by using validation data set,
# and random search technique to find the best set of parameters.
learn_rates = [0.5e-7,4e-7, 8e-7]
reg_strengths = [ 500,1500,7500,12000]
bestParameters = [0, 0]
bestAcc = -1
bestModel = None
print ('\nFinding best model for Softmax classifier')
###############################################################################
#
# TODO: 5 points
# Tuneup hyper parameters by using validation set.
# - Store the best variables in parameters
# - Store the best model in bestSoftmax
# - Store the best accuracy in bestAcc
###############################################################################
#
for rs in reg_strengths:
for lr in learn_rates:
classifier = Softmax(xTrain.shape[1], numClasses)
classifier.train(xTrain, yTrain, lr, rs, iter=1500 ,verbose=False)
valAcc = classifier.calAccuracy(xVal, yVal)
if valAcc > bestAcc:
bestAcc = valAcc
bestModel = classifier
bestParameters = [lr,rs]

print ('Best validation accuracy: {0}'.format(bestAcc))

# Predict with best model


if bestModel is not None:
print ('Best Model parameter, lr = {0}, reg = {1}'.format(bestParameters[0], bestParameters[1]))
print ('Training acc: {0}%'.format(bestModel.calAccuracy(xTrain, yTrain)))
print ('Validating acc: {0}%'.format(bestModel.calAccuracy(xVal, yVal)))
print ('Testing acc: {0}%'.format(bestModel.calAccuracy(xTest, yTest)))
##############################################################################
# END OF SOFTMAX CLASSIFIER #
##############################################################################
svm.py

import numpy as np
class Svm (object):
""" Svm classifier """

def __init__ (self, inputDim, outputDim):


self.W = None
#########################################################################
# TODO: 5 points #
# - Generate a random svm weight matrix to compute loss #
# with standard normal distribution and Standard deviation = 0.01. #
#########################################################################
sigma =0.01
self.W = sigma * np.random.randn(inputDim,outputDim)

def calLoss (self, x, y, reg):
"""
Svm loss function
D: Input dimension.
C: Number of Classes.
N: Number of example.
Inputs:
- x: A numpy array of shape (batchSize, D).
- y: A numpy array of shape (N,) where value < C.
- reg: (float) regularization strength.
Returns a tuple of:
- loss as single float.
- gradient with respect to weights self.W (dW) with the same shape of self.W.
"""
loss = 0.0
dW = np.zeros_like(self.W)

#############################################################################
# TODO: 20 points #
# - Compute the svm loss and store to loss variable. #
# - Compute gradient and store to dW variable. #
# - Use L2 regularization #
# Bonus: #
# - +2 points if done without loop #

#############################################################################
#Calculating score matrix
s = x.dot(self.W)
#Score with yi
s_yi = s[np.arange(x.shape[0]),y]
#finding the delta
delta = s- s_yi[:,np.newaxis]+1
#loss for samples
loss_i = np.maximum(0,delta)
loss_i[np.arange(x.shape[0]),y]=0
loss = np.sum(loss_i)/x.shape[0]
#Loss with regularization
loss += reg*np.sum(self.W*self.W)
#Calculating ds
ds = np.zeros_like(delta)
ds[delta > 0] = 1
ds[np.arange(x.shape[0]),y] = 0
ds[np.arange(x.shape[0]),y] = -np.sum(ds, axis=1)

dW = (1/x.shape[0]) * (x.T).dot(ds)
dW = dW + (2* reg* self.W)

return loss, dW
def train (self, x, y, lr=1e-3, reg=1e-5, iter=100, batchSize=200, verbose=False):
"""
Train this Svm classifier using stochastic gradient descent.
D: Input dimension.
C: Number of Classes.
N: Number of example.
Inputs:
- x: training data of shape (N, D)
- y: output data of shape (N, ) where value < C
- lr: (float) learning rate for optimization.
- reg: (float) regularization strength.
- iter: (integer) total number of iterations.
- batchSize: (integer) number of example in each batch running.
- verbose: (boolean) Print log of loss and training accuracy.
Outputs:
A list containing the value of the loss at each training iteration.
"""

# Run stochastic gradient descent to optimize W.


lossHistory = []
for i in range(iter):
xBatch = None
yBatch = None
#########################################################################
# TODO: 10 points #
# - Sample batchSize from training data and save to xBatch and yBatch #
# - After sampling xBatch should have shape (batchSize, D) #
# yBatch (batchSize, ) #
# - Use that sample for gradient decent optimization. #
# - Update the weights using the gradient and the learning rate. #
# #
# Hint: #
# - Use np.random.choice #
#########################################################################
#creating batch
num_train = np.random.choice(x.shape[0], batchSize)
xBatch = x[num_train]
yBatch = y[num_train]
loss, dW = self.calLoss(xBatch,yBatch,reg)
self.W= self.W - lr * dW
lossHistory.append(loss)

# Print loss for every 100 iterations


if verbose and i % 100 == 0 and len(lossHistory) != 0:
print ('Loop {0} loss {1}'.format(i, lossHistory[i]))

return lossHistory

def predict (self, x,):


"""
Predict the y output.
Inputs:
- x: training data of shape (N, D)
Returns:
- yPred: output data of shape (N, ) where value < C
"""
yPred = np.zeros(x.shape[0])
###########################################################################
# TODO: 5 points #
# - Store the predict output in yPred #
###########################################################################
s = x.dot(self.W)
yPred = np.argmax(s, axis=1)

return yPred

def calAccuracy (self, x, y):


acc = 0
###########################################################################
# TODO: 5 points #
# - Calculate accuracy of the predict value and store to acc variable #
###########################################################################
yPred = self.predict(x)
acc = np.mean(y == yPred)*100

return acc

Building KNN model:

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

from __future__ import print_function

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
%load_ext autoreload
%autoreload 2

# Define function for inspecting the source code of a function


import inspect
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import Terminal256Formatter

def pretty_print(func):
    source_code = inspect.getsourcelines(func)[0]
    for line in source_code:
        print(highlight(line.strip('\n'), PythonLexer(), Terminal256Formatter()), end='')
    print('')
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except:
    pass

X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

# Visualize some examples from the dataset.


# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()
# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape the image data into rows


X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

from cs231n.classifiers import KNearestNeighbor

# Create a kNN classifier instance.


# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

We would now like to classify the test data with the kNN classifier. Recall that we can break down
this process into two steps:
1. First we must compute the distances between all test examples and all train examples.
2. Given these distances, for each test example we find the k nearest examples and have them
vote for the label

# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.
# Print out implementation
pretty_print(classifier.compute_distances_two_loops)

# Test your implementation:


dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i, j] = np.linalg.norm(X[i] - self.X_train[j])

    return dists

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()
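
The second step, the vote over the k nearest neighbours, is implemented in predict_labels inside cs231n/classifiers/k_nearest_neighbor.py. A minimal sketch of that idea, assuming the dists matrix computed above and integer training labels stored in self.y_train:

def predict_labels(self, dists, k=1):
    """
    Given a (num_test, num_train) matrix of distances, predict a label for
    each test point by a majority vote over its k nearest neighbours.
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # labels of the k training points closest to the ith test point
        closest_y = self.y_train[np.argsort(dists[i])[:k]]
        # majority vote; np.bincount + argmax breaks ties toward the smaller label
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred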

Week 3: Study the effect of batch normalization and dropout in neural network classifier.

Batch Normalization:

A batch normalization layer looks at each batch as it comes in, first normalizing the batch with its
own mean and standard deviation, and then also putting the data on a new scale with two trainable
rescaling parameters. Batchnorm, in effect, performs a kind of coordinated rescaling of its inputs.
Most often, batchnorm is added as an aid to the optimization process (though it can sometimes also
help prediction performance). Models with batchnorm tend to need fewer epochs to complete
training. Moreover, batchnorm can also fix various problems that can cause the training to get
"stuck".

Adding Batch Normalization:

Batch normalization can be used at almost any point in a network. You can put it after a layer:
layers.Dense(16, activation='relu'),
layers.BatchNormalization(),

or between a layer and its activation function:

layers.Dense(16),
layers.BatchNormalization(),
layers.Activation('relu'),

Drop Out:

Overfitting is caused by the network learning spurious patterns in the training data. To recognize
these spurious patterns, a network will often rely on very specific combinations of weights, a kind
of "conspiracy" of weights. Being so specific, they tend to be fragile: remove one and the
conspiracy falls apart.

This is the idea behind dropout. To break up these conspiracies, we randomly drop out some
fraction of a layer's input units every step of training, making it much harder for the network to
learn those spurious patterns in the training data. Instead, it has to search for broad, general
patterns, whose weight patterns tend to be more robust.

Adding Dropout:

In Keras, the rate argument defines what fraction of the input units to shut off. Put
the Dropout layer just before the layer you want the dropout applied to:
keras.Sequential([
# ...
layers.Dropout(rate=0.3), # apply 30% dropout to the next layer
layers.Dense(16),
# ...
])

Example - Using Dropout and Batch Normalization:

import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
layers.Dense(1024, activation='relu', input_shape=[11]),
layers.Dropout(0.3),
layers.BatchNormalization(),
layers.Dense(1024, activation='relu'),
layers.Dropout(0.3),
layers.BatchNormalization(),
layers.Dense(1024, activation='relu'),
layers.Dropout(0.3),
layers.BatchNormalization(),
layers.Dense(1),
])

model.compile(
optimizer='adam',
loss='mae',
)

history = model.fit(
X_train, y_train,
validation_data=(X_valid, y_valid),
batch_size=256,
epochs=100,
verbose=0,
)
# Show the learning curves
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
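
To actually study the effect of the two layers, it helps to train a baseline of the same size without Dropout and BatchNormalization and compare the learning curves. A minimal sketch under the same assumptions (X_train, y_train, X_valid, y_valid already defined, and the model above already trained as `history`):

# baseline network of the same width and depth, but without Dropout / BatchNormalization
baseline = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dense(1024, activation='relu'),
    layers.Dense(1024, activation='relu'),
    layers.Dense(1),
])
baseline.compile(optimizer='adam', loss='mae')

baseline_history = baseline.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)

# plot both validation curves on one figure to compare convergence and overfitting
pd.DataFrame({
    'val_loss_baseline': baseline_history.history['val_loss'],
    'val_loss_dropout_bn': history.history['val_loss'],
}).plot();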

Week 4: Familiarization of image labelling tools for object detection, segmentation.

This week introduces image segmentation and its different techniques, such as region-based segmentation,
edge-detection-based segmentation, and clustering-based segmentation.

Mask R-CNN:

Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is widely used for object
detection tasks. For a given image, it returns the class label and bounding box coordinates for each
object in the image.

Segmentation Mask:
Once we have the RoIs based on the IoU values, we can add a mask branch to the existing
architecture. This returns the segmentation mask for each region that contains an object. It returns a
mask of size 28 X 28 for each region which is then scaled up for inference.

The Segmentation mask for this image would be:

Steps to implement Mask R-CNN:


Step 1: Clone the repository
First, we will clone the mask rcnn repository which has the architecture for Mask R-CNN. Use the
following command to clone the repository:

git clone https://github.com/matterport/Mask_RCNN.git

Once this is done, we need to install the dependencies required by Mask R-CNN.

Step 2: Install the dependencies
Here is a list of all the dependencies for Mask R-CNN:

numpy
scipy
Pillow
cython
matplotlib
scikit-image
tensorflow>=1.3.0
keras>=2.0.8
opencv-python
h5py
imgaug
IPython

Step 3: Download the pre-trained weights (trained on MS COCO)


We need to download the pretrained weights. These weights are obtained from a model that was
trained on the MS COCO dataset. Once you have downloaded the weights, paste the file in the
samples folder of the Mask_RCNN repository that we cloned in step 1.

Step 4: Predicting for our image


Finally, we will use the Mask R-CNN architecture and the pretrained weights to generate
predictions for our own images.

Implementing Mask R-CNN in Python:

import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt

# Root directory of the project


ROOT_DIR = os.path.abspath("../")

import warnings
warnings.filterwarnings("ignore")

# Import Mask RCNN


sys.path.append(ROOT_DIR) # To find local version of the library
from mrcnn import utils
import mrcnn.model as modellib
from mrcnn import visualize
# Import COCO config
sys.path.append(os.path.join(ROOT_DIR, "samples/coco/")) # To find local version
import coco
%matplotlib inline

# Directory to save logs and trained model


MODEL_DIR = os.path.join(ROOT_DIR, "logs")

# Local path to trained weights file


COCO_MODEL_PATH = os.path.join('', "mask_rcnn_coco.h5")

# Download COCO trained weights from Releases if needed


if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)

# Directory of images to run detection on


IMAGE_DIR = os.path.join(ROOT_DIR, "images")

class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()

Loading Weights:
Next, we will create our model and load the pretrained weights which we downloaded earlier.
Make sure that the pretrained weights are in the same folder as that of the notebook otherwise you
have to give the location of the weights file:

# Create model object in inference mode.


model = modellib.MaskRCNN(mode="inference", model_dir='mask_rcnn_coco.hy',
config=config)

# Load weights trained on MS-COCO


model.load_weights('mask_rcnn_coco.h5', by_name=True)

Now, we will define the classes of the COCO dataset which will help us in the prediction phase:

# COCO Class names


class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
'kite', 'baseball bat', 'baseball glove', 'skateboard',
'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
'teddy bear', 'hair drier', 'toothbrush']

Let’s load an image and try to see how the model performs. You can use any of your images to test
the model.

# Load a random image from the images folder


image = skimage.io.imread('sample.jpg')

# original image
plt.figure(figsize=(12,10))
skimage.io.imshow(image)

Making Predictions

# Run detection
results = model.detect([image], verbose=1)

# Visualize results
r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'], class_names, r['scores'])

I will first take all the masks predicted by our model and store them in the mask variable. Now,
these masks are in the boolean form (True and False) and hence we need to convert them to
numbers (1 and 0). Let’s do that first:

mask = r['masks']
mask = mask.astype(int)
mask.shape

Output:

(480,640,3)

This will give us an array of 0s and 1s, where 0 means that there is no object at that particular pixel
and 1 means that there is an object at that pixel.

To print or get each segment from the image, we will create a for loop and multiply each mask
with the original image to get each segment:

for i in range(mask.shape[2]):
    temp = skimage.io.imread('sample.jpg')
    for j in range(temp.shape[2]):
        temp[:,:,j] = temp[:,:,j] * mask[:,:,i]
    plt.figure(figsize=(8,8))
    plt.imshow(temp)

Week 5: Image segmentation using Mask RCNN, UNet, SegNet.
Mask R-CNN (Region-based Convolutional Neural Network) is an instance segmentation model. In
this exercise, we implement it in Python with the help of the OpenCV library and the mrcnn
package: the code below loads a pretrained COCO model and visualizes its predicted masks and
bounding boxes.
import cv2
import os
import numpy as np
import random
import colorsys
import argparse
import time
from mrcnn import model as modellib
from mrcnn import visualize
from samples.coco.coco import CocoConfig
import matplotlib
class MyConfig(CocoConfig):
    NAME = "my_coco_inference"
    # Set batch size to 1 since we'll be running inference on one image at a time.
    # Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

def prepare_mrcnn_model(model_path, model_name, class_names, my_config):
    classes = open(class_names).read().strip().split("\n")
    print("No. of classes", len(classes))
    hsv = [(i / len(classes), 1, 1.0) for i in range(len(classes))]
    COLORS = list(map(lambda c: colorsys.hsv_to_rgb(*c), hsv))
    random.seed(42)
    random.shuffle(COLORS)
    model = modellib.MaskRCNN(mode="inference", model_dir=model_path, config=my_config)
    model.load_weights(model_name, by_name=True)
    return COLORS, model, classes

def custom_visualize(test_image, model, colors, classes, draw_bbox, mrcnn_visualize,
                     instance_segmentation):
    detections = model.detect([test_image], verbose=1)[0]
    if mrcnn_visualize:
        matplotlib.use('TkAgg')
        visualize.display_instances(test_image, detections['rois'], detections['masks'],
                                    detections['class_ids'], classes, detections['scores'])
        return
    if instance_segmentation:
        hsv = [(i / len(detections['rois']), 1, 1.0) for i in range(len(detections['rois']))]
        colors = list(map(lambda c: colorsys.hsv_to_rgb(*c), hsv))
        random.seed(42)
        random.shuffle(colors)
    for i in range(0, detections["rois"].shape[0]):
        classID = detections["class_ids"][i]
        mask = detections["masks"][:, :, i]
        if instance_segmentation:
            color = colors[i][::-1]
        else:
            color = colors[classID][::-1]
        # To visualize the pixel-wise mask of the object
        test_image = visualize.apply_mask(test_image, mask, color, alpha=0.5)
    test_image = cv2.cvtColor(test_image, cv2.COLOR_RGB2BGR)
    if draw_bbox:
        for i in range(0, len(detections["scores"])):
            (startY, startX, endY, endX) = detections["rois"][i]
            classID = detections["class_ids"][i]
            label = classes[classID]
            score = detections["scores"][i]
            if instance_segmentation:
                color = [int(c) for c in np.array(colors[i]) * 255]
            else:
                color = [int(c) for c in np.array(colors[classID]) * 255]
            cv2.rectangle(test_image, (startX, startY), (endX, endY), color, 2)
            text = "{}: {:.2f}".format(label, score)
            y = startY - 10 if startY - 10 > 10 else startY + 10
            cv2.putText(test_image, text, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return test_image

Week 6: Object detection with single stage and two stage detectors
In this tutorial, you’ll learn how to use the YOLO object detector to detect objects in both images
and video streams using Deep Learning, OpenCV, and Python.
By applying object detection, you’ll not only be able to determine what is in an image but also
where a given object resides!
We’ll start with a brief discussion of the YOLO object detector, including how the object detector
works.
From there we’ll use OpenCV, Python, and deep learning to:
Apply the YOLO object detector to images
Apply YOLO to video streams
We’ll wrap up the tutorial by discussing some of the limitations and drawbacks of the YOLO
object detector, including some of my personal tips and suggestions.
To learn how to use YOLO for object detection with OpenCV, just keep reading!
Program:
# import the necessary packages
import numpy as np
import argparse
import time
import cv2
import os
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
help="path to input image")
ap.add_argument("-y", "--yolo", required=True,
help="base path to YOLO directory")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
help="threshold when applying non-maxima suppression")
args = vars(ap.parse_args())
# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join([args["yolo"], "coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")
# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
dtype="uint8")
# derive the paths to the YOLO weights and model configuration
weightsPath = os.path.sep.join([args["yolo"], "yolov3.weights"])
configPath = os.path.sep.join([args["yolo"], "yolov3.cfg"])
# load our YOLO object detector trained on COCO dataset (80 classes)
print("[INFO] loading YOLO from disk...")
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
# load our input image and grab its spatial dimensions
image = cv2.imread(args["image"])
(H, W) = image.shape[:2]
# determine only the *output* layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]
# construct a blob from the input image and then perform a forward
# pass of the YOLO object detector, giving us our bounding boxes and
# associated probabilities
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
swapRB=True, crop=False)
net.setInput(blob)
start = time.time()
layerOutputs = net.forward(ln)
end = time.time()
# show timing information on YOLO
print("[INFO] YOLO took {:.6f} seconds".format(end - start))
# initialize our lists of detected bounding boxes, confidences, and
# class IDs, respectively
boxes = []
confidences = []
classIDs = []
# loop over each of the layer outputs
for output in layerOutputs:
    # loop over each of the detections
    for detection in output:
        # extract the class ID and confidence (i.e., probability) of
        # the current object detection
        scores = detection[5:]
        classID = np.argmax(scores)
        confidence = scores[classID]
        # filter out weak predictions by ensuring the detected
        # probability is greater than the minimum probability
        if confidence > args["confidence"]:
            # scale the bounding box coordinates back relative to the
            # size of the image, keeping in mind that YOLO actually
            # returns the center (x, y)-coordinates of the bounding
            # box followed by the boxes' width and height
            box = detection[0:4] * np.array([W, H, W, H])
            (centerX, centerY, width, height) = box.astype("int")
            # use the center (x, y)-coordinates to derive the top
            # and left corner of the bounding box
            x = int(centerX - (width / 2))
            y = int(centerY - (height / 2))
            # update our list of bounding box coordinates, confidences,
            # and class IDs
            boxes.append([x, y, int(width), int(height)])
            confidences.append(float(confidence))
            classIDs.append(classID)
# apply non-maxima suppression to suppress weak, overlapping bounding
# boxes
idxs = cv2.dnn.NMSBoxes(boxes, confidences, args["confidence"],
    args["threshold"])
# ensure at least one detection exists
if len(idxs) > 0:
    # loop over the indexes we are keeping
    for i in idxs.flatten():
        # extract the bounding box coordinates
        (x, y) = (boxes[i][0], boxes[i][1])
        (w, h) = (boxes[i][2], boxes[i][3])
        # draw a bounding box rectangle and label on the image
        color = [int(c) for c in COLORS[classIDs[i]]]
        cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
        text = "{}: {:.4f}".format(LABELS[classIDs[i]], confidences[i])
        cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
            0.5, color, 2)
# show the output image
cv2.imshow("Image", image)
cv2.waitKey(0)
python yolo.py --image images/baggage_claim.jpg --yolo yolo-coco
[INFO] loading YOLO from disk...
[INFO] YOLO took 0.347815 seconds

python yolo.py --image images/living_room.jpg --yolo yolo-coco
YOLO from disk...
YOLO took 0.340221 seconds

Single-stage detection using SSD


What is SSD?
Single Shot MultiBox Detector (SSD) is a popular and efficient deep learning-based object
detection method that has been widely adopted in various real-world applications. It was
introduced in 2016 and has since then been improved upon to deliver better accuracy and speed.

The key idea behind SSD is to perform object detection in a single shot, as opposed to two-stage
methods such as R-CNN and its variants, which use region proposals followed by classification. In
SSD, a convolutional neural network (CNN) is trained to predict object class scores and bounding
box offsets, directly from the feature maps generated by the base network.

One of the important innovations in SSD is the use of multiple feature maps with different
resolutions, which allows the network to handle objects of various sizes in an effective manner.
The network generates a set of default boxes over different aspect ratios and scales for each feature
map location and predicts class scores and bounding box offsets for each of these default boxes.

SSD combines predictions from multiple feature maps to achieve a balance between accuracy and
speed. It is computationally efficient, as it eliminates the bounding box proposals and subsequent

pixel or feature resampling stage. Additionally, SSD uses a small convolutional filter to predict
object categories and offsets in bounding box locations and applies these filters to multiple feature
maps from the later stages of the network to perform detection at multiple scales.
So:
Single Shot: the tasks of object localization and classification are done in a single
forward pass of the network.
MultiBox: the name of a technique for bounding box regression developed by Szegedy et al.
(we will briefly cover it shortly).
Detector: the network is an object detector that also classifies the detected objects.
SSD Architecture:
The SSD object detector is composed of two parts:
1. a feature extractor, and
2. an object detector.

Multiboxes are like the anchors of Faster R-CNN. We have multiple default boxes of different sizes
and aspect ratios across the entire image; SSD uses 8732 boxes. This helps with
finding the default box that most overlaps with the ground-truth bounding box containing an object.
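
To make this matching step concrete, here is a small illustrative sketch; the box coordinates are made-up [xmin, ymin, xmax, ymax] values, not SSD's real default boxes:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as [xmin, ymin, xmax, ymax]."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

ground_truth = [30, 30, 80, 90]                                   # example ground-truth box
default_boxes = [[25, 25, 75, 85], [0, 0, 40, 40], [50, 60, 120, 140]]

overlaps = [iou(d, ground_truth) for d in default_boxes]
# default boxes whose IoU with the ground truth exceeds a threshold (typically 0.5)
# are treated as positive matches during training
matched = [d for d, o in zip(default_boxes, overlaps) if o > 0.5]
print(overlaps, matched)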
MultiBox's loss function combines two critical components that made their way into SSD:

Confidence loss: this measures how confident the network is of the objectness of the computed
bounding box. Categorical cross-entropy is used to compute this loss.

Location loss: this measures how far away the network's predicted bounding boxes are from the
ground-truth ones in the training set. The L2 norm is used here.

Total loss: the total loss is a weighted sum of the confidence loss and the location loss, where the
alpha term helps us balance the contribution of the location loss.
Program:
conda create -n od python=3.9
conda activate od
pip install opencv-python numpy imutils
from imutils.video import FPS
import numpy as np
import imutils
import cv2
use_gpu = True
live_video = False
confidence_level = 0.5
fps = FPS().start()
ret = True
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
"bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
"dog", "horse", "motorbike", "person", "pottedplant", "sheep",
"sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))
net = cv2.dnn.readNetFromCaffe('ssd_files/MobileNetSSD_deploy.prototxt',
'ssd_files/MobileNetSSD_deploy.caffemodel')
if use_gpu:
    print("[INFO] setting preferable backend and target to CUDA...")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
print("[INFO] accessing video stream...")
if live_video:
    vs = cv2.VideoCapture(0)
else:
    vs = cv2.VideoCapture('test.mp4')
while ret:
    ret, frame = vs.read()
    if ret:
        frame = imutils.resize(frame, width=400)
        (h, w) = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(frame, 0.007843, (300, 300), 127.5)
        net.setInput(blob)
        detections = net.forward()
        for i in np.arange(0, detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence > confidence_level:
                idx = int(detections[0, 0, i, 1])
                box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
                (startX, startY, endX, endY) = box.astype("int")
                label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
                cv2.rectangle(frame, (startX, startY), (endX, endY), COLORS[idx], 2)
                y = startY - 15 if startY - 15 > 15 else startY + 15
                cv2.putText(frame, label, (startX, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5,
                    COLORS[idx], 2)
        cv2.imshow('Live detection', frame)
        if cv2.waitKey(1) == 27:
            break
        fps.update()
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

Week 7: Image Captioning with Vanilla RNNs


Image captioning is a branch of artificial intelligence research that focuses on recognising natural
scenes and explaining them in natural language. Image captioning could help with retrieval by
allowing us to organise and request pictorial or image-based content in new ways. Mapping the
space between visuals and text could be a sign of deeper progress. Creating captions for images is
an important task in both the fields of computer vision and natural language processing. The major
goal of this work is to capture how items in the image interact with one another and represent them
in plain language (like English).

The main challenge of this task is to capture how objects relate to each other in the image and to
express them in a natural language (like English). Image captioning has a variety of uses, including
editing software recommendations, virtual assistants, image indexing, accessibility for visually
impaired people, social media, and other natural language processing applications.
1) Using pretrained CNN to extract image features. A pretrained VGG16 CNN will be used to
extract image features which will be concatenated with the RNN output.
2) Prepare training data. The training captions will be tokenized and embedded using the
GLOVE word embeddings. The embeddings will be fed into the RNN (a minimal tokenization
sketch is given after this list).
3) Model definition
4) Training the model
5) Generating novel image captions using the trained model. Test images and images from the
internet will be used as input to the trained model to generate captions. The captions will be
examined to determine the weaknesses of the model and suggest improvements.
6) Beam search. We will use beam search to generate better captions using the model.
7) Model evaluation. The model will be evaluated using the BLEU and ROUGE metric.
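
For step 2, a minimal sketch of tokenizing and padding captions with the Keras utilities that are also imported in the code below; the two captions are placeholders, not from the actual training set:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# toy captions; real ones come from the training set, wrapped in start/end tokens
captions = ['startseq a dog runs on the grass endseq',
            'startseq a man rides a horse endseq']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1            # +1 for the padding index 0

sequences = tokenizer.texts_to_sequences(captions)    # words -> integer ids
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding='post')

print(vocab_size, padded.shape)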

1. Obtain Features Using Pre-trained Image Models


# from os import listdir
from pickle import dump
from pickle import load
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
import numpy as np
from keras.preprocessing.text import Tokenizer
from collections import Counter
from keras.utils import to_categorical
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout
from keras.layers.merge import add
from keras.utils import plot_model
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
The VGG16 model is used to extract the image features. We first load the model.

base_model = VGG16(include_top=True)
base_model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
The feature extraction model will use the VGG16 input as its model input. However, the second-to-last
layer "fc2" of VGG16 will be used as the output of our extraction model. This is because we do
not need the final softmax layer of VGG16.
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc2').output)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
=================================================================
Total params: 134,260,544
Trainable params: 134,260,544
Non-trainable params: 0
_________________________________________________________________
After the image model has been defined, we will use it to extract the features of all the images.

features = dict()
for file in listdir('Flicker8k_Dataset'):
img_path = 'Flicker8k_Dataset/' + file
img = load_img(img_path, target_size=(224, 224)) #size is 224,224 by default
x = img_to_array(img) #change to np array
x = np.expand_dims(x, axis=0) #expand to include batch dim at the beginning
x = preprocess_input(x) #make input confirm to VGG16 input format
fc2_features = model.predict(x)

name_id = file.split('.')[0] #take the file name and use as id in dict


features[name_id] = fc2_features

dump(features, open('features.pkl', 'wb')) # cannot use JSON because ndarray is not JSON serializable

2. Preparing Text Caption Data


We will create a dictionary with the filename as the key and a list of the corresponding captions as
the value. Before inserting captions into the dictionary, we need to preprocess them:

1. Remove all numbers and punctuation
2. Change all letters to lower case
3. Remove single-character words
4. Add '<START>' and '<END>' tokens to the target data
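As a quick, self-contained illustration of these cleaning steps (the function and example sentence
below are only illustrative, not part of the original pipeline), a raw caption is transformed as follows:

import string

# translation table that maps every punctuation character to None
translator = str.maketrans("", "", string.punctuation)

def clean_caption(caption):
    caption = caption.lower().translate(translator)        # lower case, strip punctuation
    words = [w for w in caption.split() if w.isalpha()]    # drop numbers and mixed tokens
    words = [w for w in words if len(w) > 1]               # drop single-character words
    return '<START> ' + ' '.join(words) + ' <END>'         # add sentence start/end tokens

print(clean_caption("A dog runs through the 2 parks, quickly!"))
# <START> dog runs through the parks quickly <END>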

2.1 Loading Data Set Image IDs


The data is divided into training/dev/test sets. The IDs of the images belonging to each set are
stored in a text file. We first define a function that loads the training/test/dev image IDs from the
corresponding files.

def load_data_set_ids(filename):
file = open(filename, 'r')
text = file.read()
file.close()

dataset = list()
for image_id in text.split('\n'):
if len(image_id) < 1:
continue
dataset.append(image_id)
return set(dataset)
training_set = load_data_set_ids('Flickr_8k.trainImages.txt')
dev_set = load_data_set_ids('Flickr_8k.devImages.txt')
test_set = load_data_set_ids('Flickr_8k.testImages.txt')
After the images for each set are identified, we clean up the captions by:

1. Removing all numbers and punctuation
2. Changing all letters to lower case
3. Removing single-character words
4. Adding '<START>' and '<END>' tokens to the target data


import string
filename = 'Flickr8k.token.txt'
file = open(filename, 'r')
token_text = file.read()
file.close()

translator = str.maketrans("", "", string.punctuation) #translation table that maps all punctuation to
None
image_captions = dict()
image_captions_train = dict()
image_captions_dev = dict()
image_captions_test = dict()
image_captions_other = dict()
corpus = list() #corpus used to train tokenizer
corpus.extend(['<START>', '<END>', '<UNK>']) #add SOS and EOS to list first

max_imageCap_len = 0

for line in token_text.split('\n'): # first split on new line


tokens = line.split(' ') #split on white space; the first segment is 1000268201_693b08cb0e.jpg#0, the following segments are the caption text
if len(line) < 2:
continue
image_id, image_cap = tokens[0], tokens[1:] #use the first segment as image id, the rest as
caption
image_id = image_id.split('#')[0] #strip out #0 from filename
image_cap = ' '.join(image_cap) #join image caption together again

image_cap = image_cap.lower() #change to lower case


image_cap = image_cap.translate(translator) #take out punctuation using a translation table

image_cap = image_cap.split(' ') #split string here because following two methods works on
word-level best
image_cap = [w for w in image_cap if w.isalpha()] #keep only words that are all letters
image_cap = [w for w in image_cap if len(w)>1]
image_cap = '<START> ' + ' '.join(image_cap) + ' <END>' #add sentence start/end; note syntax:
separator.join()

#update maximum caption length


if len(image_cap.split()) > max_imageCap_len:
max_imageCap_len = len(image_cap.split())

#add to dictionary
if image_id not in image_captions:
image_captions[image_id] = list() #create a new list if it does not yet exist
image_captions[image_id].append(image_cap)

#add to train/dev/test dictionaries


if image_id in training_set:
if image_id not in image_captions_train:
image_captions_train[image_id] = list() #create a new list if it does not yet exist
image_captions_train[image_id].append(image_cap)
corpus.extend(image_cap.split()) #add only training words to corpus to train tokenlizer

elif image_id in dev_set:


if image_id not in image_captions_dev:
image_captions_dev[image_id] = list() #create a new list if it does not yet exist
image_captions_dev[image_id].append(image_cap)

elif image_id in test_set:


if image_id not in image_captions_test:
image_captions_test[image_id] = list() #create a new list if it does not yet exist
image_captions_test[image_id].append(image_cap)
else:
if image_id not in image_captions_other:
image_captions_other[image_id] = list() #create a new list if it does not yet exist
image_captions_other[image_id].append(image_cap)

caption_train_tokenizer = Tokenizer() #initialize tokenizer


caption_train_tokenizer.fit_on_texts(corpus) #fit tokenizer on training data

fid = open("image_captions.pkl","wb")
dump(image_captions, fid)
fid.close()

fid = open("image_captions_train.pkl","wb")
dump(image_captions_train, fid)
fid.close()

fid = open("image_captions_dev.pkl","wb")
dump(image_captions_dev, fid)
fid.close()

fid = open("image_captions_test.pkl","wb")
dump(image_captions_test, fid)
fid.close()

fid = open("image_captions_other.pkl","wb")
dump(image_captions_other, fid)
fid.close()

fid = open("caption_train_tokenizer.pkl","wb")
dump(caption_train_tokenizer, fid)
fid.close()

fid = open("corpus.pkl","wb")
dump(corpus, fid)
fid.close()

corpus_count=Counter(corpus)
fid = open("corpus_count.pkl","wb")
dump(corpus_count, fid)
fid.close()

print("size of data =", len(image_captions), "size of training data =", len(image_captions_train),


"size of dev data =", len(image_captions_dev), "size of test data =", len(image_captions_test), "size
of unused data =", len(image_captions_other))
print("maximum image caption length =",max_imageCap_len)
size of data = 8092 size of training data = 6000 size of dev data = 1000 size of test data = 1000 size
of unused data = 92
maximum image caption length = 33

2.2 Using Pretrained Embeddings


Instead of training the word embeddings, we will use the pretrained GloVe word embeddings.
After loading the embedding file, we form the embedding matrix by extracting the embedding
vector of each word in the corpus and inserting it into the matrix row indexed by that word's token.

embeddings_index = dict()
fid = open('glove.6B.50d.txt' ,encoding="utf8")
for line in fid:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
fid.close()
EMBEDDING_DIM = 50
word_index = caption_train_tokenizer.word_index
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))

for word, idx in word_index.items():


embed_vector = embeddings_index.get(word)
if embed_vector is not None:
# words not found in embedding index will be all-zeros.
embedding_matrix[idx] = embed_vector

fid = open("embedding_matrix.pkl","wb")
dump(embedding_matrix, fid)
fid.close()

2.3 Generating Training Data for Progressive Loading

When using the RNN as the language model and an affine network to generate words, we need to
feed the already generated caption into the model to get the next word. Therefore, to generate a
caption of n words, the model needs to run n+1 times (n words plus the end token). During training, we also
need to run the model n+1 times, and generate a separate training sequence for each run. There are
6,000 images in the training data set and 5 captions for each image. The maximum length of a
caption is 33 words. This comes to a maximum of 6000×5×33, or 990,000, training samples. Generating
this many training samples at once (keeping in mind we need to concatenate the image features to
each sample too) would require a memory size of at least 32 GB.

Therefore, we will generate the training data on-the-fly, just before the model requires it. That is,
we will generate the training data one batch at a time, and then input the data into the model as
needed. This is often called progressive loading.

We first define a function that takes a caption of length n and generates the n+1 training samples.

def create_sequences(tokenizer, max_length, desc_list, photo, vocab_size):


X1, X2, y = list(), list(), list()
# walk through each description for the image
for desc in desc_list:
# encode the sequence
seq = tokenizer.texts_to_sequences([desc])[0] #[0] is used to take out the extra dim. This
changes from text to a number
# split one sequence into multiple X,y pairs
for i in range(1, len(seq)):
# split into input and output pair
in_seq, out_seq = seq[:i], seq[i]
# pad input sequence
in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
# encode output sequence
# import pdb; pdb.set_trace()
out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
# store
X1.append(photo)
X2.append(in_seq)
y.append(out_seq)
return np.array(np.squeeze(X1)), np.array(X2), np.array(y)
We then define a function that will generate batch_size images' worth of training data at a time. It calls
the previous create_sequences function batch_size times, and concatenates the data into a single
batch.

# data generator, intended to be used in a call to model.fit_generator()


def data_generator(descriptions, photos, tokenizer, max_length, batch_size, vocab_size):
# loop for ever over images
current_batch_size=0
while 1:
for key, desc_list in descriptions.items():
# retrieve the photo feature
if current_batch_size == 0:
X1, X2, Y = list(), list(), list()

imageFeature_id = key.split('.')[0]
photo = photos[imageFeature_id][0]
in_img, in_seq, out_word = create_sequences(tokenizer, max_length, desc_list, photo,
vocab_size)
#in_img = np.squeeze(in_img)
X1.extend(in_img)
X2.extend(in_seq)
Y.extend(out_word)
current_batch_size += 1
if current_batch_size == batch_size:
current_batch_size = 0
yield [[np.array(X1), np.array(X2)], np.array(Y)]
We test our progressive-loading generator.

from pickle import load


fid = open('features.pkl', 'rb')
image_features = load(fid)
fid.close()
# test the data generator
caption_max_length = 33
batch_size = 1
vocab_size = 7057
generator = data_generator(image_captions_train, image_features, caption_train_tokenizer,
caption_max_length, batch_size, vocab_size)
inputs, outputs = next(generator)
print(inputs[0].shape)
print(inputs[1].shape)
print(outputs.shape)

output:
(47, 4096)
(47, 33)
(47, 7057)
3 Model Definition
We are finally ready to define our model. We use the VGG16 model as the base CNN, with its
original layers frozen; its final softmax layer is replaced by a dense (affine) layer with 256 outputs,
preceded by a dropout layer, and the precomputed fc2 image features are fed into this branch. We
take the GloVe embedding, also with frozen parameters; the caption words are fed into the embedding,
and its output is fed into an LSTM with 256 units. The output of the LSTM (256 dimensions) and the
output of the image branch (256 dimensions) are concatenated to form a 512-dimensional input to a
dense layer, whose output is fed into a softmax over the vocabulary.

from keras.layers.merge import concatenate


def define_model_concat(vocab_size, max_length, embedding_matrix):
# feature extractor model
inputs1 = Input(shape=(4096,))
image_feature = Dropout(0.5)(inputs1)
image_feature = Dense(256, activation='relu')(image_feature)
# sequence model
inputs2 = Input(shape=(max_length,))
language_feature = Embedding(vocab_size, 50, weights=[embedding_matrix],
input_length=max_length, trainable=False)(inputs2)
#the Embedding layer above uses the pretrained GloVe matrix and is frozen (trainable=False)
language_feature = Dropout(0.5)(language_feature)
language_feature = LSTM(256)(language_feature)
# decoder model
output = concatenate([image_feature, language_feature])
output = Dense(256, activation='relu')(output)
output = Dense(vocab_size, activation='softmax')(output)
# tie it together [image, seq] [word]
model = Model(inputs=[inputs1, inputs2], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
# summarize model
print(model.summary())
plot_model(model, to_file='model_concat.png', show_shapes=True)
return model

fid = open("embedding_matrix.pkl","rb")
embedding_matrix = load(fid)
fid.close()

caption_max_length = 33
vocab_size = 7506
post_rnn_model_concat = define_model_concat(vocab_size, caption_max_length,
embedding_matrix)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_9 (InputLayer)            (None, 33)            0
__________________________________________________________________________________________________
input_8 (InputLayer)            (None, 4096)           0
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 33, 50)         375300      input_9[0][0]
__________________________________________________________________________________________________
dropout_7 (Dropout)             (None, 4096)           0           input_8[0][0]
__________________________________________________________________________________________________
dropout_8 (Dropout)             (None, 33, 50)         0           embedding_4[0][0]
__________________________________________________________________________________________________
dense_10 (Dense)                (None, 256)            1048832     dropout_7[0][0]
__________________________________________________________________________________________________
lstm_4 (LSTM)                   (None, 256)            314368      dropout_8[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)            0           dense_10[0][0]
                                                                   lstm_4[0][0]
__________________________________________________________________________________________________
dense_11 (Dense)                (None, 256)            131328      concatenate_1[0][0]
__________________________________________________________________________________________________
dense_12 (Dense)                (None, 7506)           1929042     dense_11[0][0]
==================================================================================================
Total params: 3,798,870
Trainable params: 3,423,570
Non-trainable params: 375,300

4 Training the Model
We use the progressive-loading data generator to generate the training data on-the-fly. For each
batch, we generate training data from 6 images.

fid = open("features.pkl","rb")
image_features = load(fid)
fid.close()

fid = open("caption_train_tokenizer.pkl","rb")
caption_train_tokenizer = load(fid)
fid.close()

fid = open("image_captions_train.pkl","rb")
image_captions_train = load(fid)
fid.close()

fid = open("image_captions_dev.pkl","rb")
image_captions_dev = load(fid)
fid.close()

caption_max_length = 33
batch_size = 100
vocab_size = 7506
#generator = data_generator(image_captions_train, image_features, caption_train_tokenizer, caption_max_length, batch_size, vocab_size)

#epochs = 2
#steps = len(image_captions_train)
#steps_per_epoch = np.floor(steps/batch_size)
batch_size = 6
steps = len(image_captions_train)
steps_per_epoch = np.floor(steps/batch_size)

epochs = 3
for i in range(epochs):
# create the data generator
generator = data_generator(image_captions_train, image_features, caption_train_tokenizer,
caption_max_length, batch_size, vocab_size)
# fit for one epoch
post_rnn_model_concat_hist=post_rnn_model_concat.fit_generator(generator, epochs=1,
steps_per_epoch=steps, verbose=1)
# save model
post_rnn_model_concat.save('modelConcat_1_' + str(i) + '.h5')
Epoch 1/1
6000/6000 [==============================] - 6933s 1s/step - loss: 3.8574 - acc: 0.2588
Epoch 1/1
6000/6000 [==============================] - 6904s 1s/step - loss: 3.0718 - acc: 0.3152
Epoch 1/1
6000/6000 [==============================] - 7606s 1s/step - loss: 2.8371 - acc: 0.3410
The training is terminated after 3 epochs. The loss was around 8 at the beginning of the training
process and quickly dropped to about 3 after 3 epochs. Training for more epochs would further reduce the loss.

5 Creating Captions Using the Trained Model


The model should converge fairly quickly, within a few hours or so. We can finally generate some
captions and see how well it does.

from pickle import load


from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model
After importing the packages, we load the VGG16 network. This is the same network used to
extract the features in our image captioning model. Loading it will take several minutes, so we load
it only once here. We also load the tokenizer saved earlier when we were pre-processing the
training data, and define a function to extract the image features.

base_model = VGG16(include_top=True)
feature_extract_pred_model = Model(inputs=base_model.input,
outputs=base_model.get_layer('fc2').output)
def extract_feature(model, file_name):
img = load_img(file_name, target_size=(224, 224)) #size is 224,224 by default
x = img_to_array(img) #change to np array
x = np.expand_dims(x, axis=0) #expand to include batch dim at the beginning
x = preprocess_input(x) #make input confirm to VGG16 input format
fc2_features = model.predict(x)
return fc2_features
# load the tokenizer
caption_train_tokenizer = load(open('caption_train_tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 33
# load the model
#pred_model = load_model('model_3_0.h5')
pred_model = load_model('modelConcat_1a_2.h5')
To generate the caption, we first initialize the caption with the '<START>' token. We then input the
caption into the model, which outputs the next word. The generated word is appended to the end of
the caption and fed back into the model. The iterative process stops when the 'end' token is
produced or the maximum caption length is reached.

def generate_caption(pred_model, caption_train_tokenizer, photo, max_length):


in_text = '<START>'
caption_text = list()
for i in range(max_length):
# integer encode input sequence
sequence = caption_train_tokenizer.texts_to_sequences([in_text])[0]
# pad input
sequence = pad_sequences([sequence], maxlen=max_length)
# predict next word
model_softMax_output = pred_model.predict([photo,sequence], verbose=0)
# convert probability to integer
word_index = argmax(model_softMax_output)
# map integer to word
word = caption_train_tokenizer.index_word[word_index]
#print(word)
# stop if we cannot map the word
if word is None:
break
# append as input for generating the next word
in_text += ' ' + word
# stop if we predict the end of the sequence
if word != 'end':
caption_text.append(word)
if word == 'end':
break
return caption_text
caption_image_fileName = 'running-dogs.jpg'
photo = extract_feature(feature_extract_pred_model, caption_image_fileName)
caption = generate_caption(pred_model, caption_train_tokenizer, photo, max_length)
print(' '.join(caption))
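Steps 6 and 7 of the outline (beam search and BLEU/ROUGE evaluation) are left as an extension.
A minimal sketch of the BLEU evaluation step is given below; it assumes NLTK is installed and that
the objects defined above (image_captions_test, image_features, pred_model, caption_train_tokenizer,
generate_caption and max_length) are already loaded.

from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = list(), list()
for image_id, caption_list in image_captions_test.items():
    # reference captions, tokenized and stripped of the <START>/<END> markers
    refs = [cap.split()[1:-1] for cap in caption_list]
    feature_id = image_id.split('.')[0]      # caption keys keep the .jpg extension
    photo = image_features[feature_id]       # fc2 features extracted earlier
    generated = generate_caption(pred_model, caption_train_tokenizer, photo, max_length)
    references.append(refs)
    hypotheses.append(generated)

print('BLEU-1: %f' % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print('BLEU-4: %f' % corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))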

Week 8: Image Captioning with LSTMs

Caption generation is a challenging artificial intelligence problem where a textual description must
be generated for a given photograph. It requires both methods from computer vision to understand
the content of the image and a language model from the field of natural language processing to turn
the understanding of the image into words in the right order. Recently, deep learning methods have
achieved state-of-the-art results on examples of this problem.

What is most impressive about these methods is that a single end-to-end model can be defined to
predict a caption given a photo, instead of requiring sophisticated data preparation or a pipeline of
specifically designed models.

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np
from numpy import array
import matplotlib.pyplot as plt
%matplotlib inline

import string
import os
import pandas as pd
from time import time
import cv2
from keras import Input, layers
from keras import optimizers
from keras.optimizers import Adam
from keras.preprocessing import sequence
from keras.preprocessing import image
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.layers import LSTM, Embedding, Dense, Activation, Flatten, Reshape,


Dropout,Lambda
from keras.layers.merge import add
from keras.utils import to_categorical
from keras.preprocessing.image import load_img,img_to_array
%config InlineBackend.figure_format = 'retina'
token_path = "../input/flickr-8k/Flickr8k_text/Flickr8k.token.txt"
train_images_path = "../input/flickr-8k/Flickr8k_text/Flickr_8k.trainImages.txt"
test_images_path = "../input/flickr-8k/Flickr8k_text/Flickr_8k.testImages.txt"
images_path = "../input/flickr-8k/Flickr8k_Dataset/Flicker8k_Dataset/"
glove_path = "../input/glove6b/"
doc = open(token_path, 'r').read()
print(doc[:400])
descriptions = dict()
for line in doc.split('\n'):
    tokens = line.split()
    if len(line) > 2:
        image_id = tokens[0].split('.')[0]
        image_desc = ' '.join(tokens[1:])
        image_desc = 'startseq ' + image_desc + ' endseq'
        if image_id not in descriptions:
            descriptions[image_id] = list()
        descriptions[image_id].append(image_desc)
len(descriptions.values())

table = str.maketrans('', '', string.punctuation)
for key, desc_list in descriptions.items():
    for i in range(len(desc_list)):
        desc = desc_list[i]
        desc = desc.split()
        desc = [word.lower() for word in desc]
        desc = [w.translate(table) for w in desc]
        desc_list[i] = ' '.join(desc)

pic = '1000268201_693b08cb0e.jpg'
x = plt.imread(images_path + pic)
plt.imshow(x)
plt.show()

descriptions['1000268201_693b08cb0e']
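The notebook above stops after loading and cleaning the captions. A minimal sketch of the remaining
LSTM captioning model is shown below; the sizes vocab_size, max_length and the 2048-dimensional
image feature input are assumptions (they would come from the fitted tokenizer, the longest training
caption, and a pretrained CNN encoder such as InceptionV3/ResNet with its top removed), and the
sketch mirrors the merge-style architecture used in Week 7 rather than reproducing the original
notebook exactly.

from keras.models import Model

# Hypothetical sizes for illustration only.
vocab_size = 8000
max_length = 34
embedding_dim = 200

# Image feature branch: a 2048-dimensional feature vector per image.
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Partial-caption branch: word embedding followed by an LSTM.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two branches and predict the next word.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')
caption_model.summary()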

Week 9: Network Visualization: Saliency Maps, Class Visualization
The saliency map is a key theme in deep learning and computer vision. During the training of a deep
convolutional neural network, it becomes essential to know the feature maps of every layer, since the
feature maps of a CNN tell us what characteristics the model is learning. Suppose we want to focus on a
particular part of an image; the concept that helps us here is the saliency map. It highlights the specific
pixels of an image that matter most for the prediction while ignoring the others.
We begin by creating a ResNet50 with ImageNet weights. With a few simple helper functions, we load the
image from disk and prepare it for feeding to the ResNet50.
# Import necessary packages
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
def input_img(path):
image = tf.image.decode_png(tf.io.read_file(path))
image = tf.expand_dims(image, axis=0)
image = tf.cast(image, tf.float32)
image = tf.image.resize(image, [224,224])
return image
def normalize_image(img):
grads_norm = img[:,:,0]+ img[:,:,1]+ img[:,:,2]
grads_norm = (grads_norm - tf.reduce_min(grads_norm))/ (tf.reduce_max(grads_norm)-
tf.reduce_min(grads_norm))
return grads_norm
def get_image():
import urllib.request
filename = 'image.jpg'
img_url = r"https://upload.wikimedia.org/wikipedia/commons/d/d7/White_stork_%28Ciconia_ciconia
%29_on_nest.jpg"
urllib.request.urlretrieve(img_url, filename)
def plot_maps(img1, img2,vmin=0.3,vmax=0.7, mix_val=2):
f = plt.figure(figsize=(15,45))
plt.subplot(1,3,1)
plt.imshow(img1,vmin=vmin, vmax=vmax, cmap="ocean")
plt.axis("off")
plt.subplot(1,3,2)
plt.imshow(img2, cmap = "ocean")
plt.axis("off")
plt.subplot(1,3,3)
plt.imshow(img1*mix_val+img2/mix_val, cmap = "ocean" )
plt.axis("off")
ResNet50, which produces the prediction vector, will be loaded from Keras applications directly.
test_model = tf.keras.applications.resnet50.ResNet50()
#test_model.summary()
get_image()
img_path = "image.jpg"
input_img = input_img(img_path)
input_img = tf.keras.applications.densenet.preprocess_input(input_img)
plt.imshow(normalize_image(input_img[0]), cmap = "ocean")
result = test_model(input_img)
max_idx = tf.argmax(result,axis = 1)
tf.keras.applications.imagenet_utils.decode_predictions(result.numpy())
A GradientTape function is available on TensorFlow 2.x that is capable of handling the backpropagation
related operations. Here, we will utilize the benefits of GradientTape to compute the saliency map of the
given image.
with tf.GradientTape() as tape:
tape.watch(input_img)
result = test_model(input_img)
max_score = result[0,max_idx[0]]
grads = tape.gradient(max_score, input_img)
plot_maps(normalize_image(grads[0]), normalize_image(input_img[0]))

Fig. 2: (1) saliency map, (2) input image, (3) overlay of the two
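The week also lists class visualization. A minimal sketch is given below, reusing the test_model,
normalize_image and Matplotlib setup from above: we start from a random image and run gradient
ascent on the input so that the score of a chosen ImageNet class is maximised. The class index, step
size, iteration count and pixel range are arbitrary illustrative choices, not part of the original code.

# Class visualization by gradient ascent on the input image (sketch).
target_class = 130          # assumption: an arbitrary ImageNet class index
img = tf.Variable(tf.random.uniform((1, 224, 224, 3), minval=-1.0, maxval=1.0))

step_size = 1.0
for i in range(100):
    with tf.GradientTape() as tape:
        tape.watch(img)
        scores = test_model(img)
        class_score = scores[0, target_class]
    grads = tape.gradient(class_score, img)
    grads /= tf.math.reduce_std(grads) + 1e-8      # normalize the gradient
    img.assign_add(step_size * grads)              # gradient ascent step
    img.assign(tf.clip_by_value(img, -1.0, 1.0))   # keep pixels in a plausible range

plt.imshow(normalize_image(img[0].numpy()), cmap="ocean")
plt.axis("off")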

Week 10: Generative Adversarial Networks (GANs)


Generative Adversarial Networks (GANs) are a powerful class of neural networks that are used for
unsupervised learning. They were developed and introduced by Ian J. Goodfellow in 2014. GANs are
basically made up of a system of two competing neural network models which are able to analyze,
capture and copy the variations within a dataset.
Generative: To learn a generative model, which describes how data is generated in terms of a probabilistic
model.
Adversarial: The training of a model is done in an adversarial setting.
Networks: Use deep neural networks as artificial intelligence (AI) algorithms for training purposes.
Handwritten Digits Generator With a GAN
Generative adversarial networks can also generate high-dimensional samples such as images. In this
example, you’re going to use a GAN to generate images of handwritten digits. For that, you’ll train the
models using the MNIST dataset of handwritten digits, which is included in the torchvision package.

To begin, you need to install torchvision in the activated gan conda environment:

$ conda install -c pytorch torchvision=0.5.0


Again, you’re using a specific version of torchvision to assure the example code will run, just like you did
with pytorch. With the environment set up, you can start implementing the models in Jupyter Notebook.
Open it and create a new Notebook by clicking on New and then selecting gan.
import torch
from torch import nn
import math
import matplotlib.pyplot as plt
import torchvision
import torchvision.transforms as transforms
Besides the libraries you’ve imported before, you’re going to need torchvision and transforms to obtain the
training data and perform image conversions.
Again, set up the random generator seed to be able to replicate the experiment:
torch.manual_seed(111)
Since this example uses images in the training set, the models need to be more complex, with a larger
number of parameters. This makes the training process slower, taking about two minutes per epoch when
running on CPU. You’ll need about fifty epochs to obtain a relevant result, so the total training time when
using a CPU is around one hundred minutes.
To reduce the training time, you can use a GPU to train the model if you have one available. However,
you’ll need to manually move tensors and models to the GPU in order to use them in the training process.
You can ensure your code will run on either setup by creating a device object that points either to the CPU
or, if one is available, to the GPU:
device = "" if torch.cuda.is_available():
device = torch.device("cuda")
else:
device = torch.device("cpu")
Later, you’ll use this device to set where tensors and models should be created, using the GPU if available.
Now that the basic environment is set, you can prepare the training data.
Preparing the Training Data
The MNIST dataset consists of 28 × 28 pixel grayscale images of handwritten digits from 0 to 9. To use
them with PyTorch, you’ll need to perform some conversions. For that, you define transform, a function to
be used when loading the data:
transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
)
The function has two parts:
transforms.ToTensor() converts the data to a PyTorch tensor.
transforms.Normalize() converts the range of the tensor coefficients.
The original coefficients given by transforms.ToTensor() range from 0 to 1, and since the image
backgrounds are black, most of the coefficients are equal to 0 when they’re represented using this range.
transforms.Normalize() changes the range of the coefficients to -1 to 1 by subtracting 0.5 from the original
coefficients and dividing the result by 0.5. With this transformation, the number of elements equal to 0 in
the input samples is dramatically reduced, which helps in training the models.
The arguments of transforms.Normalize() are two tuples, (M₁, ..., Mₙ) and (S₁, ..., Sₙ), with n representing
the number of channels of the images. Grayscale images such as those in MNIST dataset have only one
channel, so the tuples have only one value. Then, for each channel i of the image, transforms.Normalize()
subtracts Mᵢ from the coefficients and divides the result by Sᵢ.
Now you can load the training data using torchvision.datasets.MNIST and perform the conversions using

transform:
train_set = torchvision.datasets.MNIST(
root=".", train=True, download=True, transform=transform
)
The argument download=True ensures that the first time you run the above code, the MNIST dataset will be
downloaded and stored in the current directory, as indicated by the argument root.
Now that you’ve created train_set, you can create the data loader as you did before:
batch_size = 32
train_loader = torch.utils.data.DataLoader(
train_set, batch_size=batch_size, shuffle=True
)
You can use Matplotlib to plot some samples of the training data. To improve the visualization, you can use
cmap=gray_r to reverse the color map and plot the digits in black over a white background:

real_samples, mnist_labels = next(iter(train_loader))


for i in range(16):
ax = plt.subplot(4, 4, i + 1)
plt.imshow(real_samples[i].reshape(28, 28), cmap="gray_r")
plt.xticks([])
plt.yticks([])
The output should be something similar to the following:
Samples of the training set
As you can see, there are digits with different handwriting styles. As the GAN learns the distribution of the
data, it’ll also generate digits with different handwriting styles.
class Discriminator(nn.Module):
def __init__(self):
super().__init__()
self.model = nn.Sequential(
nn.Linear(784, 1024),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(1024, 512),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(512, 256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, 1),
nn.Sigmoid(),
)

def forward(self, x):


x = x.view(x.size(0), 784)
output = self.model(x)
return output
discriminator = Discriminator().to(device=device)
class Generator(nn.Module):
def __init__(self):
super().__init__()
self.model = nn.Sequential(
nn.Linear(100, 256),
nn.ReLU(),

nn.Linear(256, 512),
nn.ReLU(),
nn.Linear(512, 1024),
nn.ReLU(),
nn.Linear(1024, 784),
nn.Tanh(),
)
def forward(self, x):
output = self.model(x)
output = output.view(x.size(0), 1, 28, 28)
return output
generator = Generator().to(device=device)
training the model
lr = 0.0001
num_epochs = 50
loss_function = nn.BCELoss()
optimizer_discriminator = torch.optim.Adam(discriminator.parameters(), lr=lr)
optimizer_generator = torch.optim.Adam(generator.parameters(), lr=lr)
for epoch in range(num_epochs):
for n, (real_samples, mnist_labels) in enumerate(train_loader):
# Data for training the discriminator
real_samples = real_samples.to(device=device)
real_samples_labels = torch.ones((batch_size, 1)).to(
device=device
)
latent_space_samples = torch.randn((batch_size, 100)).to(
device=device
)
generated_samples = generator(latent_space_samples)
generated_samples_labels = torch.zeros((batch_size, 1)).to(
device=device
)
all_samples = torch.cat((real_samples, generated_samples))
all_samples_labels = torch.cat(
(real_samples_labels, generated_samples_labels)
)
# Training the discriminator
discriminator.zero_grad()
output_discriminator = discriminator(all_samples)
loss_discriminator = loss_function(
output_discriminator, all_samples_labels
)
loss_discriminator.backward()
optimizer_discriminator.step()
# Data for training the generator
latent_space_samples = torch.randn((batch_size, 100)).to(
device=device
)

# Training the generator


generator.zero_grad()
generated_samples = generator(latent_space_samples)

output_discriminator_generated = discriminator(generated_samples)
loss_generator = loss_function(
output_discriminator_generated, real_samples_labels
)
loss_generator.backward()
optimizer_generator.step()
# Show loss
if n == batch_size - 1:
print(f"Epoch: {epoch} Loss D.: {loss_discriminator}")
print(f"Epoch: {epoch} Loss G.: {loss_generator}")

Week 11: Chatbot using bi-directional LSTMs


Bidirectional long short-term memory (Bidirectional LSTM) is the process of making a neural
network have the sequence information in both directions, backwards (future to past) and forward
(past to future).
In a bidirectional LSTM, the input flows in two directions, which makes it different from a regular
LSTM. With a regular LSTM, the input flows in one direction only, either backwards or forward.
In a bidirectional LSTM, the input flows in both directions so that both future and past information
is preserved. For a better explanation, let's look at an example.

In the sentence "boys go to ….." we cannot fill in the blank on its own. But when we also have the
sentence "boys come out of school", we can easily predict the blank from the surrounding context;
this is the kind of behaviour we want from our model, and a bidirectional LSTM allows the neural
network to capture it.

In the diagram, we can see the flow of information through the backward and forward layers.
Bi-LSTMs are usually employed where sequence-to-sequence tasks are needed. This kind of network
can be used in text classification, speech recognition and forecasting models. Next, we are
going to build a bidirectional LSTM chatbot model using Python.
Step 1: Import all the packages
import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing

Step 2: Download all the data from kaggle
!pip install kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download -d kausr25/chatterbotenglish
!unzip /content/chatterbotenglish.zip
!wget https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true -
O chatbot_nlp.zip
!unzip chatbot_nlp.zip

Step 3: Preprocessing the data


a) Reading the data from the files
We parse each of the .yaml files.
1) Concatenate two or more sentences if the answer has two or more of them.
2) Remove unwanted data types which are produced while parsing the data.
3) Append <START> and <END> tags to all the answers.
4) Create a Tokenizer and load the whole vocabulary ( questions + answers ) into it.
from tensorflow.keras import preprocessing, utils
import os
import yaml
dir_path = '/content/chatbot_nlp/data'
files_list = os.listdir(dir_path + os.sep)
questions = list()
answers = list()
for filepath in files_list:
stream = open( dir_path + os.sep + filepath , 'rb')
docs = yaml.safe_load(stream)
conversations = docs['conversations']
for con in conversations:
if len( con ) > 2 :
questions.append(con[0])
replies = con[ 1 : ]
ans = ''
for rep in replies:
ans += ' ' + rep
answers.append( ans )
elif len( con )> 1:
questions.append(con[0])
answers.append(con[1])
answers_with_tags = list()
for i in range( len( answers ) ):
if type( answers[i] ) == str:
answers_with_tags.append( answers[i] )
else:
questions.pop( i )
answers = list()
for i in range( len( answers_with_tags ) ) :
answers.append( '<START> ' + answers_with_tags[i] + ' <END>' )
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

b) Preparing data for Seq2Seq model


This model requires 3 arrays: encoder_input_data, decoder_input_data and decoder_output_data.
For encoder_input_data: tokenize the questions and pad them to their maximum length.
For decoder_input_data: tokenize the answers and pad them to their maximum length.
For decoder_output_data: tokenize the answers and remove the first element from all the
tokenized_answers. This is the <START> element which was added earlier.
from gensim.models import Word2Vec
import re
vocab = []
for word in tokenizer.word_index:
vocab.append(word)
def tokenize(sentences):
tokens_list = []
vocabulary = []
for sentence in sentences:
sentence = sentence.lower()
sentence = re.sub('[^a-zA-Z]', ' ', sentence)
tokens = sentence.split()
vocabulary += tokens
tokens_list.append(tokens)
return tokens_list, vocabulary
#encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences( questions )
maxlen_questions = max( [len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions, maxlen =
maxlen_questions, padding = 'post')
encoder_input_data = np.array(padded_questions)
print (encoder_input_data.shape, maxlen_questions)
# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers ,
maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print ( decoder_input_data.shape , maxlen_answers )
# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers ,
maxlen=maxlen_answers , padding='post' )
onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print ( decoder_output_data.shape )
Step 4: Defining Encoder Decoder Model
encoder_inputs = tf.keras.layers.Input(shape=( maxlen_questions , ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True )
(encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 200 , return_state=True )
( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( maxlen_answers , ))


decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200 , mask_zero=True)
(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 200 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax )
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )


model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')
model.summary()
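Note that the encoder defined above uses a plain (unidirectional) LSTM. A hedged sketch of how the
encoder could instead be wrapped in tf.keras.layers.Bidirectional, to match the week's title, is shown
below; the forward and backward states are concatenated, so the decoder LSTM must use 400 units
instead of 200. The bi_* layer names are illustrative and not part of the original notebook.

# Sketch only: a bidirectional variant of the encoder above.
bi_encoder_inputs = tf.keras.layers.Input(shape=(maxlen_questions,))
bi_encoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(bi_encoder_inputs)
_, fwd_h, fwd_c, bwd_h, bwd_c = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(200, return_state=True))(bi_encoder_embedding)
bi_state_h = tf.keras.layers.Concatenate()([fwd_h, bwd_h])   # 400-dimensional
bi_state_c = tf.keras.layers.Concatenate()([fwd_c, bwd_c])   # 400-dimensional
bi_encoder_states = [bi_state_h, bi_state_c]

bi_decoder_inputs = tf.keras.layers.Input(shape=(maxlen_answers,))
bi_decoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, 200, mask_zero=True)(bi_decoder_inputs)
bi_decoder_lstm = tf.keras.layers.LSTM(400, return_state=True, return_sequences=True)
bi_decoder_outputs, _, _ = bi_decoder_lstm(bi_decoder_embedding, initial_state=bi_encoder_states)
bi_output = tf.keras.layers.Dense(VOCAB_SIZE, activation='softmax')(bi_decoder_outputs)

bi_model = tf.keras.models.Model([bi_encoder_inputs, bi_decoder_inputs], bi_output)
bi_model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')
bi_model.summary()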

Step 5: Training the Model


We train the model for a number of epochs with RMSprop optimizer and categorical_crossentropy
loss function.
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=50,
epochs=150 )
model.save( 'model.h5' )
Step 6: Defining Inference Models
Encoder inference model: takes a question as input and outputs the LSTM states (h and c).
Decoder inference model: takes two inputs, the LSTM states and the answer input sequence; it
outputs the answer for the question that was fed to the encoder model, together with its updated
state values.

def make_inference_models():
encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
decoder_state_input_h = tf.keras.layers.Input(shape=( 200 ,))
decoder_state_input_c = tf.keras.layers.Input(shape=( 200 ,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
decoder_embedding , initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = tf.keras.models.Model(
[decoder_inputs] + decoder_states_inputs,
[decoder_outputs] + decoder_states)
return encoder_model , decoder_model
Step 7: Talking with the Chatbot
We define a method str_to_tokens which converts string questions to integer tokens with padding.
First, we take a question as input and predict the state values using enc_model.
We set these state values in the decoder's LSTM.
Then, we create a target sequence which contains the <START> element and feed this sequence into the dec_model.
We replace the <START> element with the element predicted by the dec_model and update the
state values.
We carry out the above steps iteratively until we hit the <END> tag or the maximum answer length.
def str_to_tokens( sentence : str ):
words = sentence.lower().split()
tokens_list = list()
for word in words:
tokens_list.append( tokenizer.word_index[ word ] )
return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions ,
padding='post')
enc_model , dec_model = make_inference_models()
for _ in range(10):
states_values = enc_model.predict( str_to_tokens( input( 'Enter question : ' ) ) )
empty_target_seq = np.zeros( ( 1 , 1 ) )
empty_target_seq[0, 0] = tokenizer.word_index['start']
stop_condition = False
decoded_translation = ''
while not stop_condition :
dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
sampled_word = None
for word , index in tokenizer.word_index.items() :
if sampled_word_index == index :
decoded_translation += ' {}'.format( word )
sampled_word = word
if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
stop_condition = True
empty_target_seq = np.zeros( ( 1 , 1 ) )
empty_target_seq[ 0 , 0 ] = sampled_word_index
states_values = [ h , c ]
print( decoded_translation )
Conversion to TFLite
We can convert our seq2seq model to a TensorFlow Lite model so that we can use it on edge
devices
!pip install tf-nightly
converter = tf.lite.TFLiteConverter.from_keras_model( enc_model )
buffer = converter.convert()
open( 'enc_model.tflite' , 'wb' ).write( buffer )
converter = tf.lite.TFLiteConverter.from_keras_model( dec_model )
buffer = converter.convert()
open( 'dec_model.tflite' , 'wb' ).write( buffer )

Week 12: Familiarization of cloud based computing like Google colab


To begin, go to the Google Colab website and sign in by clicking the blue button on the top right of
the web page. You must have a Gmail account to use this tool. If you do not already have a Gmail
account, then you will have to create one.
Once you are signed in, you should see a pop-up box similar to the one shown below.

You can create a new notebook file by clicking on NEW NOTEBOOK, but for now, close the pop
up by clicking on cancel or by clicking on the shaded area outside of the pop-up. Another way of
creating a new notebook is to click on the File tab (top left) -> New Notebook.
Notice that there are options for opening different files and uploading files, these will be used later.
Create a new Colab notebook following the steps provided in the above section.
Before we start getting into the coding, let’s familiarize ourselves with the user interface (UI) of
Google Colab.
What the different buttons mean:
1.Files: Here you will be able to upload datasets and other files from both your computer and
Google Drive
2.Code Snippets: Here you will be able to find prewritten snippets of code for different
functionalities like adding new libraries or referencing one cell from another.
3.Run Cell: This is the run button. Clicking this will run any code that is inserted in the cell beside
it. You can use the shortcut shift+enter to run the current cell and exit to a new one.
4.Table of Contents: Here you will be able to create and traverse different sections inside of your
notebook. Sections allow you to organize your code and improve readability.
5.Menu Bar: Like in any other application, this menu bar can be used to manipulate the entire file
or add new files. Look over the different tabs and familiarize yourself with the different options. In
particular, make sure you know how to upload or open a notebook and download the notebook (all
of these options are under “File”).
6.File Name: This is the name of your file. You can click on it to change the name. Do not edit the
extension (.ipynb) while editing the file name as this might make your file unopenable.
7.Insert Code Cell: This button will add a code cell below the cell you currently have selected.
8.Insert Text Cell: This button will add a text cell below the cell you currently have selected.
9.Cell: This is the cell. This is where you can write your code or add text depending on the type of
cell it is.
10.Output: This is the output of your code, including any errors, will be shown.
11.Clear Output: This button will remove the output.

12.RAM and Disk: All of the code you write will run on Google's computers, and you will only
see the output. This means that even if you have a slow computer, running big chunks of code will
not be an issue. Google only allots a certain amount of RAM and disk space for each user, so be
mindful of that as you work on larger projects.
13.Link to Cell: This button will create a URL that will link to the cell you have selected.
14.Comment: This button will allow you to create a comment on the selected cell. Note that this
will be a comment on (about) the cell and not a comment in the cell.
15.Settings: This button will allow you to change the Theme of the notebook, font type, and size,
indentation width.
16.Delete Cell: This button will delete the selected cell.
17.More Options: Contains options to cut and copy a cell as well as the option to add form and
hide code.
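To try out the interface, you can run a few cells like the ones below in a new notebook. files.upload()
and drive.mount() are the standard google.colab helpers for getting data into the runtime, and
!nvidia-smi is a quick way to check whether a GPU runtime is attached (the exact output will depend
on the runtime type you selected).

# Run these in separate Colab code cells.
from google.colab import files, drive

uploaded = files.upload()          # opens a file picker and copies the chosen file to the Colab VM
drive.mount('/content/drive')      # mounts your Google Drive under /content/drive

!nvidia-smi                        # shows GPU details if Runtime > Change runtime type > GPU is selected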

