Incremental Training For Image Classification of Unseen Objects
2 authors, including:
Harshil Jain
Indian Institute of Technology Gandhinagar
All content following this page was uploaded by Harshil Jain on 31 October 2020.
Abstract
Object detection is a computer vision technique for locating instances of semantic objects of a given class
in digital images and videos. It is the backbone of many practical applications of computer vision, including
image retrieval, self-driving cars, face recognition, object tracking and video surveillance, and is therefore
relevant to many fields in today’s world. Object detection can be achieved through traditional machine
learning approaches, which rely on features such as histogram of oriented gradients (HoG) or scale-invariant
feature transform (SIFT), and through deep learning approaches, which fall into two broad categories. The
first comprises architectures that use two neural networks and rely on region proposals (R-CNN, Fast R-CNN
and Faster R-CNN); the second comprises single shot detectors, which include You Only Look Once (YOLO) and
the Single Shot MultiBox Detector (SSD). YOLO and SSD are considerably faster than R-CNN and its derivatives.
YOLO uses Darknet for feature extraction followed by convolutional layers for object localization, while
SSD uses VGG-16 for feature extraction. Although object detection is gaining the attention of the research
community, most work has concentrated on improving current object detection algorithms. Detection of
objects from unseen classes, on which the networks were never trained, has been overlooked. In this work,
an attempt has been made to understand the YOLO architecture and answer various questions related to it,
and to extend existing single shot detectors such as YOLO and SSD to classify unseen classes in real time
through incremental learning. This is valuable because it is very difficult to retrain these large
convolutional networks, particularly in real time, every time new classes are added.
Keywords: Object Detection, Single Shot detectors, YOLO, Convolutional Neural Networks, Incremental
Learning, Computer Vision
1 Introduction
1.1 Background
Humans have the ability to glance at an image and tell what objects are there in that image and where
they are situated. The human visual system is fast and accurate, allowing us to perform complex tasks like
driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers
to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to
human users, and unlock the potential for general purpose, responsive robotic systems.
Current object detection algorithms take a classifier for an object and evaluate it at various locations
and scales in a test image. Some systems use a sliding window approach, in which a window of a particular
size is moved across the image at a fixed stride to check where the object might lie. More recent approaches
such as Region-based CNN (R-CNN) use image segmentation to propose regions that are likely to contain an
object and then pass them through convolutional layers for localization and classification. These algorithms
take considerable time to produce the bounding boxes and the classes the objects belong to, because each
image passes through a large convolutional network many times. Hence, these detectors cannot be used to
detect objects in real time because of their very high latency.
The You Only Look Once (YOLO) algorithm differs substantially from all of these. It reframes object
detection as a single regression problem, straight from image pixels to bounding box coordinates and class
probabilities. It runs only a single pass of the image through the convolutional network and predicts where
and what the objects are. Because of this, YOLO can be used to detect objects in real time, and it also
outperforms traditional object detection methods.
1.4 Scope
The present work involves experimenting with YOLO and, mainly, incremental training of the VGG16 network
for classification of unseen classes. The approach can be extended to object localization of unseen
classes by incremental training.
2 Literature Review
2.1 Information
2.1.1 YOLOv1
YOLOv1 is a single convolutional network that simultaneously predicts multiple bounding boxes and class
probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance.
Figure 1: The architecture has 24 convolutional layers followed by 2 fully connected layers. The
model is pretrained on the ImageNet classification task at half the resolution (224 × 224 input
image), and the resolution is then doubled for detection.
Network Design: The model has 24 convolutional layers followed by 2 fully connected layers (the Darknet
framework). The convolutional layers are mainly responsible for feature extraction while the fully connected
layers mainly predict the output probabilities and coordinates. The final output is a 7 × 7 × 30 tensor.
The model divides the input image into an S × S grid. If the centre of an object falls into a grid cell,
that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes, and the
algorithm predicts a confidence score for each of these boxes. The confidence score reflects both how likely
it is that the box contains an object and, if it does, how accurately the bounding box has been drawn.
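The YOLO paper formalises this confidence as Pr(Object) multiplied by the intersection over union (IoU) between the predicted box and the ground truth box. The following is a minimal sketch of that definition, assuming boxes are given as hypothetical (x_min, y_min, x_max, y_max) tuples; the function names are illustrative and not taken from any YOLO implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def target_confidence(object_present, predicted_box, truth_box):
    """Training target for the confidence score: IoU if an object exists, else 0."""
    return iou(predicted_box, truth_box) if object_present else 0.0
```

A box that contains no object therefore has a target confidence of zero, while a perfectly drawn box around an object has a target of one.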
Figure 2: The image is divided into an S × S grid, and for each grid cell the model predicts B
bounding boxes, confidence for those boxes, and C class probabilities. These predictions are
encoded as an S × S × (B ∗ 5 + C) tensor.
Each bounding box has 5 predictions: x, y, w, h and a confidence score. The (x, y) are the co-ordinates of
the centre of the box relative to the grid cell, while w and h are relative to the entire image. Each grid cell
also has C conditional class probabilities, which represent the probability that the object belongs to each
class given that the grid cell contains an object. The output tensor therefore has the dimensions
S × S × (B ∗ 5 + C). YOLOv1 divides the image into a 7 × 7 grid with 2 bounding boxes per grid cell. Since
YOLO was evaluated on PASCAL VOC, which has 20 classes, we have S = 7, B = 2 and C = 20, and the final
prediction is a 7 × 7 × 30 tensor.
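The shape arithmetic above can be checked with a short sketch. The function name is hypothetical; only the formula S × S × (B ∗ 5 + C) comes from the text.

```python
def yolo_v1_output_shape(s, b, c):
    """Shape of the YOLOv1 prediction tensor: S x S x (B * 5 + C)."""
    return (s, s, b * 5 + c)


# PASCAL VOC setting used by YOLOv1: S = 7, B = 2, C = 20.
shape = yolo_v1_output_shape(7, 2, 20)
total_values = shape[0] * shape[1] * shape[2]
```

With these values the depth is 2 ∗ 5 + 20 = 30, so the network emits 7 ∗ 7 ∗ 30 = 1470 numbers per image.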
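\[
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{I}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]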
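where I_i^obj denotes whether an object appears in grid cell i, I_ij^obj denotes that the j-th bounding box
predictor in cell i is responsible for that prediction, and I_ij^noobj is its complement. In the YOLO paper,
λcoord = 5 and λnoobj = 0.5, so that localization errors are weighted more heavily and confidence errors from
boxes that contain no object are weighted less.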
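To make the five terms of the loss concrete, the following NumPy sketch evaluates them for one image. The array layout is an assumption made for illustration: boxes are stored as (cells, B, 5) arrays holding x, y, w, h and confidence, class probabilities as (cells, C) arrays, and `responsible` is a boolean (cells, B) mask standing in for the indicator I_ij^obj. The default weights are the values reported in the YOLO paper; this is not the Darknet implementation.

```python
import numpy as np


def yolo_v1_loss(pred_boxes, true_boxes, pred_classes, true_classes,
                 responsible, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLOv1 loss for a single image (illustrative layout)."""
    resp = responsible.astype(float)                      # I_ij^obj
    no_resp = 1.0 - resp                                  # I_ij^noobj
    cell_has_obj = responsible.any(axis=1).astype(float)  # I_i^obj

    # Centre error, and size error on square roots (widths/heights must be >= 0).
    xy_err = ((pred_boxes[..., 0:2] - true_boxes[..., 0:2]) ** 2).sum(axis=-1)
    wh_err = ((np.sqrt(pred_boxes[..., 2:4])
               - np.sqrt(true_boxes[..., 2:4])) ** 2).sum(axis=-1)
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    class_err = ((pred_classes - true_classes) ** 2).sum(axis=-1)

    return (lambda_coord * (resp * xy_err).sum()
            + lambda_coord * (resp * wh_err).sum()
            + (resp * conf_err).sum()
            + lambda_noobj * (no_resp * conf_err).sum()
            + (cell_has_obj * class_err).sum())
```

Taking square roots of the width and height means that the same absolute error costs more on a small box than on a large one.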
2.1.2 YOLOv2
YOLOv2 brought a lot of changes to YOLOv1. It divides the image into 13 × 13 grid cells and removes the
fully connected layers. The final output here is a 13 × 13 × [5 × (4 + 1 + 20)] tensor: for each of 5 anchor
boxes there are 4 bounding box offsets, 1 confidence score and 20 class predictions. The major changes in
YOLOv2 compared to YOLOv1 are as follows.
• Batch Normalization
Batch normalization leads to significant improvements in convergence while eliminating the need for
other forms of regularization. By adding batch normalization on all of the convolutional layers in YOLO, an
improvement in mAP is observed. Because batch normalization also regularizes the model and helps prevent
overfitting, dropout can be removed.
• High resolution classifier
For YOLOv2, the classification network is fine tuned at the full 448 × 448 resolution for 10 epochs on
ImageNet. This gives the network time to adjust its filters to work better on higher resolution input. We
then fine tune the resulting network on detection. This high resolution classification network gives us an
increase of almost 4% mAP.
• Convolution with anchor boxes
Anchor boxes are predefined bounding boxes which are used as guidelines by the network to predict
offsets for bounding boxes from the anchor box under consideration.
Figure 3: An illustration where 5 anchor boxes of different shapes and sizes are selected for the
model
Instead of predicting 5 arbitrary bounding boxes, we predict offsets to each of the anchor boxes above.
If we constrain the offset values, we can maintain the diversity of the predictions and have each prediction
focus on a specific shape, so the initial training is more stable.
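\[
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
\]
where t_x, t_y, t_w and t_h are the outputs of the YOLO network, σ is the sigmoid function, (c_x, c_y) is the
top left corner of the grid cell of the anchor, p_w and p_h are the width and height of the anchor, and b_x,
b_y, b_w and b_h are the centre co-ordinates, width and height of the predicted bounding box.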
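The decoding step described by these four equations can be sketched as follows. The function name and argument order are hypothetical, and all quantities are assumed to be expressed in grid-cell units.

```python
import math


def sigmoid(t):
    """Logistic function, which squashes t into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box centre, width and height."""
    b_x = sigmoid(t_x) + c_x   # centre stays inside the cell whose corner is (c_x, c_y)
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)  # anchor width scaled by a positive factor
    b_h = p_h * math.exp(t_h)  # anchor height scaled by a positive factor
    return b_x, b_y, b_w, b_h
```

With all four raw outputs equal to zero, the predicted box sits at the centre of its grid cell and has exactly the anchor's width and height, which is why constraining the offsets keeps each predictor tied to one shape.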
• Dimension clusters
YOLOv2 uses k-means clustering on the ground truth boxes of the training dataset to pick the k anchor
boxes instead of handpicking the anchor boxes. k = 5 is chosen as a good trade off between model complexity
and high recall.
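A minimal sketch of such dimension clustering is given below. It assumes each ground truth box is reduced to a hypothetical (width, height) pair and uses 1 − IoU as the distance, which is the metric described in the YOLOv2 paper; the seeding scheme and function names are illustrative choices, not those of Darknet.

```python
def wh_iou(box, centroid):
    """IoU of two boxes that share the same centre, each given as (width, height)."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchors; requires k <= len(boxes)."""
    # Deterministic, evenly spaced seeds (an illustrative choice).
    centroids = [boxes[i * len(boxes) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as the smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering simply because their absolute errors are larger.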
2.1.3 YOLOv3
YOLOv3 improves on the previous versions in the following aspects:
• Class Prediction
Many classifiers assume that classes are mutually exclusive, but YOLOv3 performs multi-label
classification. It replaces the softmax function with independent logistic classifiers. Classes like
‘pedestrian’ and ‘man’ are not mutually exclusive, and hence the sum of the class confidences can be greater
than 1. The classification loss is also changed from mean squared error to binary cross-entropy.
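The difference between softmax and independent logistic outputs can be illustrated with a small sketch. The logits and the three class names in the comments are made up for the example; the loss is summed over classes for simplicity.

```python
import math


def softmax(logits):
    """Mutually exclusive class probabilities that always sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probs, targets):
    """Binary cross-entropy summed over classes; targets are 0 or 1."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(probs, targets))


logits = [2.0, 2.0, -3.0]                     # e.g. pedestrian, man, car
exclusive = softmax(logits)                   # sums to 1, so the two true labels compete
independent = [sigmoid(z) for z in logits]    # each label is scored on its own
loss = binary_cross_entropy(independent, [1, 1, 0])
```

With softmax the two overlapping labels must share the probability mass, whereas the logistic outputs can both be close to 1.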
YOLOv3 assigns only 1 bounding box anchor to each ground truth object. It predicts an objectness score
for each bounding box using logistic regression. The target value is 1 if the bounding box anchor overlaps
a ground truth object to a greater extent than any other anchor. Other anchors whose overlap is greater
than a predefined threshold (default 0.5) are ignored and incur no cost.
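The assignment rule can be sketched as below for a single ground truth object. The overlaps are assumed to be precomputed IoU values, and the use of `None` to mark ignored anchors is an illustrative convention rather than anything taken from the YOLOv3 code.

```python
def objectness_targets(ious, ignore_threshold=0.5):
    """Label each anchor: 1 = best match, None = ignored, 0 = no object.

    `ious` holds the overlap of every anchor with one ground truth object.
    """
    best = max(range(len(ious)), key=lambda i: ious[i])
    targets = []
    for i, overlap in enumerate(ious):
        if i == best:
            targets.append(1)        # the single anchor assigned to the object
        elif overlap > ignore_threshold:
            targets.append(None)     # good but not best: excluded from the loss
        else:
            targets.append(0)        # trained towards zero objectness
    return targets
```

For example, overlaps of 0.3, 0.8, 0.6 and 0.1 yield one positive anchor, one ignored anchor and two negatives.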
YOLOv3 predicts boxes at 3 different scales, extracting features from those scales in a manner similar to
a Feature Pyramid Network. It predicts 3 boxes at each scale, so the output tensor at each scale is
N × N × [3 × (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
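As a worked example of this arithmetic, the depth of the YOLOv3 output at one scale can be compared with the YOLOv2 output described earlier. The grid size N = 13 is assumed purely for illustration.

```latex
\[
\begin{aligned}
\text{YOLOv2 depth} &= 5 \times (4 + 1 + 20) = 5 \times 25 = 125 \\
\text{YOLOv3 depth} &= 3 \times (4 + 1 + 80) = 3 \times 85 = 255 \\
\text{YOLOv3 values at } N = 13 &= 13 \times 13 \times 255 = 43{,}095
\end{aligned}
\]
```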
• Feature Extractor
YOLOv3 uses the Darknet-53 feature extractor, which has 53 convolutional layers. This new network is much
more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152.
Figure 5: Darknet-53
2.2 Summary
The details of the various versions of the YOLO algorithm have been summarised above. YOLO can detect
only objects belonging to the classes on which it was trained. Incremental training for prediction of unseen
classes (which can be used for applications like video surveillance) is important because it is impractical
to retrain the entire network every time new classes are added.
3 Experiments Related to YOLO Architecture
3.1 Confidence Score of New Objects
The definition of the confidence score stated in the YOLO paper is ambiguous: it does not state clearly
whether the confidence score is high for any object that is present or only for objects belonging to the
classes on which the network was trained. When predicting an object, the YOLO algorithm outputs the actual
confidence score multiplied by the class probability as the final confidence score. I modified the code to
output only the actual confidence score. The results obtained on classes for which YOLO was not trained are
as follows. (The bounding boxes are visible here only because the confidence threshold was reduced to 0.01
to reveal them. The actual YOLO algorithm would output nothing for these images.)
This experiment concludes that the confidence score is high only for objects of the classes
on which the YOLO algorithm was trained.
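The modification described in this experiment can be sketched as follows. The function names and the flag are hypothetical and do not correspond to the Darknet source; the sketch only shows the two scoring modes and the effect of the threshold.

```python
def final_score(box_confidence, class_probs, raw_confidence_only=False):
    """Score used to decide whether a box is reported.

    By default the box confidence is multiplied by the best class probability;
    with raw_confidence_only=True only the box confidence is returned,
    mirroring the modification made for this experiment.
    """
    if raw_confidence_only:
        return box_confidence
    return box_confidence * max(class_probs)


def is_drawn(score, threshold):
    """A box is drawn only when its score reaches the detection threshold."""
    return score >= threshold
```

Lowering the threshold, for instance to the 0.01 used above, lets low-confidence boxes on untrained classes become visible so that their raw confidence can be inspected.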
3.2 Effect of Background on YOLO training
While going through the YOLO architecture, we wanted to check whether the background has an effect on
YOLO training. To check this, we performed the following experiment.
3.2.3 Training
We trained the model using yolov3-tiny.cfg, a tiny version of YOLOv3, on these 800 images on an NVIDIA
GeForce GTX 1080Ti GPU. We then used the .weights file saved after 10,000 iterations, when the loss was
around 0.14.
3.3 Effect of Bounding Box Size on training
An experiment was conducted to check whether the size of the ground truth bounding boxes has an effect on
YOLO training.
3.3.3 Training
We trained the model using yolov3.cfg on these 2600 images on an NVIDIA GeForce GTX 1080Ti GPU. We then
used the .weights file saved after 23,000 iterations.
3.3.4 Testing
We then tested on new images whose ground truth boxes had a length and breadth greater than 0.5 times the
image dimensions, and YOLO was able to detect dogs and cats in those images with very good accuracy. However,
for test images with ground truth boxes having a length and breadth less than 0.5 times the image dimensions,
either the bounding box did not enclose the object accurately or YOLO did not draw a bounding box at all
[Figures 1 and 2]. For those images for which it did draw a bounding box, it gave a bigger bounding box
having an aspect ratio greater than 0.5, as shown [Figures 3 and 4]. As expected, for images with ground
truth boxes having a length and breadth greater than 0.5 times the image dimensions, it was able to draw
bounding boxes accurately [Figures 5 and 6]. A bounding box was produced for only 9.37% of the test images.
This shows that the size of the ground truth bounding boxes used for YOLO has an effect
on the training.
3.4 Training YOLOv1 on Tensorflow
YOLOv1 was trained purely in TensorFlow on the PASCAL VOC dataset. Here is the code which was
implemented. Transfer learning was employed: Darknet weights trained for classification were used to
initialise the first 20 convolutional layers of YOLOv1, and the weights of the fully connected layers were
randomly initialised. We trained it for 230,000 iterations, which took around 2.5 days on an NVIDIA GeForce
GTX 1080Ti GPU, and stopped when the loss was around 1.61. It gave some mispredictions, probably because
the training was stopped at a loss of 1.61; to achieve better accuracy the loss should fall further, to
around 0.06.
4 Methodology for Transfer Learning and Incremental Training
Experiments were performed to train the VGG16 network on the CIFAR-10 dataset. VGG16 consists of 13
convolutional layers, 3 fully connected layers and 5 max-pool layers, as shown below. There is a ReLU
activation function after every convolutional layer.
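The layer counts can be verified with a short sketch of the convolutional part of VGG16. The configuration list follows the channel widths of the original VGG paper; the assumption that every convolution preserves the spatial size and every max-pool halves it, and the 32 × 32 CIFAR-10 input size, are stated here for illustration and are not taken from the authors' code.

```python
# Channel width of each convolutional layer; "M" marks a 2x2 max-pool.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def summarize(config, input_size):
    """Return (number of conv layers, number of pool layers, final spatial size)."""
    conv_layers = sum(1 for item in config if item != "M")
    pool_layers = config.count("M")
    size = input_size
    for item in config:
        if item == "M":
            size //= 2  # each max-pool halves the height and width
    return conv_layers, pool_layers, size


conv_count, pool_count, final_size = summarize(VGG16_CONFIG, 32)
```

For a 32 × 32 input, the five pooling stages reduce the feature map to 1 × 1 with 512 channels before the fully connected layers.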
• Additional layers (batch normalization, dropout) were introduced to improve accuracy and reduce
overfitting.
Only the key observations that have significant impact on the training procedure are reported.
4.1.2 Partial Learning followed by Incremental Training
Introducing batch normalization (BN) between all layers improves accuracy by reducing the covariate
shift that is introduced as the normalized image passes through the layers. The network is expected to
learn to normalize the activations between the layers; hence, BN introduces trainable parameters alongside
the (convolutional/FC) weights, and the network co-optimizes the weights and the BN parameters. When only
the BN parameters were trained throughout the network (and not the weights), the network still produced
reasonable accuracy, as we will see in the results. We use these weights for the next experiment as follows.
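The trainable BN parameters referred to here are the per-feature scale and shift. A minimal NumPy sketch of the training-time forward pass is shown below, assuming a hypothetical (batch, features) input; running statistics for inference are omitted.

```python
import numpy as np


def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply the trainable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean and unit variance per feature
    return gamma * x_hat + beta              # gamma and beta are the trainable BN parameters
```

Training only `gamma` and `beta` while the convolutional and FC weights stay fixed corresponds to the "BN parameters only" setting described above.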
It is sufficient to train only the last few convolutional layers and the FC layers on a part of the
dataset. It could be reasoned that the initial layers are mostly involved in extracting the features and the
final layers are involved in regression. Not modifying the initial features also helps in preventing
over-fitting (to a dataset). We then perform incremental training, in which the last 3 convolutional layers
and the fully connected layers are trained on the entire dataset.
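The choice of which parameters are updated in partial and incremental training can be sketched as a simple selection over layer names. The names `conv1` to `conv13`, `fc1` to `fc3` and the `/weights` and `/bn` suffixes are hypothetical labels for illustration; in a real framework the returned list would be handed to the optimizer as its set of trainable variables.

```python
CONV_LAYERS = [f"conv{i}" for i in range(1, 14)]  # 13 convolutional layers of VGG16
FC_LAYERS = [f"fc{i}" for i in range(1, 4)]       # 3 fully connected layers


def trainable_parameters(num_last_conv=3, num_last_fc=3, all_bn=True):
    """Names of the parameter groups that are updated; everything else stays frozen."""
    trained_layers = (CONV_LAYERS[len(CONV_LAYERS) - num_last_conv:]
                      + FC_LAYERS[len(FC_LAYERS) - num_last_fc:])
    weights = [f"{name}/weights" for name in trained_layers]
    # Either the BN parameters of every layer, or only those of the trained layers.
    bn_layers = CONV_LAYERS + FC_LAYERS if all_bn else trained_layers
    bn = [f"{name}/bn" for name in bn_layers]
    return weights + bn
```

Calling the function with `all_bn=True` or `all_bn=False` gives the two variants, all BN parameters versus only the corresponding BN parameters.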
5 Results and Discussions
The graphs of Loss vs Iterations and Accuracy vs Iterations when the entire network is trained on 100% of
the training set and tested on 100% of the test set are as follows for the Adam and Momentum optimizers. The
accuracy is 85% after about 1000 iterations with the Adam optimizer.
Figure 13: Training BN parameters for entire network using Adam Optimizer
The weights from the previous experiment are used for the following two experiments.
The graphs of Loss vs Iterations and Accuracy vs Iterations for Partial Learning, i.e. training the last 3
convolutional layers and the last 3 fully connected layers along with the BN parameters of all layers on only
75% of the training dataset, are as follows. The accuracy achieved is 80% after about 1000 iterations.
Figure 14: Training 3 Conv + 3 FC and all BN parameters using Adam Optimizer on 75% train
dataset [The image on the right is the magnified version of the left]
The previous experiment was repeated with everything identical, again on only 75% of the training dataset,
except that the BN parameters were trained only for the last layers and not for the entire network. The
results are as follows. The accuracy achieved is 80% after about 1000 iterations.
Figure 15: Training 3 Conv + 3 FC and corresponding BN parameters using Adam Optimizer
on 75% train dataset [The image on the right is the magnified version of the left]
Finally, for incremental training, we train the last few layers incrementally on 100% of the training
dataset. Here too there are two variants, one with all BN parameters and the other with just the
corresponding BN parameters. The weights of the previous experiment are used for initialisation. This
achieves an accuracy of 82% after about 1000 iterations, which indicates that incremental training of the
last few layers yields good accuracy for unseen objects.
Figure 16: Training 3 Conv + 3 FC and LEFT) all BN parameters RIGHT) corresponding BN
parameters using Adam Optimizer on 100% train dataset
As described in section 4.2, we train the entire network on 8 classes, excluding birds and trucks, and the
results are as follows:
Then, the network is incrementally trained on the two unseen classes, birds and trucks, updating only the
last 3 convolutional and 3 fully connected layers and starting from the weights of the previous experiment.
The results obtained are as follows:
Figure 18: Incrementally training the last few layers of the network using Adam Optimizer on
2 unseen classes
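The split between the 8 seen classes and the 2 unseen classes can be sketched as follows. The class order is the standard CIFAR-10 label order; the function name and the use of plain integer label lists are assumptions made for illustration.

```python
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
UNSEEN = ("bird", "truck")


def split_seen_unseen(labels, unseen=UNSEEN):
    """Return (indices of samples from seen classes, indices from unseen classes)."""
    unseen_ids = {CIFAR10_CLASSES.index(name) for name in unseen}
    seen_idx = [i for i, y in enumerate(labels) if y not in unseen_ids]
    unseen_idx = [i for i, y in enumerate(labels) if y in unseen_ids]
    return seen_idx, unseen_idx
```

The first index list would drive the initial training of the whole network, and the second the incremental training of the last few layers.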
6 Conclusions
Incremental training for object detection and localization of unseen objects can be very useful in real-time
scenarios such as video surveillance cameras. It can also be used for online training on new classes, so
that the model can be continuously retrained on the device itself. Apart from this, newer techniques
need to be introduced to improve the overall accuracy. Nevertheless, the techniques presented here are
already useful for applications like video surveillance.
7 References
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object
detection. arXiv preprint arXiv:1506.02640, 2015.
[2] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In Computer Vision and Pattern
Recognition (CVPR), 2017 IEEE Conference on, pages 6517-6525. IEEE, 2017.
[3] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767,
2018.
[4] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition,
localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[5] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C.-C. Jay Kuo. Class-incremental
learning via deep model consolidation. arXiv preprint arXiv:1903.07864v2, 2019.
[6] R. Istrate, A. Cristiano, I. Malossi, C. Bekas, and D. Nikolopoulos. Incremental training of deep
convolutional neural networks. arXiv preprint arXiv:1803.10232, 2018.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
[8] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[9] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013-2016.
[10] Blog on YOLO. [Online]. Available: https://machinethink.net/blog/object-detection/
[11] YOLOv1 Tensorflow. GitHub. [Online]. Available: https://github.com/MingtaoGuo/yolo_v1_v2_tensorflow
[12] Tzutalin. LabelImg. Git code (2015). Available: https://github.com/tzutalin/labelImg
[13] DW2TF. GitHub. [Online]. Available: https://github.com/jinyu121/DW2TF