Machine Learning Engineer Nanodegree: Capstone Proposal
Khalil Henchi
January 22nd, 2020
A. Definition
1. Project Overview
Domain Background
Machine learning is one of the most prominent technology trends of the 21st century so far,
driven largely by the increasing performance of computing hardware.
The use of machine learning in computer vision continues to fuel the curiosity of scientists
and engineers. In fact, scientists have been trying to make machines extract meaningful
information from visual data for about 60 years. The breakthrough that brought computer vision
back to the surface as a hot topic came in 2012, when AlexNet won the ImageNet challenge.
In the machine learning community, the dog breed classification challenge is well known. This
challenge is also available on Kaggle [2].
Udacity lists this project among the possible capstone projects. I chose it because my goal is
to get a job as a computer vision engineer, so this project will be a valuable asset on my CV.
Datasets and Inputs
The dataset for this project is provided by Udacity. We have pictures of dogs and humans. Each
image is identified by a unique id.
We have 8,351 dog images in total. Dog pictures are split into three folders (training,
validation, and test). In each folder, images are sorted by dog breed. There are 133 dog breeds.
Human pictures are sorted by the name of each person. We have 13,233 human pictures in total.
Analyzing the datasets shows that the pictures are taken from many different angles.
In addition, the image dimensions differ, and some pictures contain more than one subject.
2. Problem Statement
The goal of this project is to detect whether a given photo contains a human, a dog, or
neither. If a dog is detected, we predict its breed. If a human is detected, we predict the
dog breed that the person most resembles. If neither a human nor a dog is detected, we show an
error message.
Input images are arbitrary: they have different sizes and are taken from different angles and
at different times of day.
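The three cases above can be sketched as a small decision function. This is a minimal illustration only: the names `describe_image`, `is_dog`, `is_human`, and `predict_breed` are hypothetical placeholders, and the detectors are passed in as plain callables rather than real models.

```python
from typing import Callable


def describe_image(
    image_path: str,
    is_dog: Callable[[str], bool],
    is_human: Callable[[str], bool],
    predict_breed: Callable[[str], str],
) -> str:
    """Return a message following the three cases of the problem statement."""
    if is_dog(image_path):
        return f"Dog detected. Predicted breed: {predict_breed(image_path)}"
    if is_human(image_path):
        return f"Human detected. Most resembling breed: {predict_breed(image_path)}"
    return "Error: neither a dog nor a human was detected."


# Usage with stand-in detectors (hypothetical placeholders, not real models).
message = describe_image(
    "sample.jpg",
    is_dog=lambda path: False,
    is_human=lambda path: True,
    predict_breed=lambda path: "Beagle",
)
print(message)
```

Checking for a dog first means that a photo containing both a dog and a human is treated as a dog photo; this ordering is an assumption of the sketch.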
3. Evaluation Metrics
Depending on the dataset, accuracy may not be a good metric for a classification problem, for
example when the classes are strongly imbalanced. In that case, precision and recall can be
better evaluation metrics. The F1 score is another option, as it combines precision and recall.
The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and
recall. It is given by the following formula:
\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
I checked the dog breed dataset and the classes (breeds) are relatively balanced, so a simple
accuracy score is considered representative in this project.
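As a minimal sketch of how these metrics relate, the following pure-Python example computes accuracy and a macro-averaged F1 score on a tiny made-up set of labels. The label lists and the choice of macro averaging are assumptions for illustration, not results from this project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only.
y_true = ["beagle", "beagle", "poodle", "poodle", "boxer", "boxer"]
y_pred = ["beagle", "poodle", "poodle", "poodle", "boxer", "beagle"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

When classes are roughly balanced, as reported for the breed dataset, the two numbers tend to tell a similar story, which is why plain accuracy is a reasonable choice here.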
B. Analysis
1. Data Exploration
The dataset for this project is provided by Udacity. We have pictures of dogs and humans. Each
image is identified by a unique id.
We have 8,351 dog images in total. Dog pictures are split into three folders (training,
validation, and test). In each folder, images are sorted by dog breed. There are 133 dog breeds.
Human pictures are sorted by the name of each person. We have 13,233 human pictures in total.
Analyzing the datasets shows that the pictures are taken from many different angles.
In addition, the image dimensions differ, and some pictures contain more than one subject:
several humans, several dogs, or both a human and a dog.
2. Exploratory Visualization
The following image shows some samples from the datasets.
Data set samples
There are:
3. Algorithms and Techniques
- Deep learning:
Deep learning (also known as deep structured learning or hierarchical learning) is part of
a broader family of machine learning methods based on learning data representations, as
opposed to task-specific algorithms.
A convolutional neural network (CNN or ConvNet) is a class of deep, feed-forward
artificial neural networks that has successfully been applied to analyzing visual imagery.
A CNN consists of an input layer and an output layer, as well as multiple hidden layers. The
hidden layers are convolutional, pooling, or fully connected. Given an input, the CNN
learns by itself which features it has to detect. We do not specify the initial
values of the features or the kinds of patterns to detect.
Various Layers:
● Convolutional - Also referred to as the Conv. layer, it forms the basis of the
CNN and performs its core operation: it applies the convolution operation over the
input, and its filters are learned during training.
● Pooling layers - Pooling layers reduce the spatial dimensions (width x height)
of the input volume for the next convolutional layer. They do not affect the depth
dimension of the volume.
● Fully connected layer - The fully connected or Dense layer is configured
exactly the way its name implies. It is fully connected with the output of the previous
layer. Fully connected layers are typically used in the last stages of the CNN to
connect to the output layer and construct the desired number of outputs.
● Dropout layer - Dropout is a regularization technique for reducing overfitting
in neural networks by preventing complex co-adaptations on training data. It is a very
efficient way of performing model averaging with neural networks. The term "dropout"
refers to dropping out units (both hidden and visible) in a neural network.
● Flatten - Flattens the output of the convolutional layers to feed into the
Dense layers.
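To make the data flow through these layers concrete, here is a toy NumPy sketch that applies one convolution filter, max pooling, flattening, and a dense layer to a random single-channel "image". It is an illustration under simplifying assumptions (one filter, no padding, random weights, no training) and is not the architecture used in this project.

```python
import numpy as np


def conv2d(image, kernel):
    """Stride-1 convolution without padding (cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((8, 8))                 # toy single-channel image
kernel = rng.standard_normal((3, 3))       # one 3x3 filter (learned in a real CNN)

features = conv2d(image, kernel)           # shape (6, 6)
pooled = max_pool2d(features)              # shape (3, 3): width and height halved
flat = pooled.flatten()                    # shape (9,)
dense_weights = rng.standard_normal((flat.size, 133))
scores = flat @ dense_weights              # shape (133,): one score per breed
print(features.shape, pooled.shape, flat.shape, scores.shape)
```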
- Activation Functions:
In a CNN, the activation function of a node defines the output of that node given an
input or set of inputs.
Some activation functions are:
● The softmax function squashes the output of each unit to be between 0 and 1,
just like a sigmoid function. It also divides each output such that the total sum of the
outputs is equal to 1.
● A ReLU (rectified linear unit) outputs 0 if the input is less than 0,
and the input itself otherwise, i.e., if the input is greater than 0, the output is
equal to the input.
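Both functions are short enough to write out directly. The following NumPy sketch is for illustration; the example input values are arbitrary.

```python
import numpy as np


def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return np.maximum(0, x)


def softmax(x):
    """Map a vector of scores to positive values that sum to 1."""
    shifted = x - np.max(x)  # subtracting the max avoids overflow; result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([2.0, -1.0, 0.5])  # arbitrary example scores
print(relu(logits))
print(softmax(logits))
```

In a classifier, softmax is typically applied to the final layer so that the outputs can be read as class probabilities, while ReLU is used between hidden layers.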
- Transfer Learning:
In transfer learning, we take the knowledge learned by one model and pass it to a new deep
learning model. We take a pre-trained neural network and adapt it to a new task with a
different dataset.
For this problem we use the “ResNet-101” neural network.
● ResNet-101 is a convolutional neural network that is trained on more than a million images
from the ImageNet database. The network is 101 layers deep and can classify images into 1000
object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network
has learned rich feature representations for a wide range of images. The network has an image
input size of 224-by-224.
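The core idea of transfer learning (keep the pre-trained weights fixed and train only a new classification head) can be shown with a toy NumPy sketch. Everything here is an assumption for illustration: a fixed random matrix stands in for the pre-trained backbone, the data are synthetic, and the sizes are arbitrary. It is not the ResNet-101 pipeline used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_features, n_classes = 200, 20, 16, 3

# Stand-in for a pre-trained backbone: fixed weights that are never updated.
backbone = rng.standard_normal((n_inputs, n_features))
backbone_before = backbone.copy()


def extract_features(x):
    """Frozen feature extractor: linear map, ReLU, then unit-length rows."""
    feats = np.maximum(0, x @ backbone)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))


# Synthetic data for the "new" task.
x = rng.standard_normal((n_samples, n_inputs))
labels = np.argmax(x[:, :n_classes], axis=1)
one_hot = np.eye(n_classes)[labels]

features = extract_features(x)              # computed once; backbone is frozen
head = np.zeros((n_features, n_classes))    # new, trainable classification head

initial_loss = cross_entropy(softmax(features @ head), labels)
for _ in range(200):
    probs = softmax(features @ head)
    grad = features.T @ (probs - one_hot) / n_samples
    head -= 0.5 * grad                      # only the head is updated
final_loss = cross_entropy(softmax(features @ head), labels)
print(initial_loss, final_loss)
```

Because only the small head is trained, far fewer parameters need to be learned from the new dataset than when training a whole network from scratch.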
4. Benchmark Model
For the benchmark model, we will use the algorithms outlined in the paper [3]. The paper
describes five different algorithms with the following accuracies.
Once the model architecture was defined, the model was trained on the training set with a
validation split of 20%, and the best weights were saved during the training process. After
training, predictions were made on the test set.
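The 20% validation split and the "keep the best weights" logic can be sketched in plain Python. The helper names `train_val_split` and `best_epoch` and the per-epoch accuracies are hypothetical; a real training loop would save the model weights at the marked line.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle a copy of the items and hold out val_fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def best_epoch(val_accuracies):
    """Return (epoch, accuracy) for the highest validation accuracy."""
    best = (None, float("-inf"))
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best[1]:
            best = (epoch, acc)  # in a real run: save the model weights here
    return best


train_ids, val_ids = train_val_split(range(100))
print(len(train_ids), len(val_ids))

# Hypothetical validation accuracies, one per epoch.
print(best_epoch([0.41, 0.55, 0.62, 0.60, 0.58]))
```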
C. Methodology
1. Data Preprocessing
Our exploratory visualization shows that the samples do not all have the same size.
Most neural networks expect input images of a fixed size. Therefore, we need to apply some
preprocessing to the dataset.
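A minimal sketch of such preprocessing, assuming a 224 x 224 target size (the ResNet-101 input size) and the commonly used ImageNet channel statistics, is shown below in NumPy. The nearest-neighbor resize and the normalization constants are assumptions for illustration; the report does not state which transforms were actually applied.

```python
import numpy as np

# Commonly used ImageNet channel statistics (assumed here, not stated in the report).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def resize_nearest(image, size=224):
    """Resize an H x W x 3 array to size x size with nearest-neighbor sampling."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return image[rows[:, None], cols]


def preprocess(image, size=224):
    """Resize, scale pixel values to [0, 1], and normalize each channel."""
    resized = resize_nearest(image, size).astype(np.float64) / 255.0
    return (resized - IMAGENET_MEAN) / IMAGENET_STD


rng = np.random.default_rng(0)
small = rng.integers(0, 256, size=(150, 300, 3), dtype=np.uint8)  # random stand-in images
large = rng.integers(0, 256, size=(640, 480, 3), dtype=np.uint8)
print(preprocess(small).shape, preprocess(large).shape)
```

Whatever their original dimensions, both images end up with the same shape, which is what a fixed-size network input requires.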
2. Implementation
Our first step in this project was to create a CNN from scratch. Random chance is 1 in 133
(about 0.75%), so anything above 1% is better than random. However, according to the notes in
the Jupyter notebook, we should reach an accuracy greater than 10%.
For the basic CNN, I chose the following architecture:
CNN architecture
I used max pooling to reduce the dimensionality. Pooling makes the CNN run faster and also
helps reduce overfitting.
In the next section, we apply transfer learning to use an already-established architecture
and improve the results.
3. Refinement
The CNN created from scratch did not perform well: its accuracy was about 11%. This is
better than random, but there is a lot of room for improvement. First, the VGG16 model was
used. I then switched to a different pre-trained model, ResNet-101, given its high accuracy.
We modified this architecture by adding a fully connected head that combines linear layers
with Dropout regularization and ends in a fully connected (dense) output layer.
After training and testing the model, we get significantly improved results. With fewer
epochs, we reach 80% accuracy.
I added a Dropout layer to reduce overfitting. The final layer of the model is used to predict the
category (one of the 133 dog breeds).
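The effect of a Dropout layer can be illustrated with a short NumPy sketch of "inverted" dropout, the variant used by common deep learning frameworks. The function name, the drop probability of 0.5, and the input values are assumptions for illustration.

```python
import numpy as np


def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x  # at evaluation time the layer does nothing
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p      # True for units that are kept
    return x * mask / (1.0 - p)          # rescale so the expected value is unchanged


activations = np.ones((4, 5))
train_out = dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(activations, p=0.5, training=False)
print(train_out)
print(eval_out)
```

During training, each forward pass uses a different random mask, so the network cannot rely on any single unit; at evaluation time all units are active.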
D. Results
1. Model Evaluation and Validation
The “from scratch” dog breed classifier has an accuracy of 11%, whereas our architecture with
transfer learning has an accuracy of 80%.
In both cases, the accuracy is higher than the defined benchmark.
On the sample images tested, the model correctly identified whether the subject was a dog or
a human every time, and it also matched the dog breeds appropriately. (These results can be
seen in the Jupyter notebook.)
2. Justification
The scratch-made CNN likely performed poorly because it was given very little training data
compared to pre-trained architectures like ResNet.
Beyond the greater complexity of its architecture, ResNet was also trained on vastly more
images than my scratch-made architecture.
I think that with more data the scratch-made CNN could perform better.
E. References
1. https://en.wikipedia.org/wiki/Convolutional_neural_network
2. https://www.kaggle.com/c/dog-breed-identification/overview
3. "Using Convolutional Neural Networks to Classify Dog Breeds" (Hsu, 2012)
4. ImageNet. http://www.image-net.org
5. https://www.pyimagesearch.com/2018/12/31/keras-conv2d-and-convolutional-layers/