Deep Learning
Chapter 1
What will we cover
● What is Machine Learning
● Fundamental concepts involved in Machine Learning
● Four Branches of Machine Learning
● What is Deep Learning
● How it works
● What it can achieve
Text Book / Reference Book
● Deep Learning with Python, Authored by FRANÇOIS CHOLLET
AI, Machine Learning & Deep Learning
Artificial Intelligence
The idea of AI was born when scientists started to think about programming computers to do tasks that only a human can do. For a long time, Symbolic AI ruled the field: an approach in which we maintain a large set of hand-crafted rules. Symbolic AI had clear limitations in solving perception problems, such as recognizing or tagging an image, or translating one language into another.
How good is an AI algorithm? (The Turing Test)
Machine Learning
The frustration of crafting hard-coded rules made scientists wonder: what if a program could infer, by itself, the rules that describe the answers? This thought pioneered the field of Machine Learning.
[Diagram] Classical programming: rules + data in, answers out. Machine learning: data + answers in, rules out.
Essential Things in Machine Learning
For machine learning, we need three things:
● Input data points
● Examples of the expected output
● A way to measure whether the algorithm is doing a good job, so its results can be fed back to adjust it
Reference:
https://www.prowesscorp.com/whats-the-difference-between-artificial-intelligence-ai-machine-learning-and-deep-learning/
Deep Learning
In deep learning, these layered representations are (almost always) learned
via models called neural networks, structured in literal layers stacked on top
of each other.
Each layer is updated to serve both the representational needs of the layer above it and those of the layer below it.
Why Deep Learning? Why Now?
The two key ideas of deep learning for computer vision (convolutional neural networks and backpropagation) were already well understood in 1989. The Long Short-Term Memory (LSTM) algorithm, which is fundamental to deep learning for time series, was developed in 1997 and has barely changed since.
So why did deep learning only take off after 2012?
● Hardware
● Data
● Algorithmic advances
Hardware
In the past few years, the introduction of GPUs, and of vendor libraries for running complex computations on them, made deep learning shine: complex tasks over large amounts of data can now be solved in considerably less time. More recently, Google introduced TPUs, which are designed specifically for deep learning tasks and can be up to 10x faster than a GPU.
Data
When it comes to data, in addition to the exponential progress in storage
hardware over the past 20 years (following Moore’s law), the game changer
has been the rise of the internet, making it feasible to collect and distribute
very large datasets for machine learning. Today, large companies work with
image datasets, video datasets, and natural-language datasets that couldn’t
have been collected without the internet. User-generated image tags on Flickr,
for instance, have been a treasure trove of data for computer vision. So are
YouTube videos. And Wikipedia is a key dataset for natural-language
processing.
Algorithms
In addition to hardware and data, until the late 2000s, we were missing a
reliable way to train very deep neural networks. As a result, neural networks
were still fairly shallow, using only one or two layers of representations; thus,
they weren’t able to shine against more-refined shallow methods such as
SVMs and random forests. The key issue was that of gradient propagation
through deep stacks of layers. The feedback signal used to train neural
networks would fade away as the number of layers increased.
A New Wave of Investment
● AI and machine learning have the potential to create an additional $2.6T
in value by 2020 in Marketing and Sales, and up to $2T in manufacturing
and supply chain planning.
● Gartner predicts the business value created by AI will reach $3.9T in 2022.
● IDC predicts worldwide spending on cognitive and Artificial Intelligence
systems will reach $77.6B in 2022.
Reference:
https://www.forbes.com/sites/louiscolumbus/2019/03/27/roundup-of-machine-learning-forecasts-and-market-estimates-2019/#399b12c07695
The Democratization Of Deep Learning
The introduction of new tools and frameworks for languages that support deep learning made it approachable to the common developer with knowledge of a high-level scripting language like Python.
Will it last?
Deep learning has several properties that justify its status as an AI revolution,
and it’s here to stay. We may not be using neural networks two decades from
now, but whatever we use will directly inherit from modern deep learning and
its core concepts.
● Simplicity
● Scalability
● Versatility and reusability
Simplicity
Deep learning removes the need for feature engineering, replacing complex, brittle, engineering-heavy pipelines with simple, end-to-end trainable models that are typically built using only five or six different tensor operations.
Scalability
Deep learning is highly amenable to parallelization on GPUs or TPUs, so it can
take full advantage of Moore’s law. In addition, deep-learning models are
trained by iterating over small batches of data, allowing them to be trained on
datasets of arbitrary size. (The only bottleneck is the amount of parallel
computational power available, which, thanks to Moore's law, is a fast-moving barrier.)
Versatility And Reusability
Unlike many prior machine-learning approaches, deep-learning models can be trained on additional data without restarting from scratch, making them viable for continuous online learning, an important property for very large production models. Furthermore, trained deep-learning models are repurposable and thus reusable: for instance, it's possible to take a deep-learning model trained for image classification and drop it into a video-processing pipeline. This allows us to reinvest previous work into increasingly complex and powerful models. It also makes deep learning applicable to fairly small datasets.
Mathematical Building Blocks of Neural Networks
Chapter 2
What will we cover
● A basic example of Neural Network
● What is a Tensor
● Tensor Operations
● How Neural Networks Learn
● Backpropagation
● Gradient Descent
Classes And Labels
In machine learning, a category in a classification problem is called a class.
Data points are called samples. The class associated with a specific sample is
called a label.
Classification Problem
A classification problem is when the output variable is a category, such as
“red” or “blue” or “disease” and “no disease”.
Reference: https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/
INTRODUCTION OF GOOGLE COLAB
(DEMO)
SETTING UP LOCAL ENVIRONMENT
(DEMO)
Testing Environment
import tensorflow as tf
print(tf.__version__)
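Optionally, we can also check whether TensorFlow sees a GPU; tf.config.list_physical_devices is part of the TF 2.x API, and the output depends on your machine:

print(tf.config.list_physical_devices('GPU'))  # [] means no GPU is visible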
Fashion MNIST class labels:
0 T-shirt/top    5 Sandal
1 Trouser        6 Shirt
2 Pullover       7 Sneaker
3 Dress          8 Bag
4 Coat           9 Ankle boot
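For context, a minimal sketch of loading the dataset behind these labels; it assumes the standard Keras Fashion MNIST loader, and the variable names match the fit call shown later:

import tensorflow as tf

# Fashion MNIST: 60,000 training and 10,000 test grayscale 28x28 images
(training_images, training_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Scale pixel values from the 0-255 range into the 0-1 range
training_images = training_images / 255.0
test_images = test_images / 255.0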
Flatten: serves as the input layer and flattens the rectangular picture array into a 1D array.
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Scalars (0D Tensors)
A tensor that contains only one number is called a scalar, or 0D tensor. Following is a Numpy scalar:
import numpy as np
x = np.array(12)
print(x.ndim) # => 0
Vectors (1D Tensors)
An array of numbers is called a vector, or 1D tensor. A 1D tensor is said to
have exactly one axis. Following is a Numpy vector:
x = np.array([12, 3, 6, 14])
print(x.ndim) # => 1
Matrices (2D Tensors)
An array of vectors is a matrix, or 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers. This is a Numpy matrix:
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])
print(x.ndim) # => 2
3D Tensors and Higher-Dimensional Tensors
If you pack such matrices in a new array, you obtain a 3D tensor, which you
can visually interpret as a cube of numbers. Following is a Numpy 3D tensor:
x = np.array([[[5, 78, 2, 34, 0],
[6, 79, 3, 35, 1],
[7, 80, 4, 36, 2]],
[[5, 78, 2, 34, 0],
[6, 79, 3, 35, 1],
[7, 80, 4, 36, 2]],
[[5, 78, 2, 34, 0],
[6, 79, 3, 35, 1],
[7, 80, 4, 36, 2]]])
print(x.ndim) # => 3
Key Attributes of a Tensor
A tensor is defined by three key attributes: its number of axes (rank, the ndim in Numpy), its shape (a tuple describing how many dimensions the tensor has along each axis), and its data type (dtype).
Broadcasting
When two tensors of different shapes are combined, the smaller tensor is broadcast to match the shape of the larger one in two steps:
1. Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor.
2. The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor.
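A minimal NumPy sketch of these two steps; NumPy performs them implicitly when the shapes differ:

import numpy as np

x = np.random.random((3, 5))   # 2D tensor, shape (3, 5)
y = np.random.random((5,))     # 1D tensor, shape (5,)

# y is broadcast: a new axis is added (shape (1, 5)), then y is repeated
# 3 times along it to match x's shape (3, 5) before the addition
z = x + y
print(z.shape)  # (3, 5)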
Tensor dot
The dot operation, also called a tensor product (not to be confused with an
elementwise product) is the most common, most useful tensor operation.
Contrary to element-wise operations, it combines entries in the input tensors.
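A short NumPy sketch contrasting the two; the shapes here are illustrative:

import numpy as np

x = np.random.random((3, 5))
y = np.random.random((5,))

z = np.dot(x, y)   # tensor product: combines entries, result shape (3,)
w = x * x          # element-wise product: shape stays (3, 5)
print(z.shape, w.shape)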
Tensor reshaping
A third type of tensor operation that’s essential to understand is tensor
reshaping. Reshaping a tensor means rearranging its rows and columns to
match a target shape. Naturally, the reshaped tensor has the same total
number of coefficients as the initial tensor.
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])    # shape (3, 2)
x = x.reshape((6, 1))       # shape (6, 1), same 6 coefficients
output = relu(dot(W, input) + b)
In this expression, W and b are tensors that are attributes of the layer. Initially, these weight matrices are filled with small random values (a step called random initialization). What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called training, is basically the learning that machine learning is all about.
Gradient-Based Optimization
This happens within what's called a training loop, which works as follows. Repeat these steps in a loop, as long as necessary:
1. Draw a batch of training samples x and corresponding targets y.
2. Run the network on x (a forward pass) to obtain predictions y_pred.
3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
4. Update all weights of the network in a way that slightly reduces the loss on this batch.
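As a low-level sketch of one pass through this loop, here is one training step written with TensorFlow's GradientTape; the model, loss_fn, and learning_rate names are our own placeholders:

import tensorflow as tf

def training_step(model, loss_fn, x_batch, y_batch, learning_rate=1e-3):
    # Forward pass: run the network on the batch and compute the loss,
    # recording the operations so gradients can be derived from them
    with tf.GradientTape() as tape:
        y_pred = model(x_batch)
        loss = loss_fn(y_batch, y_pred)
    # Backward pass: gradients of the loss with regard to the weights
    gradients = tape.gradient(loss, model.trainable_weights)
    # Move each weight a small step against its gradient to reduce the loss
    for g, w in zip(gradients, model.trainable_weights):
        w.assign_sub(learning_rate * g)
    return loss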
Now we understand that this network consists of a chain of two Dense layers,
that each layer applies a few simple tensor operations to the input data, and
that these operations involve weight tensors. Weight tensors, which are
attributes of the layers, are where the knowledge of the network persists.
Looking Back At Our First Example
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
This was the network-compilation step: Now we understand that the loss
function is sparse_categorical_crossentropy that’s used as a feedback signal for
learning the weight tensors, and which the training phase will attempt to
minimize. We also know that this reduction of the loss happens via minibatch
stochastic gradient descent. The exact rules governing a specific use of
gradient descent are defined by the Adam optimizer passed as the first
argument.
Looking Back At Our First Example
Finally, this was the training loop:
model.fit(training_images, training_labels, epochs=5, batch_size=128)
Now we understand what happens when you call fit: the network will start to
iterate on the training data in mini-batches of 128 samples, 5 times over (each
iteration over all the training data is called an epoch). At each iteration, the
network will compute the gradients of the weights with regard to the loss on
the batch, and update the weights accordingly. At this point, we know most of
what there is to know about neural networks.
Getting Started with Neural Networks
Chapter 3
Anatomy of a neural network
A neural network mainly comprises the following objects:
● Layers, which are combined into a network (or model)
● The input data and corresponding targets
● The loss function, which defines the feedback signal used for learning
● The optimizer, which determines how learning proceeds
Dense Layers or Fully Connected Layers are used for 2D tensors of shape (samples, features).
The notion of layer compatibility refers to the fact that every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape. Consider the following example:
layer = tf.keras.layers.Dense(32, input_shape=(784,))
This layer accepts input tensors with 784 features along axis 1 and an unspecified / any number of samples along axis 0 (the batch axis). Its output has size 32 at axis 1.
Layers: the building blocks of deep learning
In Keras, the Layer object has a built-in ability to adapt to the shape of its input data, so the developer doesn't have to worry about it. In the example we discussed previously,
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
Here we can observe that every layer defines what it will output but not what
it takes as input.
Models: networks of layers
A deep-learning model is a directed, acyclic graph of layers. The most
common instance is a linear stack of layers, mapping a single input to a single
output. But as you move forward, you’ll be exposed to a much broader variety
of network topologies. Some common ones include the following:
● Two-branch networks
● Multihead networks
● Inception blocks
Picking the right network architecture is more an art than a science; and
although there are some best practices and principles you can rely on, only
practice can help you become a proper neural-network architect.
Loss functions and optimizers:
Once the network architecture is defined, you still have to choose two more things:
● A loss function (objective function): the quantity that will be minimized during training
● An optimizer: determines how the network will be updated based on the loss function
Currently, Keras uses TensorFlow as its default backend and is tightly integrated with TensorFlow 2.0 as the tf.keras module.
Via TensorFlow (or Theano, or CNTK), Keras is able to run seamlessly on both CPUs and GPUs. When running on CPU, TensorFlow is itself wrapping a low-level library for tensor operations called Eigen (http://eigen.tuxfamily.org).
Keras, TensorFlow, Theano, and CNTK
On GPU, TensorFlow wraps a library of well-optimized deep-learning operations called the NVIDIA CUDA Deep Neural Network library (cuDNN).
Developing with Keras: a quick overview
We’ve already seen one example of a Keras model: the Fashion MNIST
example. The typical Keras workflow looks just like that example:
from tensorflow import keras

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.RMSprop(),
              metrics=['accuracy'])
Classifying movie reviews (Loading Dataset)
import tensorflow as tf
import numpy as np

(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.imdb.load_data(num_words=10000)
print(train_data[0])
print(train_labels[0])
# If the training data comes from movie reviews, it must have been text;
# how come it is numbers? Each review has been encoded as a sequence of
# word indices. The argument num_words restricts the function to load
# only the 10,000 most frequently used words.
Contd. (Loading Dataset)
max([max(sequence) for sequence in train_data])
# Because num_words restricts the function to the 10,000 most used words,
# executing the snippet above displays the max word index in train_data
# (no index will exceed 9,999).
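As a side note, the integer sequences can be decoded back into words via the word index that ships with the dataset; a brief sketch (indices are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown):

word_index = tf.keras.datasets.imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in train_data[0])
print(decoded_review)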
Contd. (Preparing the data)
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
Contd. (Building the network)
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Contd. (Building the network)
Because we are facing a binary classification problem and the output of our
network is a probability (we end our network with a single-unit layer with a
sigmoid activation), it’s best to use the binary_crossentropy loss. It isn’t the only
viable choice: we could use, for instance, mean_squared_error . But crossentropy
is usually the best choice when we’re dealing with models that output
probabilities. Crossentropy is a quantity from the field of Information Theory that
measures the distance between probability distributions or, in this case, between
the ground-truth distribution and our predictions. The next slide shows the step where we configure the model with the rmsprop optimizer and the binary_crossentropy loss function. Note that we'll also monitor accuracy during training.
Contd. (Compile the network)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
Contd. (Predictions on new data)
model.predict(x_test)
# array([[0.00579366],
#        [1.        ],
#        [0.4512035 ],
#        ...,
#        [0.00185409],
#        [0.00611657],
#        [0.44180435]], dtype=float32)
Classifying newswires
A multiclass classification example
In this section, we’ll build a network to classify Reuters newswires into 46
mutually exclusive topics. Because we have many classes, this problem is an
instance of multi-class classification.
Single-label, multiclass classification
If each data point should be classified into one and only one category, the
problem is more specifically an instance of single-label, multiclass classification.
Multilabel, multiclass classification
If each data point could belong to multiple categories (in this case, topics), this
case is generally known as multilabel, multiclass classification problem.
The Reuters dataset
We’ll work with the Reuters dataset, a set of short newswires and their topics,
published by Reuters in 1986. It’s a simple, widely used toy dataset for text
classification. There are 46 different topics; some topics are more represented
than others, but each topic has at least 10 examples in the training set. Like IMDB
and MNIST, the Reuters dataset comes packaged as part of Keras.
Classifying newswires (import modules)
import tensorflow as tf
import numpy as np

(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.reuters.load_data(num_words=10000)
print(train_data[0])
print(train_labels[0])
Classifying newswires (Loading Dataset)
As with the IMDB dataset, the argument num_words=10000 restricts the data to
the 10,000 most frequently occurring words found in the data. We have 8,982
training examples and 2,246 test examples:
print(len(train_data)) # 8982
print(len(test_data)) # 2246
Classifying newswires (Preparing the data)
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
Classifying newswires (Preparing the data)
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)
Classifying newswires (Preparing the data)
Instead of the approach on the last slide, we can create one-hot encoded labels with this simple built-in utility:
one_hot_train_labels = tf.keras.utils.to_categorical(train_labels)
one_hot_test_labels = tf.keras.utils.to_categorical(test_labels)
Classifying newswires (Preparing Validation Set)
We set apart the first 1,000 samples of the training data to use as a validation set:
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
Classifying newswires (Building the network)
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
Observe the difference from the network we created for the IMDB task: besides the layer sizes, the output layer has a softmax activation function instead of sigmoid.
Classifying newswires (Training the network)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))
Contd. (Plotting the training and validation loss)
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()
Contd. (Plotting the training and validation accuracy)
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Contd. (Predictions on new data)
predictions = model.predict(x_test)
predictions[1].sum() # 1.000000
predictions[1].argmax() # 14
Contd. (Using integer labels instead of one-hot encoding)
Another way to encode the labels would be to cast them as an integer tensor, like
this:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
metrics=['acc'])
Importance of Layer Size
Because there are 46 output classes, intermediate layers should have at least 46 units. Any intermediate layer with fewer than 46 units creates an information bottleneck: relevant information is permanently dropped, and accuracy falls.
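To see this bottleneck in practice, one could shrink an intermediate layer well below 46 and watch validation accuracy fall; a sketch reusing the models/layers imports from earlier:

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))   # bottleneck: only 4 units
model.add(layers.Dense(46, activation='softmax'))
# Training this network noticeably drops validation accuracy, because
# 46-way class information must squeeze through a 4-dimensional layer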
Predicting house prices: Regression example
K Fold Cross Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data
sample.
import numpy as np
import tensorflow as tf

(train_data, train_targets), (test_data, test_targets) = \
    tf.keras.datasets.boston_housing.load_data()
print(train_data.shape)   # (404, 13)
print(test_data.shape)    # (102, 13)
print(train_targets)
Regression example (Normalizing the data)
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std
Regression example (Building the network)
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
Regression example (K-Fold Validation)
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
all_mae_histories = []

for i in range(k):
    print('processing fold #', i)
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)
    model = build_model()
Regression example (K-Fold Validation)
    history = model.fit(partial_train_data, partial_train_targets,
                        epochs=num_epochs, batch_size=1, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
    mae_history = history.history['mae']
    all_mae_histories.append(mae_history)
Regression example (Plotting Validation Error)
average_mae_history = [np.mean([x[i] for x in all_mae_histories])
                       for i in range(num_epochs)]
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
Regression example (Final Model Training)
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
print(test_mae_score)
Fundamentals of Machine Learning
Chapter 4
Four branches of machine learning
Supervised learning
This is by far the most common case. It consists of learning to map input data to
known targets (also called labels), given a set of examples (often labeled by
humans). All four examples we have encountered in this book so far are
canonical examples of supervised learning.
Unsupervised learning
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses.
Classification and regression glossary
Target—The truth; what our model should ideally have predicted, according to an external source of data
Prediction error or loss value—A measure of the difference between our
model’s prediction and the target
Vector regression—A task where the target is a set of continuous vector values
Classification and regression glossary
Mini-batch or batch—A small set of samples (typically between 8 and 128) that
are processed simultaneously by the model. The number of samples is often a
power of 2, to facilitate memory allocation on GPU. When training, a mini-batch is
used to compute a single gradient-descent update applied to the weights of the
model.
Evaluating machine-learning models
Generalization
Generalization usually refers to an ML model's ability to perform well on new, unseen data rather than just the data it was trained on. The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.
Underfitting
Underfitting occurs when a model is too simple — informed by too few features
or regularized too much — which makes it inflexible in learning from the dataset.
Simple learners tend to have less variance in their predictions but more bias towards wrong outcomes.
Overfitting
Overfitting happens when a model learns the detail and noise in the training data
to the extent that it negatively impacts the performance of the model on new
data. This means that noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data, which negatively impacts the model's ability to generalize.
Underfitting vs Overfitting vs Generalization
Underfit Model. A model that fails to sufficiently learn the problem; it performs poorly on the training dataset and does not perform well on a holdout sample.
Overfit Model. A model that learns the training dataset too well; it performs well on the training dataset but does not perform well on a holdout sample.
Good Fit Model. A model that suitably learns the training dataset and generalizes well to the holdout dataset.
Training, Validation and Testing
Evaluating a model generally boils down to splitting the available data into three
sets: training, validation, and test. We train on the training data and evaluate our
model on the validation data. Once our model is ready for prime time, we test it
one final time on the test data. Training a model involves tuning its hyperparameters based on the validation loss/error. Every time we tune the model to perform well on the validation data, we leak some information about that data into the model. Hence, once we achieve the desired validation accuracy, we evaluate the model on the test dataset, which the model has not encountered during training and validation.
Splitting Techniques For Dataset
SIMPLE HOLD-OUT VALIDATION: As we have seen in previous slides, in this technique we split the dataset into three chunks: one for training, one for validation, and the last one for testing.
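A minimal NumPy sketch, assuming data is an array holding all available samples:

import numpy as np

np.random.shuffle(data)  # shuffle first, for representativeness

num_validation_samples = 10000
validation_data = data[:num_validation_samples]
training_data = data[num_validation_samples:]
# Train on training_data, tune on validation_data, and keep a separate,
# untouched test split for the final evaluation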
Splitting Techniques For Dataset
K- FOLD VALIDATION: With this approach, we split our data into K partitions of
equal sizes. For each partition i , train the model on the remaining K – 1
partitions, and evaluate it on partition i . Our final score is then the average of the
K scores obtained. This method is helpful when the performance of our model
shows significant variance based on our train-test split. Like hold-out validation,
this method doesn’t exempt us from using a distinct validation set for model
calibration / testing
Splitting Techniques For Dataset
ITERATED K- FOLD VALIDATION WITH SHUFFLING: This one is for situations in
which we have relatively little data available and we need to evaluate our model
as precisely as possible. This approach has been found extremely helpful in
Kaggle competitions. It consists of applying K-fold validation multiple times,
shuffling the data every time before splitting it K ways. The final score is the
average of the scores obtained at each run of K-fold validation. Note that we end
up training and evaluating P × K models (where P is the number of iterations we
used), which can be very expensive.
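A sketch of the procedure, where get_model and train_and_evaluate are hypothetical helpers and data is assumed to be a NumPy array:

import numpy as np

P, K = 3, 4                      # P shuffling iterations, K folds each
scores = []
for p in range(P):
    np.random.shuffle(data)      # reshuffle before every K-fold run
    fold_size = len(data) // K
    for k in range(K):
        validation = data[k * fold_size:(k + 1) * fold_size]
        training = np.concatenate(
            [data[:k * fold_size], data[(k + 1) * fold_size:]])
        model = get_model()      # hypothetical helper
        scores.append(train_and_evaluate(model, training, validation))
final_score = np.mean(scores)    # average over all P x K runs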
Splitting Techniques (Things to keep in mind)
Data representativeness—Our training set and test set must be representative of the data at hand; they must contain examples of all the classes. For this
reason, we must randomly shuffle our data before splitting it into training and
test sets.
The arrow of time—If we’re trying to predict the future given the past (for
example, tomorrow’s weather, stock movements, and so on), we should not
shuffle our data before splitting it, because doing so will create a temporal leak:
our model will effectively be trained on data from the future. In such situations,
we should always make sure all data in our test set is posterior to the data in the
training set.
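A two-line sketch of such a split, assuming data is sorted chronologically:

split = int(len(data) * 0.8)
train_data = data[:split]   # the past
test_data = data[split:]    # the future: posterior to all training data
# Note: no shuffling before the split, to avoid a temporal leak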
Splitting Techniques (Things to keep in mind)
Redundancy in data—If some data points in our data appear twice (fairly common with real-world data), then shuffling the data and splitting it into a training set and a validation set will result in redundancy between the training and validation sets. In effect, we'll be testing on part of our training data, which is the worst thing anyone can do! Make sure our training and validation sets are disjoint.
Data preprocessing, feature engineering, and feature learning
Data preprocessing for neural networks
Data preprocessing aims at making the raw data at hand more amenable to neural networks. This includes vectorization, normalization, handling missing values, and feature extraction.
Data preprocessing (Vectorization)
All inputs and targets in a neural network must be tensors of floating-point data (or, in specific cases, tensors of integers). Whatever data we need to process (sound, images, text), we must first turn it into tensors; this step is called data vectorization.
Data preprocessing (Normalization)
In general, it isn’t safe to feed into a neural network data that has large values (for
example, multidigit integers, which are much larger than the initial values taken
by the weights of a network) or data that is heterogeneous (for example, data
where one feature is in the range 0–1 and another is in the range 100–200). Doing so can trigger large gradient updates that will prevent the network from converging.
Data preprocessing (Normalization)
To make learning easier for our network, our data should have the following characteristics:
● Take small values: typically, most values should be in the 0–1 range.
● Be homogenous: that is, all features should take values in roughly the same range.
Data preprocessing (Normalization)
Additionally, the following stricter normalization practice is common and can
help, although it isn’t always necessary (for example, we didn’t do this in the digit-
classification example):
x -= x.mean(axis=0)  # zero-center each feature
x /= x.std(axis=0)   # give each feature unit standard deviation
Data preprocessing (Handling Missing Values)
Data in the real world are rarely clean and homogeneous. Typically, they tend to be incomplete, noisy, and inconsistent, and it is an important task of a data scientist to preprocess the data by dealing with missing values properly. Missing values could be NaN, an empty string, ?, -1, -99, -999, and so on. In general, with neural networks, it's safe to input missing values as 0, with the condition that 0 isn't already a meaningful value. The network will learn from exposure to the data that the value 0 means missing data and will start ignoring the value.
Handling Missing Values (Common Techniques)
● Replace value: backward fill, forward fill (as sketched below)
● Replace value by mean, median, or mode
● Drop records / samples having missing values
● Replace missing values using supervised learning (classification or regression)
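A brief pandas sketch of the first techniques; the DataFrame and its age column are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, np.nan, 31]})

df['age_ffill'] = df['age'].ffill()                  # forward fill
df['age_bfill'] = df['age'].bfill()                  # backward fill
df['age_mean'] = df['age'].fillna(df['age'].mean())  # replace by mean
df_dropped = df.dropna(subset=['age'])               # drop incomplete rows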
Data preprocessing (Feature engineering)
Feature engineering efforts mainly have two goals:
● Preparing a proper input dataset, compatible with the machine learning algorithm's requirements.
● Improving the performance of the machine learning models.
The features you use influence more than everything else the result. No algorithm
alone, to my knowledge, can supplement the information gain given by correct feature
engineering. — Luca Massaron (Data Scientist / Author / Google Developer Expert
in Machine Learning)
Overfitting and Underfitting
The fundamental issue in machine learning is the balancing between optimization
and generalization. Optimization refers to the process of adjusting a model to get
the best performance possible on the training data (the learning in machine
learning), whereas generalization refers to how well the trained model performs
on data it has never seen before.
Overfitting and Underfitting
Underfitting refers to a model that can neither model the training data nor
generalize to new data. An underfit machine learning model is not a suitable
model and will be obvious as it will have poor performance on the training data.
Overcome Overfitting (Weight regularization)
L2 weight regularization adds a cost proportional to the square of the weight coefficients to the total loss of the network; in Keras, it is passed per layer via kernel_regularizer:
from tensorflow.keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
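As alternatives to l2, Keras also offers L1 and combined L1+L2 regularizers:

from tensorflow.keras import regularizers

regularizers.l1(0.001)                   # L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)   # simultaneous L1 and L2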
Always remember that the regularization penalty is applied only at training time, when the weights are being learned.
Overcome Overfitting (Add dropout layers)
Dropout is one of the most effective and most commonly used regularization
techniques for neural networks, developed by Geoff Hinton and his students at
the University of Toronto. Dropout, applied to a layer, consists of randomly
dropping out (setting to zero) a number of output features of the layer during
training. The dropout rate is the fraction of the features that are zeroed out; it’s
usually set between 0.2 and 0.5. At test time, no units are dropped out; instead,
the layer’s output values are scaled down by a factor equal to the dropout rate, to
balance for the fact that more units are active than at training time. In keras we
add dropout layer as we add other layers
model.add(layers.Dropout(0.5))
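Conceptually, dropout at training time can be illustrated in NumPy as zeroing out half of the output values at random; a sketch of the idea, not how Keras implements it:

import numpy as np

layer_output = np.random.random((2, 8))  # a hypothetical layer output

# At training time, randomly zero out ~50% of the values in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)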