Deep Learning
Chapter 1
What will we cover
● What is Machine Learning
● Fundamental concepts involved in Machine Learning
● Four Branches of Machine Learning
● What is Deep Learning
● How it works
● What it can achieve
Text Book / Reference Book
● Deep Learning with Python, Authored by FRANÇOIS CHOLLET
AI, Machine Learning & Deep Learning
Artificial Intelligence
The idea of AI was born when scientists started to think about programming computers to do tasks that only a human can do. For a long time, Symbolic AI ruled the field: an approach in which we maintain a large set of hand-crafted rules. Symbolic AI had clear limitations in solving perception problems, such as recognizing or tagging an image, or translating one language into another.
How good is an AI algorithm? (The Turing Test)
Machine Learning
The frustration of crafting hard-coded rules made scientists wonder: what if a program could infer, by itself, the rules that describe the answers? This thought pioneered the field of Machine Learning.
[Diagram] Classical programming: rules + data in, answers out. Machine learning: data + answers in, rules out.
Essential Things in Machine Learning
For machine learning, we need three things:
● Input data points
● Examples of the expected output
● A way to measure whether the algorithm is doing a good job, so its results can be fed back to adjust it
Reference:
https://www.prowesscorp.com/whats-the-difference-between-artificial-intelligence-ai-machine-learning-and-deep-learning/
Deep Learning
In deep learning, these layered representations are (almost always) learned
via models called neural networks, structured in literal layers stacked on top
of each other.
Each layer is updated to serve both the representational needs of the layer above it and those of the layer below it.
Why Deep Learning? Why Now?
The two key ideas of deep learning for computer vision (convolutional neural networks and backpropagation) were already well understood in 1989. The Long Short-Term Memory (LSTM) algorithm, which is fundamental to deep learning for time series, was developed in 1997 and has barely changed since.
So why did deep learning only take off after 2012?
● Hardware
● Data
● Algorithmic advances
Hardware
In the past few years, the introduction of GPUs, and of vendor libraries for running complex computations on them, made deep learning shine: complex tasks over large amounts of data can now be solved in considerably less time. More recently, Google introduced TPUs, which are designed specifically for deep learning tasks and can be up to 10x faster than a GPU.
Data
When it comes to data, in addition to the exponential progress in storage
hardware over the past 20 years (following Moore’s law), the game changer
has been the rise of the internet, making it feasible to collect and distribute
very large datasets for machine learning. Today, large companies work with
image datasets, video datasets, and natural-language datasets that couldn’t
have been collected without the internet. User-generated image tags on Flickr,
for instance, have been a treasure trove of data for computer vision. So are
YouTube videos. And Wikipedia is a key dataset for natural-language
processing.
Algorithms
In addition to hardware and data, until the late 2000s, we were missing a
reliable way to train very deep neural networks. As a result, neural networks
were still fairly shallow, using only one or two layers of representations; thus,
they weren’t able to shine against more-refined shallow methods such as
SVMs and random forests. The key issue was that of gradient propagation
through deep stacks of layers. The feedback signal used to train neural
networks would fade away as the number of layers increased.
A New Wave of Investment
● AI and machine learning have the potential to create an additional $2.6T
in value by 2020 in Marketing and Sales, and up to $2T in manufacturing
and supply chain planning.
● Gartner predicts the business value created by AI will reach $3.9T in 2022.
● IDC predicts worldwide spending on cognitive and Artificial Intelligence
systems will reach $77.6B in 2022.
Reference:
https://www.forbes.com/sites/louiscolumbus/2019/03/27/roundup-of-machine-learning-forecasts-and-market-estimates-2019/#399b12c07695
The Democratization Of Deep Learning
The introduction of new tools and frameworks for languages that support deep learning made it approachable to the common developer with knowledge of a high-level scripting language like Python.
Will it last?
Deep learning has several properties that justify its status as an AI revolution,
and it’s here to stay. We may not be using neural networks two decades from
now, but whatever we use will directly inherit from modern deep learning and
its core concepts.
● Simplicity
● Scalability
● Versatility and reusability
Simplicity
Deep learning removes the need for feature engineering, replacing complex, brittle, engineering-heavy pipelines with simple, end-to-end trainable models that are typically built using only five or six different tensor operations.
Scalability
Deep learning is highly amenable to parallelization on GPUs or TPUs, so it can
take full advantage of Moore’s law. In addition, deep-learning models are
trained by iterating over small batches of data, allowing them to be trained on
datasets of arbitrary size. (The only bottleneck is the amount of parallel
computational power available, which, thanks to Moore's law, is a fast-moving barrier.)
Versatility And Reusability
Unlike many prior machine-learning approaches, deep-learning models can be trained on additional data without restarting from scratch, making them viable for continuous online learning, an important property for very large production models. Furthermore, trained deep-learning models are repurposable and thus reusable: for instance, it's possible to take a deep-learning model trained for image classification and drop it into a video-processing pipeline. This allows us to reinvest previous work into increasingly complex and powerful models. It also makes deep learning applicable to fairly small datasets.
Mathematical Building Blocks of Neural Networks
Chapter 2
What will we cover
● A basic example of Neural Network
● What is a Tensor
● Tensor Operations
● How Neural Networks Learn
● Backpropagation
● Gradient Descent
Classes And Labels
In machine learning, a category in a classification problem is called a class.
Data points are called samples. The class associated with a specific sample is
called a label.
Classification Problem
A classification problem is when the output variable is a category, such as
“red” or “blue” or “disease” and “no disease”.
Reference: https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/
INTRODUCTION OF GOOGLE COLAB
(DEMO)
SETTING UP LOCAL ENVIRONMENT
(DEMO)
Testing Environment
import tensorflow as tf
print(tf.__version__)
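Optionally, we can also check whether TensorFlow sees a GPU; tf.config.list_physical_devices is part of the TF 2.x API, and the output depends on your machine:

print(tf.config.list_physical_devices('GPU'))  # [] means no GPU is visible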
Fashion MNIST class labels:
0 T-shirt/top    5 Sandal
1 Trouser        6 Shirt
2 Pullover       7 Sneaker
3 Dress          8 Bag
4 Coat           9 Ankle boot
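For context, a minimal sketch of loading the dataset behind these labels; it assumes the standard Keras Fashion MNIST loader, and the variable names match the fit call shown later:

import tensorflow as tf

# Fashion MNIST: 60,000 training and 10,000 test grayscale 28x28 images
(training_images, training_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Scale pixel values from the 0-255 range into the 0-1 range
training_images = training_images / 255.0
test_images = test_images / 255.0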
Flatten: serves as the input layer and flattens the rectangular picture array into a 1D array.
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Scalars (0D Tensors)
A tensor that contains only one number is called a scalar, or 0D tensor. Following is a Numpy scalar:
import numpy as np
x = np.array(12)
print(x.ndim) # => 0
Vectors (1D Tensors)
An array of numbers is called a vector, or 1D tensor. A 1D tensor is said to
have exactly one axis. Following is a Numpy vector:
x = np.array([12, 3, 6, 14])
print(x.ndim) # => 1
Matrices (2D Tensors)
An array of vectors is a matrix, or 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers. This is a Numpy matrix:
x = np.array([[5, 78, 2, 34, 0],
              [6, 79, 3, 35, 1],
              [7, 80, 4, 36, 2]])
print(x.ndim) # => 2
3D Tensors and Higher-Dimensional Tensors
If you pack such matrices in a new array, you obtain a 3D tensor, which you
can visually interpret as a cube of numbers. Following is a Numpy 3D tensor:
x = np.array([[[5, 78, 2, 34, 0],
[6, 79, 3, 35, 1],
[7, 80, 4, 36, 2]],
[[5, 78, 2, 34, 0],
[6, 79, 3, 35, 1],
[7, 80, 4, 36, 2]],
[[5, 78, 2, 34, 0],
[6, 79, 3, 35, 1],
[7, 80, 4, 36, 2]]])
print(x.ndim) # => 3
Key Attributes of a Tensor
A tensor is defined by three key attributes: its number of axes (rank, the ndim in Numpy), its shape (a tuple describing how many dimensions the tensor has along each axis), and its data type (dtype).
Broadcasting
When two tensors of different shapes are combined, the smaller tensor is broadcast to match the shape of the larger one in two steps:
1. Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor.
2. The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor.
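A minimal NumPy sketch of these two steps; NumPy performs them implicitly when the shapes differ:

import numpy as np

x = np.random.random((3, 5))   # 2D tensor, shape (3, 5)
y = np.random.random((5,))     # 1D tensor, shape (5,)

# y is broadcast: a new axis is added (shape (1, 5)), then y is repeated
# 3 times along it to match x's shape (3, 5) before the addition
z = x + y
print(z.shape)  # (3, 5)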
Tensor dot
The dot operation, also called a tensor product (not to be confused with an
elementwise product) is the most common, most useful tensor operation.
Contrary to element-wise operations, it combines entries in the input tensors.
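A short NumPy sketch contrasting the two; the shapes here are illustrative:

import numpy as np

x = np.random.random((3, 5))
y = np.random.random((5,))

z = np.dot(x, y)   # tensor product: combines entries, result shape (3,)
w = x * x          # element-wise product: shape stays (3, 5)
print(z.shape, w.shape)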
Tensor reshaping
A third type of tensor operation that’s essential to understand is tensor
reshaping. Reshaping a tensor means rearranging its rows and columns to
match a target shape. Naturally, the reshaped tensor has the same total
number of coefficients as the initial tensor.
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])    # shape (3, 2)
x = x.reshape((6, 1))       # shape (6, 1), same 6 coefficients
output = relu(dot(W, input) + b)
In this expression, W and b are tensors that are attributes of the layer. Initially, these weight matrices are filled with small random values (a step called random initialization). What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called training, is basically the learning that machine learning is all about.
Gradient-Based Optimization
This happens within what's called a training loop, which works as follows. Repeat these steps in a loop, as long as necessary:
1. Draw a batch of training samples x and corresponding targets y.
2. Run the network on x (a forward pass) to obtain predictions y_pred.
3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
4. Update all weights of the network in a way that slightly reduces the loss on this batch.
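As a low-level sketch of one pass through this loop, here is one training step written with TensorFlow's GradientTape; the model, loss_fn, and learning_rate names are our own placeholders:

import tensorflow as tf

def training_step(model, loss_fn, x_batch, y_batch, learning_rate=1e-3):
    # Forward pass: run the network on the batch and compute the loss,
    # recording the operations so gradients can be derived from them
    with tf.GradientTape() as tape:
        y_pred = model(x_batch)
        loss = loss_fn(y_batch, y_pred)
    # Backward pass: gradients of the loss with regard to the weights
    gradients = tape.gradient(loss, model.trainable_weights)
    # Move each weight a small step against its gradient to reduce the loss
    for g, w in zip(gradients, model.trainable_weights):
        w.assign_sub(learning_rate * g)
    return loss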
Now we understand that this network consists of a chain of two Dense layers,
that each layer applies a few simple tensor operations to the input data, and
that these operations involve weight tensors. Weight tensors, which are
attributes of the layers, are where the knowledge of the network persists.
Looking Back At Our First Example
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
This was the network-compilation step: Now we understand that the loss
function is sparse_categorical_crossentropy that’s used as a feedback signal for
learning the weight tensors, and which the training phase will attempt to
minimize. We also know that this reduction of the loss happens via minibatch
stochastic gradient descent. The exact rules governing a specific use of
gradient descent are defined by the Adam optimizer passed as the first
argument.
Looking Back At Our First Example
Finally, this was the training loop:
model.fit(training_images, training_labels, epochs=5, batch_size=128)
Now we understand what happens when you call fit: the network will start to
iterate on the training data in mini-batches of 128 samples, 5 times over (each
iteration over all the training data is called an epoch). At each iteration, the
network will compute the gradients of the weights with regard to the loss on
the batch, and update the weights accordingly. At this point, we know most of
what there is to know about neural networks.
Getting Started with Neural Networks
Chapter 3
Anatomy of a neural network
A neural network mainly comprises the following objects:
● Layers, which are combined into a network (or model)
● The input data and corresponding targets
● The loss function, which defines the feedback signal used for learning
● The optimizer, which determines how learning proceeds
Dense Layers or Fully Connected Layers are used for 2D tensors of shape (samples, features).
The notion of layer compatibility refers to the fact that every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape. Consider the following example:
layer = tf.keras.layers.Dense(32, input_shape=(784,))
This layer accepts input tensors with 784 features along axis 1 and an unspecified / any number of samples along axis 0 (the batch axis). Its output has size 32 at axis 1.
Layers: the building blocks of deep learning
In Keras, the Layer object has a built-in ability to adapt to the shape of its input data, so the developer doesn't have to worry about it. In the example we discussed previously,
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
Here we can observe that every layer defines what it will output but not what
it takes as input.
Models: networks of layers
A deep-learning model is a directed, acyclic graph of layers. The most
common instance is a linear stack of layers, mapping a single input to a single
output. But as you move forward, you’ll be exposed to a much broader variety
of network topologies. Some common ones include the following:
● Two-branch networks
● Multihead networks
● Inception blocks
Picking the right network architecture is more an art than a science; and
although there are some best practices and principles you can rely on, only
practice can help you become a proper neural-network architect.
Loss functions and optimizers:
Once the network architecture is defined, you still have to choose two more things:
● A loss function (objective function): the quantity that will be minimized during training
● An optimizer: determines how the network will be updated based on the loss function
Currently, Keras uses TensorFlow as its default backend and is tightly integrated with TensorFlow 2.0 as the tf.keras module.
Via TensorFlow (or Theano, or CNTK), Keras is able to run seamlessly on both CPUs and GPUs. When running on CPU, TensorFlow is itself wrapping a low-level library for tensor operations called Eigen (http://eigen.tuxfamily.org).
Keras, TensorFlow, Theano, and CNTK
On GPU, TensorFlow wraps a library of well-optimized deep-learning operations called the NVIDIA CUDA Deep Neural Network library (cuDNN).
Developing with Keras: a quick overview
We’ve already seen one example of a Keras model: the Fashion MNIST
example. The typical Keras workflow looks just like that example:
from tensorflow import keras

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.RMSprop(),
              metrics=['accuracy'])
Classifying movie reviews (Loading Dataset)
import tensorflow as tf
import numpy as np

(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.imdb.load_data(num_words=10000)
print(train_data[0])
print(train_labels[0])
# If the training data comes from movie reviews, it must have been text;
# how come it is numbers? Each review has been encoded as a sequence of
# word indices. The argument num_words restricts the function to load
# only the 10,000 most frequently used words.
Contd. (Loading Dataset)
max([max(sequence) for sequence in train_data])
# Because num_words restricts the function to the 10,000 most used words,
# executing the snippet above displays the max word index in train_data
# (no index will exceed 9,999).
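As a side note, the integer sequences can be decoded back into words via the word index that ships with the dataset; a brief sketch (indices are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown):

word_index = tf.keras.datasets.imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in train_data[0])
print(decoded_review)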
Contd. (Preparing the data)
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
Contd. (Building the network)
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Contd. (Building the network)
Because we are facing a binary classification problem and the output of our
network is a probability (we end our network with a single-unit layer with a
sigmoid activation), it’s best to use the binary_crossentropy loss. It isn’t the only
viable choice: we could use, for instance, mean_squared_error . But crossentropy
is usually the best choice when we’re dealing with models that output
probabilities. Crossentropy is a quantity from the field of Information Theory that
measures the distance between probability distributions or, in this case, between
the ground-truth distribution and our predictions. The next slide shows the step where we configure the model with the rmsprop optimizer and the binary_crossentropy loss function. Note that we'll also monitor accuracy during training.
Contd. (Compile the network)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
Contd. (Predictions on new data)
model.predict(x_test)
# array([[0.00579366],
#        [1.        ],
#        [0.4512035 ],
#        ...,
#        [0.00185409],
#        [0.00611657],
#        [0.44180435]], dtype=float32)
Classifying newswires
A multiclass classification example
In this section, we’ll build a network to classify Reuters newswires into 46
mutually exclusive topics. Because we have many classes, this problem is an
instance of multi-class classification.
Single-label, multiclass classification
If each data point should be classified into one and only one category, the
problem is more specifically an instance of single-label, multiclass classification.
Multilabel, multiclass classification
If each data point could belong to multiple categories (in this case, topics), this
case is generally known as multilabel, multiclass classification problem.
The Reuters dataset
We’ll work with the Reuters dataset, a set of short newswires and their topics,
published by Reuters in 1986. It’s a simple, widely used toy dataset for text
classification. There are 46 different topics; some topics are more represented
than others, but each topic has at least 10 examples in the training set. Like IMDB
and MNIST, the Reuters dataset comes packaged as part of Keras.
Classifying newswires (import modules)
import tensorflow as tf
import numpy as np

(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.reuters.load_data(num_words=10000)
print(train_data[0])
print(train_labels[0])
Classifying newswires (Loading Dataset)
As with the IMDB dataset, the argument num_words=10000 restricts the data to
the 10,000 most frequently occurring words found in the data. We have 8,982
training examples and 2,246 test examples:
print(len(train_data)) # 8982
print(len(test_data)) # 2246
Classifying newswires (Preparing the data)
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
Classifying newswires (Preparing the data)
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)
Classifying newswires (Preparing the data)
Instead of the approach on the last slide, we can create one-hot encoded labels with this simple built-in utility:
one_hot_train_labels = tf.keras.utils.to_categorical(train_labels)
one_hot_test_labels = tf.keras.utils.to_categorical(test_labels)
Classifying newswires (Preparing Validation Set)
We set apart the first 1,000 samples of the training data to use as a validation set:
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]
Classifying newswires (Building the network)
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
Observe the difference from the network we created for the IMDB task: besides the layer sizes, the output layer has a softmax activation function instead of sigmoid.
Classifying newswires (Training the network)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train, partial_y_train,
                    epochs=20, batch_size=512,
                    validation_data=(x_val, y_val))
Contd. (Plotting the training and validation loss)
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()
Contd. (Plotting the training and validation accuracy)
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Contd. (Predictions on new data)
predictions = model.predict(x_test)
predictions[1].sum() # 1.000000
predictions[1].argmax() # 14
Contd. (Using integer labels instead of one-hot encoding)
Another way to encode the labels would be to cast them as an integer tensor, like
this:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy',
metrics=['acc'])
Importance of Layer Size
Because there are 46 output classes, intermediate layers should have at least 46 units. Any intermediate layer with fewer than 46 units creates an information bottleneck: relevant information is permanently dropped, and accuracy falls.
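To see this bottleneck in practice, one could shrink an intermediate layer well below 46 and watch validation accuracy fall; a sketch reusing the models/layers imports from earlier:

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))   # bottleneck: only 4 units
model.add(layers.Dense(46, activation='softmax'))
# Training this network noticeably drops validation accuracy, because
# 46-way class information must squeeze through a 4-dimensional layer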
Predicting house prices: Regression example
K Fold Cross Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data
sample.
import numpy as np
import tensorflow as tf

(train_data, train_targets), (test_data, test_targets) = \
    tf.keras.datasets.boston_housing.load_data()
print(train_data.shape)   # (404, 13)
print(test_data.shape)    # (102, 13)
print(train_targets)
Regression example (Normalizing the data)
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std
Regression example (Building the network)
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
Regression example (K-Fold Validation)
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
all_mae_histories = []

for i in range(k):
    print('processing fold #', i)
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)
    model = build_model()
Regression example (K-Fold Validation)
    history = model.fit(partial_train_data, partial_train_targets,
                        epochs=num_epochs, batch_size=1, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
    mae_history = history.history['mae']
    all_mae_histories.append(mae_history)
Regression example (Plotting Validation Error)
average_mae_history = [np.mean([x[i] for x in all_mae_histories])
                       for i in range(num_epochs)]
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
Regression example (Final Model Training)
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
print(test_mae_score)
Fundamentals of Machine Learning
Chapter 4
Four branches of machine learning
Supervised learning
This is by far the most common case. It consists of learning to map input data to
known targets (also called labels), given a set of examples (often labeled by
humans). All four examples we have encountered in this book so far are
canonical examples of supervised learning.
Unsupervised learning
Unsupervised learning is a type of machine learning algorithm used to draw
inferences from datasets consisting of input data without labeled responses.
Classification and regression glossary
Target—The truth; what our model should ideally have predicted, according to an external source of data
Prediction error or loss value—A measure of the difference between our
model’s prediction and the target
Vector regression—A task where the target is a set of continuous vector values
Classification and regression glossary
Mini-batch or batch—A small set of samples (typically between 8 and 128) that
are processed simultaneously by the model. The number of samples is often a
power of 2, to facilitate memory allocation on GPU. When training, a mini-batch is
used to compute a single gradient-descent update applied to the weights of the
model.
Evaluating machine-learning models
Generalization
Generalization usually refers to an ML model's ability to perform well on new, unseen data rather than just the data it was trained on. The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.
Underfitting
Underfitting occurs when a model is too simple — informed by too few features
or regularized too much — which makes it inflexible in learning from the dataset.
Simple learners tend to have less variance in their predictions but more bias towards wrong outcomes.
Overfitting
Overfitting happens when a model learns the detail and noise in the training data
to the extent that it negatively impacts the performance of the model on new
data. This means that noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data, which negatively impacts the model's ability to generalize.
Underfitting vs Overfitting vs Generalization
Underfit Model. A model that fails to sufficiently learn the problem; it performs poorly on the training dataset and does not perform well on a holdout sample.
Overfit Model. A model that learns the training dataset too well; it performs well on the training dataset but does not perform well on a holdout sample.
Good Fit Model. A model that suitably learns the training dataset and generalizes well to the holdout dataset.
Training, Validation and Testing
Evaluating a model generally boils down to splitting the available data into three
sets: training, validation, and test. We train on the training data and evaluate our
model on the validation data. Once our model is ready for prime time, we test it
one final time on the test data. Training a model involves tuning its hyperparameters based on the validation loss/error. Every time we tune the model to perform well on the validation data, we leak some information about that data into the model. Hence, once we achieve the desired validation accuracy, we evaluate the model on the test dataset, which the model has not encountered during training and validation.
Splitting Techniques For Dataset
SIMPLE HOLD-OUT VALIDATION: As we have seen in previous slides, in this technique we split the dataset into three chunks: one for training, one for validation, and the last one for testing.
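A minimal NumPy sketch, assuming data is an array holding all available samples:

import numpy as np

np.random.shuffle(data)  # shuffle first, for representativeness

num_validation_samples = 10000
validation_data = data[:num_validation_samples]
training_data = data[num_validation_samples:]
# Train on training_data, tune on validation_data, and keep a separate,
# untouched test split for the final evaluation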
Splitting Techniques For Dataset
K- FOLD VALIDATION: With this approach, we split our data into K partitions of
equal sizes. For each partition i , train the model on the remaining K – 1
partitions, and evaluate it on partition i . Our final score is then the average of the
K scores obtained. This method is helpful when the performance of our model
shows significant variance based on our train-test split. Like hold-out validation,
this method doesn’t exempt us from using a distinct validation set for model
calibration / testing
Splitting Techniques For Dataset
ITERATED K- FOLD VALIDATION WITH SHUFFLING: This one is for situations in
which we have relatively little data available and we need to evaluate our model
as precisely as possible. This approach has been found extremely helpful in
Kaggle competitions. It consists of applying K-fold validation multiple times,
shuffling the data every time before splitting it K ways. The final score is the
average of the scores obtained at each run of K-fold validation. Note that we end
up training and evaluating P × K models (where P is the number of iterations we
used), which can be very expensive.
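A sketch of the procedure, where get_model and train_and_evaluate are hypothetical helpers and data is assumed to be a NumPy array:

import numpy as np

P, K = 3, 4                      # P shuffling iterations, K folds each
scores = []
for p in range(P):
    np.random.shuffle(data)      # reshuffle before every K-fold run
    fold_size = len(data) // K
    for k in range(K):
        validation = data[k * fold_size:(k + 1) * fold_size]
        training = np.concatenate(
            [data[:k * fold_size], data[(k + 1) * fold_size:]])
        model = get_model()      # hypothetical helper
        scores.append(train_and_evaluate(model, training, validation))
final_score = np.mean(scores)    # average over all P x K runs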
Splitting Techniques (Things to keep in mind)
Data representativeness—Our training set and test set must be representative of the data at hand; they must contain examples of all the classes. For this
reason, we must randomly shuffle our data before splitting it into training and
test sets.
The arrow of time—If we’re trying to predict the future given the past (for
example, tomorrow’s weather, stock movements, and so on), we should not
shuffle our data before splitting it, because doing so will create a temporal leak:
our model will effectively be trained on data from the future. In such situations,
we should always make sure all data in our test set is posterior to the data in the
training set.
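A two-line sketch of such a split, assuming data is sorted chronologically:

split = int(len(data) * 0.8)
train_data = data[:split]   # the past
test_data = data[split:]    # the future: posterior to all training data
# Note: no shuffling before the split, to avoid a temporal leak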
Splitting Techniques (Things to keep in mind)
Redundancy in data—If some data points in our data appear twice (fairly common with real-world data), then shuffling the data and splitting it into a training set and a validation set will result in redundancy between the training and validation sets. In effect, we'll be testing on part of our training data, which is the worst thing anyone can do! Make sure our training and validation sets are disjoint.
Data preprocessing, feature engineering, and feature learning
Data preprocessing for neural networks
Data preprocessing aims at making the raw data at hand more amenable to neural networks. This includes vectorization, normalization, handling missing values, and feature extraction.
Data preprocessing (Vectorization)
All inputs and targets in a neural network must be tensors of floating-point data (or, in specific cases, tensors of integers). Whatever data we need to process (sound, images, text), we must first turn it into tensors; this step is called data vectorization.
Data preprocessing (Normalization)
In general, it isn’t safe to feed into a neural network data that has large values (for
example, multidigit integers, which are much larger than the initial values taken
by the weights of a network) or data that is heterogeneous (for example, data
where one feature is in the range 0–1 and another is in the range 100–200). Doing so can trigger large gradient updates that will prevent the network from converging.
Data preprocessing (Normalization)
To make learning easier for our network, our data should have the following characteristics:
● Take small values: typically, most values should be in the 0–1 range.
● Be homogenous: that is, all features should take values in roughly the same range.
Data preprocessing (Normalization)
Additionally, the following stricter normalization practice is common and can
help, although it isn’t always necessary (for example, we didn’t do this in the digit-
classification example):
x -= x.mean(axis=0)  # zero-center each feature
x /= x.std(axis=0)   # give each feature unit standard deviation
Data preprocessing (Handling Missing Values)
Data in the real world are rarely clean and homogeneous. Typically, they tend to be incomplete, noisy, and inconsistent, and it is an important task of a data scientist to preprocess the data by dealing with missing values properly. Missing values could be NaN, an empty string, ?, -1, -99, -999, and so on. In general, with neural networks, it's safe to input missing values as 0, with the condition that 0 isn't already a meaningful value. The network will learn from exposure to the data that the value 0 means missing data and will start ignoring the value.
Handling Missing Values (Common Techniques)
● Replace value: backward fill, forward fill (as sketched below)
● Replace value by mean, median, or mode
● Drop records / samples having missing values
● Replace missing values using supervised learning (classification or regression)
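A brief pandas sketch of the first techniques; the DataFrame and its age column are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, np.nan, 31]})

df['age_ffill'] = df['age'].ffill()                  # forward fill
df['age_bfill'] = df['age'].bfill()                  # backward fill
df['age_mean'] = df['age'].fillna(df['age'].mean())  # replace by mean
df_dropped = df.dropna(subset=['age'])               # drop incomplete rows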
Data preprocessing (Feature engineering)
Feature engineering efforts mainly have two goals:
● Preparing a proper input dataset, compatible with the machine learning algorithm's requirements.
● Improving the performance of the machine learning models.
The features you use influence more than everything else the result. No algorithm
alone, to my knowledge, can supplement the information gain given by correct feature
engineering. — Luca Massaron (Data Scientist / Author / Google Developer Expert
in Machine Learning)
Overfitting and Underfitting
The fundamental issue in machine learning is the balancing between optimization
and generalization. Optimization refers to the process of adjusting a model to get
the best performance possible on the training data (the learning in machine
learning), whereas generalization refers to how well the trained model performs
on data it has never seen before.
Overfitting and Underfitting
Underfitting refers to a model that can neither model the training data nor
generalize to new data. An underfit machine learning model is not a suitable
model and will be obvious as it will have poor performance on the training data.
Overcome Overfitting (Weight regularization)
L2 weight regularization adds a cost proportional to the square of the weight coefficients to the total loss of the network; in Keras, it is passed per layer via kernel_regularizer:
from tensorflow.keras import models, layers, regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
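As alternatives to l2, Keras also offers L1 and combined L1+L2 regularizers:

from tensorflow.keras import regularizers

regularizers.l1(0.001)                   # L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)   # simultaneous L1 and L2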
Always remember that the regularization penalty is applied only at training time, when the weights are being learned.
Overcome Overfitting (Add dropout layers)
Dropout is one of the most effective and most commonly used regularization
techniques for neural networks, developed by Geoff Hinton and his students at
the University of Toronto. Dropout, applied to a layer, consists of randomly
dropping out (setting to zero) a number of output features of the layer during
training. The dropout rate is the fraction of the features that are zeroed out; it’s
usually set between 0.2 and 0.5. At test time, no units are dropped out; instead,
the layer’s output values are scaled down by a factor equal to the dropout rate, to
balance for the fact that more units are active than at training time. In keras we
add dropout layer as we add other layers
model.add(layers.Dropout(0.5))
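Conceptually, dropout at training time can be illustrated in NumPy as zeroing out half of the output values at random; a sketch of the idea, not how Keras implements it:

import numpy as np

layer_output = np.random.random((2, 8))  # a hypothetical layer output

# At training time, randomly zero out ~50% of the values in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)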