AI and Deep Learning
The history of AI is by now a long one. Its birth is usually dated to the publication of the
question "Can machines think?": this phrase, used by Alan Turing when introducing the
imitation game, is considered the beginning of AI.
On the other hand, the term itself owes its origin to John McCarthy, the computer scientist
who, in 1956, organized the Dartmouth conference at which the term was officially coined.
The initial enthusiasm was followed by the so-called "AI winter", a period from the
1970s to the 1990s in which the limited capabilities of the available
instrumentation brought progress to an abrupt halt.
Later, thanks to technological advances, AI has been enjoying a new renaissance since the
2010s. In this new "AI spring", AI in Medicine (AIM) has been no exception.
This was also made possible by the widespread digitalization of health data, which allowed
the creation of big data systems capable of providing a solid basis for intelligent
algorithms.
Borges do Nascimento et al. analyzed the impact of big data analysis on the health indicators
and core priorities described in the World Health Organization (WHO) General Program of
Work 2019/2023 and in the European Program of Work (EPW). Their article highlighted how
the accuracy and management of some chronic diseases can be improved by supporting
real-time analysis for diagnostic and predictive purposes.
Intelligent Agents:
An intelligent agent is an autonomous entity that acts upon an environment using sensors and
actuators to achieve goals. An intelligent agent may learn from its environment to achieve
those goals. A thermostat is a simple example of an intelligent agent.
The following are the four main rules for an AI agent:
o Rule 1: An AI agent must have the ability to perceive the environment.
o Rule 2: The observations must be used to make decisions.
o Rule 3: Decisions should result in actions.
o Rule 4: The actions taken by an AI agent must be rational.
Types of Environment
There are different sorts of environments, which affect what an agent has to be able to cope
with.
In designing agents, one should always consider the pair of agent and environment together.
• Fully Observable vs. Partially Observable: If an agent’s sensors give it full access to the
complete state of the environment, the environment is fully observable, otherwise it is only
partially observable or unobservable.
• Deterministic vs. Stochastic: If the next state of the environment is completely determined by
the current state and the agent’s selected action, the environment is deterministic. An
environment may appear stochastic if it is only partially observable.
• Episodic vs. Sequential: If future decisions do not depend on the actions an agent has taken,
just the information from its sensors about the state it is in, then the environment is episodic.
• Static vs. Dynamic: If the environment can change while the agent is deciding what to do, the
environment is dynamic.
• Discrete vs. Continuous: If the sets of percepts and actions available to the agent are finite,
and the individual elements are distinct and well-defined, then the environment is discrete.
• Single Agent vs. Multiagent: Must other entities in the environment be modelled as agents?
Are they cooperative or competitive?
Rationality:
The rationality of an agent is measured by its performance measure. Rationality can be judged
on the basis of the following points:
o The performance measure, which defines the success criterion.
o The agent's prior knowledge of its environment.
o The best possible actions that the agent can perform.
o The sequence of percepts.
Structure of an AI Agent
The task of AI is to design an agent program which implements the agent function. The
structure of an intelligent agent is a combination of architecture and agent program. It can be
viewed as:
Agent = Architecture + Agent program
The following are the three main terms involved in the structure of an AI agent:
Architecture: The machinery that the AI agent executes on.
Agent Function: The agent function maps a percept sequence to an action:
f:P* → A
Agent program: The agent program is an implementation of the agent function. It
executes on the physical architecture to produce the function f.
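As a rough illustration, here is a minimal Python sketch of an agent program for the
thermostat example above; the names and the 20-degree threshold are illustrative choices,
not a standard API.

def thermostat_agent(percept_history):
    # Agent function f: P* -> A, mapping the percept sequence to an action.
    current_temp = percept_history[-1]   # latest sensor reading
    if current_temp < 20.0:
        return "heat_on"                 # actuator command
    return "heat_off"

# The architecture runs the agent program in a perceive-decide-act loop:
percepts = []
for reading in [18.5, 19.2, 20.4]:       # simulated sensor input
    percepts.append(reading)
    print(reading, "->", thermostat_agent(percepts))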
At its core, an AI agent is made up of four components: the environment, sensors, actuators,
and the decision-making mechanism.
1. Environment
The environment refers to the area or domain in which an AI agent operates. It can be a physical
space, like a factory floor, or a digital space, like a website.
2. Sensors
Sensors are the tools that an AI agent uses to perceive its environment. These can be cameras,
microphones, or any other sensory input that the AI agent can use to understand what is
happening around it.
3. Actuators
Actuators are the tools that an AI agent uses to interact with its environment. These can be
things like robotic arms, computer screens, or any other device the AI agent can use to change
the environment.
4. Decision-making mechanism
A decision-making mechanism is the brain of an AI agent. It processes the information gathered
by the sensors and decides what action to take using the actuators. The decision-making
mechanism is where the real magic happens.
AI agents use various decision-making mechanisms, such as rule-based systems, expert systems,
and neural networks, to make informed choices and perform tasks effectively.
UNIT 2
The Mean Squared Error is MSE = 1/n * Σ(actual - predicted)^2. Here, the error term is
squared and is thus more sensitive to outliers than the Mean Absolute Error (MAE).
For actual values (5, 10, 15, 20) and predictions (4.8, 10.6, 14.3, 20.1):
MSE = 1/4 * (|5-4.8|^2+|10-10.6|^2+|15-14.3|^2+|20-20.1|^2) = 0.225
Since MSE includes squared error terms, we take the square root of the MSE, which gives rise to
Root Mean Squared Error (RMSE).
Thus, RMSE = (0.225)^0.5 = 0.474
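The same numbers can be checked with a few lines of NumPy (a sketch; the arrays simply
restate the values used above):

import numpy as np

actual    = np.array([5, 10, 15, 20])
predicted = np.array([4.8, 10.6, 14.3, 20.1])

mse  = np.mean((actual - predicted) ** 2)   # 0.225
rmse = np.sqrt(mse)                         # ~0.474
print(mse, rmse)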
R-Squared
R-squared is calculated by dividing the sum of squared residuals (SSres) from the regression
model by the total sum of squares (SStot) of errors from the average model, and then
subtracting the result from 1: R-squared = 1 - SSres/SStot.
R-squared is also known as the Coefficient of Determination. It explains the degree to which
the input variables explain the variation of the output / predicted variable.
An R-squared value of 0.81 tells us that the input variables explain 81% of the variation in the
output variable. The higher the R-squared, the more variation is explained by the input variables
and the better the model.
This metric has a limitation, however, which is addressed by the Adjusted R-squared.
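Continuing with the regression values used above, R-squared can be computed directly
(an illustrative sketch, not a particular library's API):

import numpy as np

actual    = np.array([5, 10, 15, 20])
predicted = np.array([4.8, 10.6, 14.3, 20.1])

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares: 0.9
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares: 125.0
r2 = 1 - ss_res / ss_tot
print(r2)   # ~0.9928: the inputs explain about 99% of the variation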
Performance Metrics for Classification
Classification is the problem of identifying to which of a set of categories/classes a new
observation belongs, based on the training set of data containing records whose class label is
known. Following are the performance metrics used for evaluating a classification model:
Accuracy
Precision and Recall
Specificity
F1-score
AUC-ROC
To understand different metrics, we must understand the Confusion matrix. A confusion matrix
is a table that is often used to describe the performance of a classification model (or "classifier")
on a set of test data for which the true values are known.
TN - True Negatives (actual 0, predicted 0)
TP - True Positives (actual 1, predicted 1)
FP - False Positives (actual 0, predicted 1)
FN - False Negatives (actual 1, predicted 0)
Consider the following values for the confusion matrix:
True negatives (TN) = 300
True positives (TP) = 500
False negatives (FN) = 150
False positives (FP) = 50
Accuracy
Accuracy is defined as the ratio of the number of correct predictions to the total number of
predictions: Accuracy = (TP+TN)/(TP+TN+FP+FN). It lies in [0,1]. In general, higher accuracy
means a better model (TP and TN must be high).
However, accuracy is not a useful metric for an imbalanced dataset (a dataset with an uneven
distribution of classes). Say we have data on 1000 patients, of which 50 have cancer and 950 do
not: a dumb model that always predicts "no cancer" will have an accuracy of 95%, but it is of no
practical use, since in this case we want to keep the number of False Negatives to a minimum.
Thus, we have different metrics like recall, precision, F1-score, etc.
Thus, Accuracy using above values will be (500+300)/(500+50+150+300) = 800/1000 = 80%
Recall
Recall = TP/(TP+FN). Recall is a useful metric in cancer detection, where we want to minimize
the number of False Negatives for any practical use, since we don't want our model to mark a
patient suffering from cancer as safe. On the other hand, predicting a healthy patient as
cancerous is less of an issue, since further diagnosis will make clear that he does not have
cancer. Recall is also known as Sensitivity.
Thus, Recall using the above values will be 500/(500+150) = 500/650 = 76.92%
Precision
Precision = TP/(TP+FP). Precision is useful when we want to reduce the number of False
Positives. Consider a system that predicts whether a received e-mail is spam. Taking spam as
the positive class, we do not want our system to predict non-spam (important) e-mails as spam,
i.e., the aim is to reduce the number of False Positives.
Thus, Precision using the above values will be 500/(500+50) = 500/550 = 90.91%
F1-score
F1-score is a metric that combines Precision and Recall: it is their harmonic mean,
F1 = 2 * Precision * Recall / (Precision + Recall). Its value lies in [0,1] (the higher, the better).
Using precision = 0.9091 and recall = 0.7692, F1-score = 0.8333 = 83.33%
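All four metrics follow directly from the confusion matrix values above, as this short sketch
shows:

TP, TN, FP, FN = 500, 300, 50, 150   # confusion matrix values from above

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 0.80
recall    = TP / (TP + FN)                                 # ~0.7692 (sensitivity)
precision = TP / (TP + FP)                                 # ~0.9091
f1        = 2 * precision * recall / (precision + recall)  # ~0.8333
print(accuracy, recall, precision, f1)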
Data pre-processing
Data pre-processing is the process of transforming raw data into an understandable format. It is
also an important step in data mining as we cannot work with raw data. The quality of the data
should be checked before applying machine learning or data mining algorithms.
Why is Data Pre-processing Important?
Pre-processing of data is mainly to check the data quality. The quality can be checked by the
following:
Accuracy: Whether the data entered is correct.
Completeness: Whether all required data is available and recorded.
Consistency: Whether the same data matches across all the places where it is kept.
Timeliness: Whether the data is updated in good time.
Believability: Whether the data is trustworthy.
Interpretability: How easily the data can be understood.
Data Understanding
Data Understanding involves several key activities, including reviewing the data, identifying
any problems or inconsistencies in the data, and determining the appropriate techniques for
cleaning and pre-processing the data.
During this phase, the data analyst must also identify any missing values or outliers and
decide on the best way to handle them.
This is an important step in ensuring that the data is suitable for analysis and that the results
are accurate and reliable.
One of the benefits of Data Understanding is that it allows the data analyst to identify any
potential biases or limitations in the data that may impact the results of the analysis.
For example, if the data is biased towards a particular group or if it contains a large number
of missing values, this can skew the results of the analysis.
By identifying these issues early on in the process, the data analyst can take the necessary
steps to address them and ensure that the data is of the highest quality.
Another benefit of Data Understanding is that it allows the data analyst to gain a deeper
understanding of the data and to identify any relationships or patterns that may be of
interest.
For example, by exploring the data and examining the relationships between different
variables, the data analyst may be able to identify important insights or trends that can be
used to inform the analysis.
This can lead to more accurate and meaningful results, which can be used to make better
decisions and drive business success.
Neural networks:
A neural network is a fusion of artificial intelligence and brain-inspired design that is
reshaping modern computing.
Neural networks mimic the basic functioning of the human brain and are inspired by how
the human brain interprets information.
There are different types of neural networks, from feedforward to recurrent and
convolutional, each tailored for specific tasks.
They solve various real-time tasks thanks to their ability to perform computations quickly
and respond fast.
UNIT 3
Introduction to CNN
A Convolutional Neural Network (CNN) is a deep learning architecture designed for image
analysis and recognition.
It employs specialized layers to automatically learn features from images, capturing patterns
of increasing complexity.
These features are then used to classify objects or scenes.
CNNs have revolutionized computer vision tasks, exhibiting high accuracy and efficiency in
tasks like image classification, object detection, and image generation.
The fundamental principle of Convolutional Neural Networks (CNNs) is hierarchical feature
learning.
CNNs process input data, often images, by applying a series of convolutional and pooling
layers.
Convolutional layers employ small filters to convolve across the input, detecting spatial
patterns.
Pooling layers downsample the output, retaining the important information. This enables the
network to progressively learn hierarchical features, from simple edges to complex object
parts.
The learned features are then used for classification or other tasks.
CNNs’ ability to automatically learn and abstract features from data has made them
exceptionally effective in image analysis, with applications spanning various fields.
Components of CNN
The CNN is made up of three types of layers: convolutional layers, pooling layers, and fully-
connected (FC) layers.
Convolution Layers
This is the very first layer in the CNN and is responsible for extracting the different
features from the input images. In this layer, the mathematical operation of convolution is
performed between the input image and a filter of a specific size MxM.
Fully Connected Layer
The Fully Connected (FC) layer comprises the weights and biases together with the neurons and
is used to connect the neurons between two separate layers. These layers are usually positioned
before the output layer and form the last few layers of a CNN architecture.
Pooling Layer
The pooling layer is responsible for reducing the spatial size of the convolved feature. By
significantly reducing the dimensions, it decreases the computing power required to process
the data.
There are two main types of pooling:
1. Average pooling
2. Max pooling
A Pooling Layer is usually applied after a Convolutional Layer. This layer’s major goal is to lower
the size of the convolved feature map to reduce computational expenses. This is accomplished
by reducing the connections between layers and operating independently on each feature map.
There are numerous sorts of Pooling operations, depending on the mechanism utilised.
The largest element is obtained from the feature map in Max Pooling. The average of the
elements in a predefined sized Image segment is calculated using Average Pooling. Sum Pooling
calculates the total sum of the components in the predefined section. The Pooling Layer is
typically used to connect the Convolutional Layer and the FC Layer.
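A tiny NumPy sketch makes the difference between max and average pooling concrete; the
4x4 feature map and the 2x2 window with stride 2 are arbitrary illustrative choices.

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 1],
                 [3, 4, 2, 9]])

# 2x2 pooling with stride 2: each 2x2 window collapses to a single value.
windows = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)
print(windows.max(axis=(2, 3)))    # max pooling:     [[6 4] [7 9]]
print(windows.mean(axis=(2, 3)))   # average pooling: [[3.75 2.25] [4. 5.]]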
Dropout
To avoid overfitting (when a model performs well on training data but not on new data), a
dropout layer is utilised, in which a few neurons are removed from the neural network during
the training phase, resulting in a smaller model.
Activation Functions
They are utilised to learn and approximate any form of continuous and complex
variable-to-variable association in the network.
They give the network non-linearity. ReLU, Softmax, and tanh are some of the most often
utilised activation functions.
Basic Architecture
There are two main parts to a CNN architecture:
• A convolution tool that separates and identifies the various features of the image for
analysis, in a process called Feature Extraction. The feature extraction network consists of
many pairs of convolutional and pooling layers.
• A fully connected layer that takes the output from the convolution process and
predicts the class of the image based on the features extracted in the previous stages.
The feature extraction part of the CNN aims to reduce the number of features present in a
dataset: it creates new features which summarise the existing features contained in the
original set. The individual CNN layers are described below.
1. Convolutional Layer
This layer is the first layer that is used to extract the various features from the input images. In
this layer, the mathematical operation of convolution is performed between the input image
and a filter of a particular size MxM. By sliding the filter over the input image, the dot product is
taken between the filter and the parts of the input image with respect to the size of the filter
(MxM).
The output is termed as the Feature map which gives us information about the image such as
the corners and edges. Later, this feature map is fed to other layers to learn several other
features of the input image.
The convolution layer passes the result to the next layer after applying the convolution
operation to the input. Convolutional layers benefit a CNN greatly because they keep the
spatial relationship between the pixels intact.
2. Pooling Layer
In most cases, a Convolutional Layer is followed by a Pooling Layer. The primary aim of this layer
is to decrease the size of the convolved feature map to reduce the computational costs. This is
performed by decreasing the connections between layers; the pooling layer operates
independently on each feature map. Depending upon the method used, there are several types
of pooling operations. Pooling basically summarises the features generated by a convolution
layer.
In Max Pooling, the largest element is taken from the feature map. Average Pooling calculates
the average of the elements in a predefined-size image section. The total sum of the elements in
the predefined section is computed in Sum Pooling. The Pooling Layer usually serves as a bridge
between the Convolutional Layer and the FC Layer.
The pooling layer generalises the features extracted by the convolution layer and helps the
network recognise the features independently of their position. It also reduces the
computations needed in the network.
3. Fully Connected Layer
The Fully Connected (FC) layer consists of the weights and biases along with the neurons and is
used to connect the neurons between two different layers. These layers are usually placed
before the output layer and form the last few layers of a CNN Architecture.
In this stage, the input from the previous layers is flattened and fed to the FC layer. The
flattened vector then passes through a few more FC layers, where the usual mathematical
operations take place; this is where the classification process begins. Two fully connected
layers are used rather than one because they perform better than a single connected layer.
These layers in a CNN reduce the need for human supervision.
4. Dropout
Usually, when all the features are connected to the FC layer, the model can overfit the
training dataset. Overfitting occurs when a model works so well on the training data that it
performs poorly when used on new data.
To overcome this problem, a dropout layer is utilised wherein a few neurons are dropped from
the neural network during training process resulting in reduced size of the model. On passing a
dropout of 0.3, 30% of the nodes are dropped out randomly from the neural network.
Dropout results in improving the performance of a machine learning model as it prevents
overfitting by making the network simpler. It drops neurons from the neural networks during
training.
5. Activation Functions
Finally, one of the most important parameters of the CNN model is the activation function.
Activation functions are used to learn and approximate any kind of continuous and complex
relationship between variables of the network. In simple words, they decide which information
should be fired forward at the end of the network and which should not.
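The five building blocks above fit together as in the following Keras sketch; the layer sizes,
the 28x28 grayscale input, and the 10-class output are illustrative assumptions, not a
prescribed architecture.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),              # e.g. a grayscale image
    layers.Conv2D(32, (3, 3), activation="relu"), # convolution + activation
    layers.MaxPooling2D((2, 2)),                  # pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                             # flatten feature maps for the FC layers
    layers.Dense(64, activation="relu"),          # fully connected layer
    layers.Dropout(0.3),                          # drops 30% of nodes during training
    layers.Dense(10, activation="softmax"),       # output layer
])
model.summary()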
Introduction to Tensorflow Hub
TensorFlow Hub is a library of reusable machine learning modules, where
a module contains a self-contained piece of a TensorFlow graph along with its weights and
assets, so it can be reused for transfer learning across different tasks.
It is very easy to use: there is no need to have a clear understanding of the model
architecture for retraining or inference.
Just add a small snippet of code to your program to turn it into a fantastic deep
learning application. Sounds cool, right?
Installation
Just pip install the tensorflow_hub package.
pip install tensorflow-hub
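A minimal usage sketch: load a published text-embedding module and call it like a Keras
layer. The module handle below is one example from tfhub.dev; any compatible handle
works the same way.

import tensorflow as tf
import tensorflow_hub as hub

# Download (and cache) a pre-trained 50-dimensional text-embedding module.
embed = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2")
embeddings = embed(tf.constant(["deep learning", "transfer learning"]))
print(embeddings.shape)   # (2, 50): one 50-dim vector per input string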
ResNet
ResNet (short for "Residual Neural Network") is a family of deep convolutional neural networks
designed to overcome the vanishing-gradient problem that is common in very deep
networks. The idea behind ResNet is to use "residual blocks" that allow for the direct
propagation of gradients through the network, enabling the training of very deep networks.
A residual block consists of two or more convolutional layers followed by an activation function,
combined with a shortcut connection that bypasses the convolutional layers and adds the
original input directly to the output of the convolutional layers after the activation function.
This allows the network to learn residual functions that represent the difference between the
convolutional layers’ input and output, rather than trying to learn the entire mapping directly.
The use of residual blocks enables the training of very deep networks, with hundreds or
thousands of layers, significantly alleviating the issue of vanishing gradients.
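A residual block as described above can be sketched in Keras as follows; the filter count,
kernel size, and input shape are illustrative, and the shortcut here is the common identity
variant that adds the input back in just before the final activation.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                            # identity shortcut connection
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)   # first convolution
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)   # second convolution
    y = layers.Add()([y, shortcut])                         # add the original input
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
model.summary()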
Applications of CNN
Convolutional Neural Networks (CNNs) are widely used in various applications such as:
Object Detection: CNNs can detect and locate objects in images or videos.
Image Segmentation: CNNs can segment images into different regions and tag each region
with a semantic class.
Video Analytics: CNNs can be used for action detection, object tracking, and video scene
segmentation.
Natural Language Processing: CNNs can be used for text classification, sentiment analysis,
and language translation tasks.
Autonomous Systems: CNNs can be used in autonomous systems such as self-driving cars
for lane detection, obstacle detection, and traffic sign recognition.
Decoding Facial Recognition: One of the main applications of this architecture is facial
recognition. Using this technique, facial images are broken down into multiple components,
separating significant facial features from external factors such as lighting or pose and
identifying the unique features of each face.
Document Analysis: Documents, including handwritten materials, can be analysed
using CNN architectures. The error rate when comparing documents against available content
is reduced to near zero. Thousands of commands run simultaneously to analyse
handwritten content using a CNN, which would be very difficult otherwise.
Recognition of Speech: Besides image processing, neural networks are also useful for
recognizing speech with a huge range of vocabulary and phonics. Emotion detection using
CNNs is also a focus area for researchers.
UNIT 4
All the inputs and outputs in a standard neural network are independent of one another.
However, in some circumstances, such as predicting the next word of a phrase, the
previous words are needed and must therefore be remembered.
As a result, the RNN was created, which uses a Hidden Layer to overcome the problem.
The most important component of an RNN is the hidden state, which remembers specific
information about a sequence.
RNNs have a memory that stores all information about the calculations so far. They use the
same parameters for every input, since they perform the same task on all inputs and hidden
layers to produce the output.
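As a small sketch of this weight sharing, the Keras layer below applies the same recurrent
cell at every time step and emits a hidden state per step; the layer width and input shape
are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

rnn = layers.SimpleRNN(16, return_sequences=True)  # same weights reused at every step
x = tf.random.normal((1, 5, 8))   # (batch, time steps, features)
h = rnn(x)                        # hidden state emitted at each of the 5 steps
print(h.shape)                    # (1, 5, 16)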
Advantages of RNNs:
Handle sequential data effectively, including text, speech, and time series.
Process inputs of any length, unlike feedforward neural networks.
Share weights across time steps, enhancing training efficiency.
Disadvantages of RNNs:
Prone to vanishing and exploding gradient problems, hindering learning.
Training can be challenging, especially for long sequences.
Computationally slower than other neural network architectures.
LSTM Architecture
At a high level, an LSTM works very much like an RNN cell. Here is the internal functioning of
the LSTM network. The LSTM network architecture consists of three parts, and each part
performs an individual function.
The first part chooses whether the information coming from the previous timestamp is to be
remembered or is irrelevant and can be forgotten.
In the second part, the cell tries to learn new information from the input to this cell.
At last, in the third part, the cell passes the updated information from the current
timestamp to the next timestamp. This one cycle of LSTM is considered a single time step.
These three parts of an LSTM unit are known as gates.
They control the flow of information in and out of the memory cell or lstm cell.
The first gate is called Forget gate, the second gate is known as the Input gate, and the last
one is the Output gate.
An LSTM unit that consists of these three gates and a memory cell (the lstm cell) can be
considered as a layer of neurons in a traditional feedforward neural network, with each
neuron maintaining a hidden state and a current (cell) state.
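One time step of the gating logic can be sketched in plain NumPy; the sizes are arbitrary,
the weights are randomly initialised stand-ins, and biases are omitted for brevity.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_in, n_hid = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_hid, n_in + n_hid)) for g in "fio"}  # gate weights
Wc = rng.normal(size=(n_hid, n_in + n_hid))                     # candidate weights

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["f"] @ z)                # forget gate: what to discard from c_prev
    i = sigmoid(W["i"] @ z)                # input gate: what new information to store
    o = sigmoid(W["o"] @ z)                # output gate: what to expose as h_t
    c = f * c_prev + i * np.tanh(Wc @ z)   # updated cell state
    h = o * np.tanh(c)                     # updated hidden state
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
print(h, c)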
UNIT 5
BERT Example
The bidirectionality of a model is important for truly understanding the meaning of a
language. Let’s see an example to illustrate this. There are two sentences in this example
and both of them involve the word “bank”: