Machine Learning Techniques Course Outline

Course Content for Machine Learning Techniques
(BCAI-601, BCDS-062)
B. Tech 3rd Year
CSE(AI) / CSE(AI & ML) / CSE(DS) / CSE(IoT)
Vision of Institute
Mission of Institute
The mission of the institute is to educate young aspirants in various technical fields to fulfill the global requirement of human resources by providing sustainable quality education, training, and an invigorating environment, besides molding them into skilled, competent, and socially responsible citizens who will lead the building of a powerful nation.
EVALUATION SCHEME
* The Mini Project or Internship (4 weeks) will be done during summer break after VI Semester and will be
assessed during VII semester.
* It is desirable that students do their Summer Internship or Mini Project in their specialization area, in line with the [Link] program.
SEMESTER- VI
Departmental Elective-I
1. BCAI051 - Mathematical Foundation AI, ML and Data Science
2. BCS058 - Data Warehouse & Data Mining
3. BCS052 - Data Analytics
4. BCS054 - Object Oriented System Design with C++
Departmental Elective-II
1. BCAM051 - Cloud Computing
2. BCAI052 - Natural Language Processing
3. BCS056 - Application of Soft Computing
4. BCS057- Image Processing
Departmental Elective-III
1. BCAI061 - Cyber Forensic Analytics
2. BCDS061 - Image Analytics
3. BCAM061 - Social Media Analytics and Data Analysis
4. BCAM062 - Stream Processing and Analytics
BCAI-601 MACHINE LEARNING TECHNIQUES

Course Outcome (CO) | Bloom's Knowledge Level (KL)
CO1 | To understand the need for machine learning for various problem solving | K1, K2
CO2 | To understand a wide variety of learning algorithms and how to evaluate models generated from data | K1, K3
CO3 | To understand the latest trends in machine learning | K2, K3
CO4 | To design appropriate machine learning algorithms and apply the algorithms to a real-world problem | K4, K6
CO5 | To optimize the models learned and report on the expected accuracy that can be achieved by applying the models | K4, K5

L-T-P: 3-0-0
DETAILED SYLLABUS

Unit I (08 lectures)
INTRODUCTION - Learning, Types of Learning, Well defined learning problems, Designing a Learning System, History of ML, Introduction of Machine Learning Approaches - (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian networks, Support Vector Machine, Genetic Algorithm), Issues in Machine Learning and Data Science vs. Machine Learning.

Unit II (08 lectures)
REGRESSION: Linear Regression and Logistic Regression.
BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes classifier, Bayesian belief networks, EM algorithm.
SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel - (Linear kernel, polynomial kernel and Gaussian kernel), Hyperplane - (Decision surface), Properties of SVM, and Issues in SVM.

Unit III (08 lectures)
DECISION TREE LEARNING - Decision tree learning algorithm, Inductive bias, Inductive inference with decision trees, Entropy and information theory, Information gain, ID-3 Algorithm, Issues in Decision tree learning.
INSTANCE-BASED LEARNING - k-Nearest Neighbour Learning, Locally Weighted Regression, Radial basis function networks, Case-based learning.

Unit IV (08 lectures)
ARTIFICIAL NEURAL NETWORKS - Perceptrons, Multilayer perceptron, Gradient descent and the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm, Generalization, Unsupervised Learning - SOM Algorithm and its variant.
DEEP LEARNING - Introduction, concept of convolutional neural network, Types of layers - (Convolutional Layers, Activation function, pooling, fully connected), Concept of Convolution (1D and 2D) layers, Training of network, Case study of CNN, e.g., on Diabetic Retinopathy, building a smart speaker, self-driving car, etc.

Unit V (08 lectures)
REINFORCEMENT LEARNING - Introduction to Reinforcement Learning, Learning Task, Example of Reinforcement Learning in Practice, Learning Models for Reinforcement - (Markov Decision process, Q Learning - Q Learning function, Q Learning Algorithm), Application of Reinforcement Learning, Introduction to Deep Q Learning.
GENETIC ALGORITHMS: Introduction, Components, GA cycle of reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Applications.
Meerut Institute of Engineering & Technology, Meerut
Lesson Plan / Teaching Plan / Lecture Plan with Progress:
B. Tech - VI Semester: 2024-25
Topics/lectures are arranged in the same sequence as they are to be taught in class. Date-related progress data is maintained in the hard copy.
S. No | Lecture No | CO | Topic Description | Pedagogy | Reference & Teaching Material
1 | 1 | CO1 | Learning, Types of Learning, Well defined learning problems | White Board | 1, 2
2 | 2 | CO1 | Designing a Learning System | White Board | -
3 | 3 | CO1 | History of ML | White Board | 1, 2
4 | 4 | CO1 | Introduction of Machine Learning Approaches - Artificial Neural Network | PPT, White Board | 1
5 | 5 | CO1 | Clustering, Reinforcement Learning | White Board | 1
6 | 6 | CO1 | Decision Tree Learning, Bayesian networks | PPT, White Board | 1, 2
10 | 10 | CO2 | Bayes theorem, Concept learning, Bayes Optimal Classifier | PPT, White Board | 1
11 | 11 | CO2 | Naïve Bayes classifier, Bayesian belief networks | PPT, White Board | 1, 2
12 | 12 | CO2 | EM algorithm | PPT, White Board | 1, 2, 3
13 | 13 | CO2 | SUPPORT VECTOR MACHINE: Introduction | PPT, White Board | 1, 2
15 | 15 | CO2 | Hyperplane - (Decision surface) | PPT, White Board | 1, 2
16 | 16 | CO2 | Properties of SVM, and Issues in SVM | PPT, White Board | 1, 2, 3
17 | 17 | CO3 | Decision tree learning algorithm | PPT, White Board | 1, 2, 3
18 | 18 | CO3 | Inductive bias, Inductive inference with decision trees | PPT, White Board | 1, 2, 3
19 | 19 | CO3 | Entropy and information theory, Information gain | PPT, White Board | 1, 2
20 | 20 | CO3 | ID-3 Algorithm, Issues in Decision tree learning | PPT, White Board | 1, 2, 3
21 | 21 | CO3 | k-Nearest Neighbour Learning | PPT, White Board | 1, 2, 3
22 | 22 | CO3 | Locally Weighted Regression | PPT, White Board | 1, 2
23 | 23 | CO3 | Radial basis function networks | PPT, White Board | 1, 2, 3
24 | 24 | CO3 | Case-based learning | PPT, White Board | 1, 2, 3
25 | 25 | CO4 | Perceptrons, Multilayer perceptron | PPT, White Board | 1
28 | 28 | CO4 | Derivation of Backpropagation Algorithm | PPT, White Board | 1, 2
29 | 29 | CO4 | Generalization, Unsupervised Learning - SOM Algorithm and its variant | PPT, White Board | 1, 2
30 | 30 | CO4 | Introduction, concept of convolutional neural network, Types of layers - (Convolutional Layers, Activation function, pooling, fully connected) | PPT, White Board | 1, 2, 3
31 | 31 | CO4 | Concept of Convolution (1D and 2D) layers, Training of network | PPT, White Board | 1, 2
38 | 38 | CO5 | GA cycle of reproduction, Crossover | PPT, White Board | 1, 2
39 | 39 | CO5 | Mutation, Genetic Programming | PPT, White Board | 1, 2
40 | 40 | CO5 | Models of Evolution and Learning, Applications | PPT, White Board | 1, 2
41 | 41 | CO5 | Industrial Applications & Case Studies | PPT, White Board | 1, 2
Table of Contents

2.9 EM Algorithm in Machine Learning
2.9.1 What is an EM algorithm?
2.9.2 EM Algorithm
2.9.3 What is Convergence in the EM algorithm?
2.9.4 Steps in EM Algorithm
2.9.5 Gaussian Mixture Model (GMM)
2.9.6 Applications of EM algorithm
2.10 Support Vector Machine Algorithm
2.11 Types of SVM
2.11.1 Kernel Method in SVMs
2.11.2 Major Kernel Function in Support Vector Machine
2.12 Hyperplane and Support Vectors in the SVM algorithm
2.12.1 Support Vectors
2.13 Properties of SVM
2.13.1 The Disadvantages of Support Vector Machine (SVM)
2.14 Important Questions (Previous Year Questions)
UNIT 3 - Decision Tree Learning
3.1 Decision Tree Classification Algorithm
3.1.1 Why use Decision Trees?
3.1.2 How does the Decision Tree algorithm Work?
3.1.3 Attribute Selection Measures
3.2 Inductive Bias
3.3 Inductive inference with decision trees
3.4 What is Inductive Learning Algorithm?
3.5 Entropy and Information Gain
3.6 What is Information Gain?
3.7 Key Differences between Entropy and Information Gain
3.8 ID3 Algorithm
3.9 k-NN Learning
3.10 Locally Weighted Regression
3.11 Radial Basis Function Networks
3.11.1 What are Radial Basis Functions?
3.11.2 How Do RBF Networks Work?
3.11.3 Key Characteristics of RBFs
3.11.4 Architecture of RBF Networks
3.11.5 Training Process of radial basis function neural networks
3.12 Case-Based Learning
3.12.1 Challenges with CBR
3.13 Important Questions (PYQs)
UNIT 4 - Artificial Neural Networks
4.1 Perceptron in Machine Learning
4.1.1 What is the Perceptron model in Machine Learning?
4.1.2 What is a Binary classifier in Machine Learning?
4.1.3 Basic Components of Perceptron
4.2 How does Perceptron work?
4.3 Types of Perceptron Models
4.3.1 Advantages of Multi-Layer Perceptron
4.3.2 Disadvantages of Multi-Layer Perceptron
4.4 Gradient Descent in Machine Learning
4.4.1 What is Gradient Descent or Steepest Descent?
4.4.2 What is a Cost Function?
4.4.3 How does Gradient Descent work?
4.4.4 Direction & Learning Rate
4.4.5 Learning Rate
4.4.6 Types of Gradient Descent
4.5 Multilayer Networks
4.5.1 Formula for Multi-Layered Neural Network
4.6 Derivation of Backpropagation
4.6.1 Notation
4.7 Review of Calculus Rules
4.7.1 Gradient Descent on Error
4.7.2 Derivative of the error with respect to the activation
4.7.3 Derivative of the activation with respect to the net input
4.7.4 Derivative of the net input with respect to a weight
4.7.5 Weight change rule for a hidden-to-output weight
4.7.6 Weight change rule for an input-to-hidden weight
4.8 Generalization
4.8.1 Difference Between Memorization and Generalization
4.8.2 Generalization vs. Overfitting
4.8.3 Theoretical Foundations of Generalization
4.9 Self-Organizing Maps
4.9.1 How do SOMs work?
4.9.2 Algorithm
4.10 Convolutional Neural Network (CNN)
4.10.1 The importance of CNNs
4.10.2 Inspiration Behind CNN and Parallels With The Human Visual System
4.10.3 Key Components of a CNN
4.11 Convolution layers
4.11.1 Do we have to manually find these weights?
4.11.2 Activation function
4.11.3 Pooling layer
4.11.4 Fully connected layers
4.11.5 Overfitting and Regularization in CNNs
4.11.6 Seven strategies to mitigate overfitting in CNNs
4.11.7 Practical Applications of CNNs
4.12 Important Questions (PYQs)
UNIT 5 - Reinforcement Learning
5.1 What is Reinforcement Learning?
5.2 Terms used in Reinforcement Learning
5.3 Key Features of Reinforcement Learning
5.4 Approaches to implement Reinforcement Learning
5.4.1 Value-based
5.4.2 Policy-based
5.5 Elements of Reinforcement Learning
5.6 Reinforcement learning
5.7 Differentiate between reinforcement and supervised learning
5.8 Types of reinforcement learning
5.9 Different machine learning tasks
5.10 Reinforcement learning with the help of an example
5.10.1 Working of Reinforcement Learning
5.10.2 Terms used in the reinforcement learning method
5.10.3 Approaches used to implement reinforcement learning algorithms
5.11 Learning models of reinforcement learning
5.11.1 Challenges of reinforcement learning
5.12 Q-learning
5.12.1 Q-Learning algorithm
5.13 Applications of reinforcement learning
5.14 Describe deep Q-learning
5.14.1 Steps involved in reinforcement learning using deep Q-learning networks
5.14.2 Pseudo-code for deep Q-learning
5.15 Genetic algorithm
5.15.1 Procedure of Genetic algorithm
5.15.2 Advantages of genetic algorithm
5.15.3 Disadvantages of Genetic algorithm
5.16 Cycle of genetic algorithm
5.17 Mutation
5.18 Genetic Programming
5.18.1 Key Components of Genetic Programming
5.19 Types of encoding in Genetic Algorithm
5.20 The various methods of selection
5.21 Roulette-wheel based on fitness vs. Roulette-wheel based on rank
5.21.1 Roulette-wheel based on fitness
5.21.2 Roulette-wheel based on Rank
5.22 Applications of genetic algorithms
5.23 Industrial Applications
5.23.1 Optimization of travelling salesman problem using genetic algorithm
5.23.2 Convergence of genetic algorithm
5.24 Important Questions
UNIT 1 – Introduction to ML
1. INTRODUCTION
Lecture: 1
1.1 Machine Learning
Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on developing systems that learn, or improve performance, based on the data they ingest. Artificial intelligence is a broad term that refers to systems or machines that resemble human intelligence. Machine learning and AI are frequently discussed together, and the terms are occasionally used interchangeably, although they do not mean the same thing. A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.
In this topic, we will provide a detailed description of the types of Machine Learning along
with their respective algorithms.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), and so on. After training, we input the picture of a cat and ask the machine to identify the object and predict the output. Since the machine is now well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, and tail, and find that it is a cat. So it will put it in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression
• Classification
Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. Classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.
Some popular classification algorithms are given below:
• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm
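To make this concrete, here is a minimal scikit-learn sketch, an illustrative assumption rather than prescribed course material, that trains one of the classifiers listed above on a labelled toy dataset and evaluates it:

```python
# A minimal classification sketch using scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # labelled dataset: features X, classes y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)      # any classifier above could be swapped in
clf.fit(X_train, y_train)                    # learn the mapping x -> y from labels

y_pred = clf.predict(X_test)                 # predict categories for unseen data
print("accuracy:", accuracy_score(y_test, y_pred))
```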
• Regression
Regression algorithms are used to solve regression problems, in which there is a relationship between the input variables and a continuous output variable. They are used to predict continuous output variables, such as market trends, weather, etc.
Some popular Regression algorithms are given below:
• Lasso Regression
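As a hedged illustration of regression, the following sketch fits an ordinary linear regression model to synthetic data; the data and coefficients below are made up for the example:

```python
# A minimal regression sketch: predicting a continuous output (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is (roughly) a linear function of x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)       # fit continuous input -> output mapping
print(model.coef_, model.intercept_)       # should be close to 3.0 and 2.0
print(model.predict([[5.0]]))              # predict a continuous value
```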
Advantages:
1. Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
1. These algorithms are not able to solve complex tasks.
2. They may predict the wrong output if the test data differ from the training data.
3. Training requires a lot of computational time.
Clustering
The clustering technique is used when we want to find the inherent groups in the data. It is a way of grouping objects into clusters such that the objects with the most similarities remain in one group and have few or no similarities with the objects of other groups. An example of a clustering algorithm is grouping customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:
o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm
is to find the dependency of one data item on another data item and map those variables
accordingly so that it can generate maximum profit. Association rule learning is mainly applied in Market Basket analysis, Web usage mining, continuous production, etc.
Some popular association rule learning algorithms are the Apriori algorithm, Eclat, and the FP-growth algorithm.
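The sketch below shows one way to run Apriori-style association rule mining, assuming the third-party mlxtend library is installed; the basket data is hypothetical:

```python
# Sketch of association rule mining with the Apriori algorithm,
# assuming the third-party mlxtend library is available.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread"], ["milk", "butter"],
                ["milk", "bread", "butter"], ["bread", "butter"]]

te = TransactionEncoder()                        # one-hot encode the baskets
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```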
Advantages and Disadvantages of Unsupervised Learning Algorithms
Advantages:
o These algorithms can be used for more complicated tasks than supervised algorithms, because they work on unlabeled datasets.
o Unsupervised algorithms are preferable for many tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled one.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult, as it works with unlabelled data that does not map to a known output.
We can picture these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. Further, if that student analyses the same concept by himself without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student has to revise by himself after analysing the same concept under the guidance of an instructor at college.
Disadvantages:
o Iteration results may not be stable.
o These algorithms cannot be applied to network-level data.
o Accuracy is low.
1.4.4 Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning increases the tendency that the required behaviour will occur again by adding something. It strengthens the behaviour of the agent and impacts it positively.
o Negative Reinforcement Learning: Negative reinforcement learning works in exactly the opposite way to positive RL. It increases the tendency that the specific behaviour will occur again by avoiding a negative condition.
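The reward-driven adjustment described above can be sketched with a tabular Q-learning update (treated in detail in Unit V). The two-state environment below is entirely hypothetical:

```python
# Toy tabular Q-learning sketch: positive reward reinforces an action,
# negative reward suppresses it. The environment here is made up.
import random

n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma = 0.5, 0.9          # learning rate and discount factor

def step(state, action):
    """Hypothetical dynamics: action == state is rewarded (+1), else punished (-1)."""
    reward = 1.0 if action == state else -1.0
    next_state = random.randrange(n_states)
    return next_state, reward

state = 0
for _ in range(1000):
    action = random.randrange(n_actions)     # explore randomly
    next_state, reward = step(state, action)
    # Q-learning update: nudge Q toward reward + discounted best future value.
    best_next = max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
    state = next_state

print(Q)   # Q[s][s] ends up higher than Q[s][1-s]: rewarded behaviour is reinforced
```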
Reinforcement Learning Applications:
o Robotics:
RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas, and these robots are made more powerful with reinforcement learning. Different industries have a vision of building intelligent robots using AI and machine learning technology.
o Text Mining:
Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by the company Salesforce.
I. Advantages
o It helps in solving complex real-world problems which are difficult to solve by general techniques.
o The learning model of RL is similar to the way human beings learn; hence, highly accurate results can be obtained.
o It helps in achieving long-term results.
II. Disadvantages
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge amounts of data and computation.
o Too much reinforcement learning can lead to an overload of states, which can weaken the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.
Lecture: 2
1.5.1 Following are the qualities that you need to keep in mind while designing a learning system:
i. Reliability
The system must be capable of carrying out the proper task at the appropriate degree of performance in a given setting. Testing the reliability of ML systems that learn from data is challenging because a system's failure need not result in an error; instead, it could simply produce garbage results, meaning that some results were produced even though the system had not been trained with the corresponding ground truth.
When a typical system fails, you receive an error message, such as "the crew is addressing a technical issue and will return soon".
When a machine learning (ML) system fails, it usually does so silently. For instance, when translating from English to Hindi or vice versa, even if the model has not seen all of the words, it may nevertheless produce a translation that is illogical.
ii. Scalability
There should be practical methods for coping with the system's expansion as it changes (in
terms of data amount, traffic volume, or complexity). Because certain essential applications
might lose millions of dollars or their credibility with just one hour of outage or
failure, there should be an automated provision to grow computing and storage capacity.
For instance, if a feature on an e-commerce website fails to function as planned on a busy
day, it might result in a loss of millions of dollars in sales.
iii. Maintainability
The performance of the model may fluctuate as a result of changes in data distribution over time. The ML system should include a provision to first determine whether there is any model drift or data drift and, once major drift is noticed, to re-train/refresh and deploy new ML models without interfering with the system's present functioning.
iv. Adaptability
The availability of fresh data with increased features or changes in business objectives,
such as conversion rate vs. customer engagement time for e-commerce, are the other
changes that occur most frequently in machine learning (ML) systems. As a result, the
system has to be adaptable to fast upgrades without causing any service disruptions.
Data
i. Feature values should fall within expected ranges. For example, human age and height have expected value ranges and cannot be arbitrarily large (e.g., an age of 150+ years or a height of 10 feet). Feature expectations are recorded in a schema; the ranges of feature values are carefully captured to avoid any unanticipated value, which could produce a garbage answer.
ii. All features should be advantageous; features introduced to the system should be valuable in some way, such as being a predictor or an identifier, because each feature has a handling cost.
iii. No feature should cost more than it is worth; each new feature should be evaluated in terms of cost vs. benefit, in order to eliminate those that would be difficult to implement or manage.
iv. The data pipeline should have the necessary privacy protections in place; for instance, personally identifiable information (PII) should be managed carefully, because any leak of sensitive information may have legal repercussions.
v. If any new external component has an influence on the system, it should be easy to introduce new features to boost system performance.
vi. All input feature code, including one-hot encoding/binning features and the handling of unseen levels in one-hot encoded features, must be tested in order to prevent any intermediate values from departing from the desired range (see the sketch after this list).
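Point (vi) can be illustrated with scikit-learn's OneHotEncoder, which can be configured to tolerate category levels never seen during training; this is a minimal sketch, not the institute's prescribed tooling:

```python
# Handling unseen levels in one-hot encoded features (point vi above).
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")   # unseen levels -> all-zero row
enc.fit([["red"], ["green"], ["blue"]])        # training-time categories

# "purple" was never seen in training; instead of crashing, it encodes to zeros.
print(enc.transform([["green"], ["purple"]]).toarray())
```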
Model
1. Model specifications should be reviewed and submitted; for quicker re-training, correct versioning of the model learning code is required.
2. Correlation between offline and online metrics: model metrics (log loss, MAPE, MSE) should be strongly associated with the application's goal, such as revenue/cost/time.
3. Hyperparameters such as learning rates, the number of layers, the size of the layers, the maximum depth, and regularization coefficients must be tuned for the use case, because the choice of hyperparameter values can significantly affect the accuracy of predictions (a tuning sketch follows this list).
4. To support the most recent model in production, it is important to understand how frequently to retrain models depending on changes in data distribution, since model staleness has a known impact.
5. Simple linear models with high-level features are a good starting point for functional testing and for doing cost-benefit analyses against more complex models. However, a simpler model is not always better.
6. Model performance must be assessed using adequately representative data to ensure that model quality is satisfactory on significant data slices.
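As a minimal sketch of the hyperparameter tuning mentioned in point 3, the following uses cross-validated grid search over an assumed, illustrative parameter grid:

```python
# Sketch of hyperparameter tuning (point 3 above) with cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)                      # evaluates every parameter combination
print(search.best_params_, search.best_score_)
```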
Lecture: 3
1.6 History of Machine Learning
A few decades ago (about 40-50 years), machine learning was science fiction; today it is part of our daily life. Machine learning is making our day-to-day life easier, from self-driving cars to Amazon's virtual assistant "Alexa". The idea behind machine learning, however, is quite old and has a long history. Below are some milestones in the history of machine learning:
1.6.1 The early history of Machine Learning (Pre-1940):
1834: Charles Babbage, the father of the computer, conceived a device that could be programmed with punch cards. The machine was never built, but all modern computers rely on its logical structure.
1936: Alan Turing proposed a theory of how a machine can determine and execute a set of instructions.
The era of stored-program computers:
1940s: "ENIAC", the first electronic general-purpose computer, was built. Stored-program computers such as EDSAC (1949) and EDVAC (1951) followed.
1943: A biological neural network was modeled with an electrical circuit. Around 1950, scientists started applying this idea and analyzed how human neurons might work.
1.6.2 Computing machinery and intelligence:
1950: Alan Turing published a seminal paper, "Computing Machinery and Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can machines think?"
Machine Learning from theory to reality
1959: The first neural network was applied to a real-world problem: removing echoes over phone lines using an adaptive filter.
1985: Terry Sejnowski and Charles Rosenberg invented NETtalk, a neural network that was able to teach itself how to correctly pronounce 20,000 words in one week.
1997: IBM's Deep Blue computer won a chess match against chess grandmaster Garry Kasparov, becoming the first computer to beat a human world chess champion.
Machine Learning in the 21st century
2006:
Geoffrey Hinton and his group presented the idea of deep learning using deep belief networks.
Amazon launched the Elastic Compute Cloud (EC2), providing scalable computing resources that made it easier to create and deploy machine learning models.
2007:
The Netflix Prize competition began, tasking participants with improving the accuracy of Netflix's recommendation algorithm.
Reinforcement learning made significant progress when a group of researchers used it to train a computer to play backgammon at an expert level.
2008:
Google released the Google Prediction API, a cloud-based service that allowed developers to integrate machine learning into their applications.
Restricted Boltzmann Machines (RBMs), a kind of generative neural network, gained attention for their ability to model complex data distributions.
2009:
Deep learning gained ground as researchers demonstrated its effectiveness in various tasks, including speech recognition and image classification.
The term "Big Data" gained popularity, highlighting the challenges and opportunities associated with handling huge datasets.
2010:
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was introduced, driving advances in computer vision and prompting the development of deep convolutional neural networks (CNNs).
2011:
IBM's Watson defeated human champions on Jeopardy!, demonstrating the potential of question-answering systems and natural language processing.
2012:
AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC, substantially improving image classification accuracy and establishing deep learning as a dominant approach in computer vision.
Google's Brain project, led by Andrew Ng and Jeff Dean, used deep learning to train a neural network to recognize cats from unlabeled YouTube videos.
2013:
Ian Goodfellow introduced generative adversarial networks (GANs), which made it possible to create realistic synthetic data.
Google later acquired the startup DeepMind Technologies, which focused on deep learning and artificial intelligence.
2014:
Facebook presented the DeepFace system, which achieved near-human accuracy in facial recognition.
AlphaGo, a program created by DeepMind at Google, defeated a world-champion Go player and demonstrated the potential of reinforcement learning in challenging games.
2015:
Microsoft released the Cognitive Toolkit (previously known as CNTK), an open-source deep learning library.
The introduction of attention mechanisms enhanced the performance of sequence-to-sequence models in tasks such as machine translation.
2016:
Explainable AI, which focuses on making machine learning models easier to understand, received growing attention.
Google's DeepMind created AlphaGo Zero, which achieved superhuman Go play without human knowledge, using only reinforcement learning.
2017:
Transfer learning gained prominence, allowing pretrained models to be reused for different tasks with limited data.
Generative models such as variational autoencoders (VAEs) and Wasserstein GANs enabled better synthesis and generation of complex data.
These are only some of the notable advances and milestones in machine learning during this period. The field continued to advance rapidly beyond 2017, with new breakthroughs, techniques, and applications emerging.
Lecture: 4
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
1.7.2 Relationship between biological and artificial neural networks:

Biological Neural Network | Artificial Neural Network
Dendrites | Inputs
Cell nucleus | Nodes
Synapse | Weights
Axon | Output
Figure 1.0.2: Layered Representation of an ANN
Advantages of Artificial Neural Network:
I. Parallel processing capability:
Artificial neural networks can perform more than one task simultaneously.
II. Storing data on the entire network:
Data is stored on the whole network, not in a database as in traditional programming. The disappearance of a few pieces of data in one place does not prevent the network from working.
III. Capability to work with incomplete knowledge:
After training, an ANN may produce output even with inadequate data. The loss of performance depends on the significance of the missing data.
IV. Having a memory distribution:
For an ANN to be able to adapt, it is important to determine the examples and to train the network according to the desired output by showing these examples to the network. The success of the network is directly proportional to the chosen instances, and if an event cannot be presented to the network in all its aspects, the network can produce false output.
V. Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating output; this feature makes the network fault-tolerant.
Disadvantages of Artificial Neural Network:
I. Assurance of proper network structure:
There is no particular guideline for determining the structure of an artificial neural network. The appropriate network structure is achieved through experience and trial and error.
II. Unrecognized behavior of the network:
This is the most significant issue of ANNs. When an ANN produces a solution, it does not provide insight into why and how, which decreases trust in the network.
III. Hardware dependence:
Artificial neural networks need processors with parallel processing power, in accordance with their structure. The realization therefore depends on suitable hardware.
IV. Difficulty of showing the issue to the network:
ANNs can work only with numerical data. Problems must be converted into numerical values before being introduced to the ANN. The representation mechanism chosen here directly impacts the performance of the network and relies on the user's abilities.
V. The duration of the network is unknown:
The network is reduced to a specific value of the error, and this value does not give us optimum results.
1.7.4 How do artificial neural networks work?
An artificial neural network can best be represented as a weighted directed graph, where the artificial neurons form the nodes. The associations between neuron outputs and neuron inputs can be viewed as directed edges with weights. The artificial neural network receives the input signal from an external source in the form of a pattern or image, represented as a vector. These inputs are then mathematically denoted x(n) for every n-th input.
Afterward, each input is multiplied by its corresponding weight (these weights are the details the artificial neural network uses to solve a specific problem). In general terms, these weights represent the strength of the interconnections between neurons inside the artificial neural network. All the weighted inputs are summed inside the computing unit.
If the weighted sum is zero, a bias is added to make the output non-zero, or to otherwise scale up the system's response. The bias acts as an extra input fixed at 1 with its own weight. The total of the weighted inputs can range from 0 to positive infinity. To keep the response within the limits of the desired value, a certain maximum value is benchmarked, and the total of the weighted inputs is passed through an activation function.
The activation function refers to the set of transfer functions used to achieve the desired output. There are different kinds of activation functions, primarily either linear or non-linear sets of functions. Some of the commonly used activation functions are the binary, linear, and tan hyperbolic sigmoidal activation functions.
Let us take a look at each of them in detail:
Binary:
In a binary activation function, the output is either a one or a zero. To accomplish this, a threshold value is set up. If the net weighted input of the neuron is more than the threshold, the final output of the activation function is returned as one; otherwise the output is returned as zero.
Sigmoidal Hyperbolic:
The sigmoidal hyperbola function is generally seen as an "S"-shaped curve. Here the tan hyperbolic function is used to approximate output from the actual net input. The function is defined as:
F(x) = 1 / (1 + exp(-λx))
where λ is a steepness parameter.
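The weighted-sum-plus-activation pipeline described above can be sketched in a few lines of NumPy; the input values and weights below are arbitrary examples:

```python
# Sketch of the weighted-sum + activation pipeline described above.
import numpy as np

def binary_step(net, threshold=0.0):
    """Binary activation: 1 if the net input exceeds the threshold, else 0."""
    return 1 if net > threshold else 0

def sigmoid(net, lam=1.0):
    """Sigmoidal activation F(x) = 1 / (1 + exp(-lambda * x))."""
    return 1.0 / (1.0 + np.exp(-lam * net))

x = np.array([0.5, -1.2, 3.0])       # inputs x(n), arbitrary example values
w = np.array([0.4, 0.1, 0.7])        # corresponding weights
b = 0.5                              # bias term (input fixed at 1)

net = np.dot(x, w) + b               # weighted sum inside the computing unit
print(binary_step(net), sigmoid(net))
```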
Feedback ANN:
In this type of ANN, the output returns into the network to achieve the best-evolved results internally. Feedback networks feed information back into themselves and are well suited to solving optimization problems. Internal system error corrections utilize feedback ANNs.
Feed-Forward ANN:
Lecture: 5
1.8 Clustering
Clustering, or cluster analysis, is a machine learning technique which groups an unlabeled dataset. It can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, and behavior, and divides the data according to the presence and absence of those patterns.
It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it deals with an unlabeled dataset.
After applying the clustering technique, each cluster or group is given a cluster-ID. The ML system can use this ID to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Note: Clustering is somewhere similar to the classification algorithm, but the difference
is the type of dataset that we are using. In classification, we work with the labeled data
set, whereas in clustering, we work with the unlabeled dataset.
Example: Let's understand the clustering technique with the real-world example of a mall: when we visit a shopping mall, we can observe that things with similar usage are grouped together, such as t-shirts in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are grouped in separate sections so that we can find things easily. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique can be widely used in various tasks. Some most common uses
of this technique are:
I. Market Segmentation
II. Statistical data analysis
III. Social network analysis
IV. Image segmentation
V. Anomaly detection, etc.
The working of a clustering algorithm can be pictured as different fruits being divided into several groups with similar properties.
Clustering methods are broadly divided into Hard clustering (a data point belongs to only one group) and Soft clustering (a data point can also belong to another group). There are also various other clustering approaches. Below are the main clustering methods used in machine learning:
I. Partitioning Clustering
II. Density-Based Clustering
III. Distribution Model-Based Clustering
IV. Hierarchical Clustering
V. Fuzzy Clustering
V. Fuzzy Clustering
i. Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is the
K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points within a cluster is minimal compared to the distance to other cluster centroids.
Figure 1.0.2: Partitioning Clustering
ii. Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. The algorithm does this by identifying different clusters in the dataset and connecting the areas of high density into clusters. The dense areas in data space are separated from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has varying densities and high dimensionality.
iii. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
v. Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data object has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
1.8.2 Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many clustering algorithms have been published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using: some algorithms need to guess the number of clusters in the given dataset, whereas others need to find the minimum distance between observations in the dataset.
Here we discuss the popular clustering algorithms that are widely used in machine learning:
i. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset by dividing the samples into different clusters of equal variances. The number of clusters must be specified in this algorithm. It is fast, with fewer computations required, and has linear complexity O(n). A minimal sketch follows this list.
ii. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the center of the points within a given region.
iii. DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift, but with some remarkable advantages. In this algorithm, areas of high density are separated by areas of low density. Because of this, clusters can be found in any arbitrary shape.
iv. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means may fail. In GMM, it is assumed that the data points are Gaussian distributed.
v. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
vi. Affinity Propagation: This algorithm differs from other clustering algorithms in that it does not require the number of clusters to be specified. Pairs of data points exchange messages until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
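As a minimal sketch of the K-Means algorithm from item (i) above, the following clusters synthetic two-dimensional data; the blob locations are made up for illustration:

```python
# Minimal K-Means sketch on synthetic 2-D data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical blobs of points around (0, 0) and (5, 5).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)               # learned centroids, near (0,0) and (5,5)
print(km.labels_[:5], km.labels_[-5:])   # cluster-IDs assigned to each point
```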
Applications of Clustering:
i. In identification of cancer cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide cancerous and non-cancerous data points into different groups.
ii. In search engines: Search engines also work on the clustering technique. The search result appears based on the object closest to the search query. This is done by grouping similar data objects in one group that is far from the other, dissimilar objects. The accuracy of a query's result depends on the quality of the clustering algorithm used.
iii. Customer segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
iv. In biology: It is used in biology to classify different species of plants and animals using image recognition techniques.
v. In land use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.
Lecture: 6
1.9.1 Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it
is easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
1.9.2 Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
The splitting continues until the nodes cannot be classified further; the final node is called a leaf node.
Example: Suppose a candidate has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
I. Information Gain: Information gain measures the reduction in entropy after the dataset
is split on an attribute; the attribute with the highest information gain is chosen for the split.
II. Gini Index:
The Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create those
binary splits.
The Gini index can be calculated using the formula:
Gini Index = 1 - Σj (Pj)²
where Pj is the proportion of instances of class j in the node.
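As a quick illustration of the formula, here is a small Python sketch (the helper name gini_index is ours, not from any library):

def gini_index(labels):
    # Gini impurity: 1 - sum over classes j of (P_j)^2
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_index(["yes"] * 6 + ["no"] * 2))  # 1 - (0.75^2 + 0.25^2) = 0.375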
1.9.5 Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning
tree without reducing accuracy is therefore known as pruning. There are mainly two types
of tree pruning techniques used:
I. Cost Complexity Pruning
II. Reduced Error Pruning
Advantages of the Decision Tree:
It is simple to understand, as it follows the same process which a human follows while
making any decision in real life.
It can be very useful for solving decision-related problems.
It helps to think about all the possible outcomes for a problem.
There is less requirement of data cleaning compared to other algorithms.
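As a minimal sketch of these ideas in practice, the following Python example (using scikit-learn and its built-in Iris dataset; the ccp_alpha value is an illustrative assumption) trains a CART-style decision tree with the Gini index and cost-complexity pruning:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" selects splits by Gini index; ccp_alpha > 0 turns on
# cost-complexity pruning (one of the two pruning techniques listed above)
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))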
1.10 Bayesian Belief Network in artificial intelligence
A Bayesian belief network is a key technology for dealing with probabilistic events and
for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian
model.
Bayesian networks are probabilistic because they are built from a probability
distribution, and they also use probability theory for prediction and anomaly detection.
Real-world applications are probabilistic in nature, and to represent the relationships
between multiple events we need a Bayesian network. It can be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time-series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions,
and it consists of two parts:
I. Directed Acyclic Graph
II. Table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
Each node corresponds to a random variable, and a variable can be continuous or
discrete.
Arcs or directed arrows represent the causal relationships or conditional probabilities
between random variables. These directed links or arrows connect pairs of nodes in
the graph.
A link represents that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.
In the above diagram, A, B, C, and D are random variables represented by the nodes of the
network graph.
If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is
known as a directed acyclic graph, or DAG.
The Bayesian network has mainly two components:
I. Causal Component
II. Actual numbers
Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
Bayesian network is based on Joint probability distribution and conditional
probability. So let's first understand the joint probability distribution:
1.10.2 Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed
acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds to a burglary, but it also responds to minor earthquakes. Harry
has two neighbors, David and Sophia, who have taken the responsibility of informing Harry
at work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused by the phone ringing and calls then too. On the other
hand, Sophia likes to listen to loud music, so sometimes she misses the alarm.
Here we would like to compute the probability of the burglar alarm going off.
Problem:
Calculate the probability that the alarm has sounded, but there is neither a burglary nor an
earthquake, and both David and Sophia called Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure
shows that Burglary and Earthquake are the parent nodes of Alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls depend on
the alarm probability.
The network represents the assumptions that David and Sophia do not perceive the burglary
directly, do not notice the minor earthquake, and do not confer with each other before calling.
The conditional distributions for each node are given as a conditional probabilities table,
or CPT.
Each row in the CPT must sum to 1, because the entries in the row represent an
exhaustive set of cases for the variable.
In a CPT, a boolean variable with k boolean parents contains 2^k probability entries. Hence, if
there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
Burglary (B)
Earthquake(E)
Alarm(A)
David Calls(D)
Sophia calls(S)
We can write the events of the problem statement in the form of the joint probability
P[D, S, A, B, E], which can be rewritten using the chain rule of probability:
P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]
= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002, which is the probability of a burglary.
P(B = False) = 0.998, which is the probability of no burglary.
P(E = True) = 0.001, which is the probability of a minor earthquake.
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
We can provide the conditional probabilities as per the tables below:
Conditional probability table for Alarm (A):
The conditional probability of Alarm A depends on Burglary and Earthquake:
Table 1.2: CPT for Alarm (A)
The conditional probability that David calls depends on the state of the Alarm (Table 1.3),
and the conditional probability that Sophia calls likewise depends on the Alarm:
Table 1.4: CPT for Sophia's calls (S)
From the formula of the joint distribution, we can write the problem statement in the
form of a probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
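The query above can be reproduced in a few lines of Python; this is a minimal sketch that uses only the probabilities quoted in the text (P(S|A) = 0.75, P(D|A) = 0.91, P(A|¬B, ¬E) = 0.001):

# Priors from the text
p_b = 0.002        # P(Burglary = True)
p_e = 0.001        # P(Earthquake = True)
# Conditional probabilities from the CPTs used in the text
p_a_nb_ne = 0.001  # P(Alarm | no burglary, no earthquake)
p_d_a = 0.91       # P(David calls | Alarm)
p_s_a = 0.75       # P(Sophia calls | Alarm)

joint = p_s_a * p_d_a * p_a_nb_ne * (1 - p_b) * (1 - p_e)
print(joint)  # ≈ 0.00068045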
1.10.3 The semantics of Bayesian Network:
There are two ways to understand the semantics of a Bayesian network, which are
given below:
1. To understand the network as a representation of the joint probability
distribution. This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional
independence statements. This view is helpful for designing inference procedures.
Lecture: 7
Example: SVM can be understood with the example that we used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs; if we want
a model that can accurately identify whether it is a cat or a dog, such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn the different features of cats and dogs, and then
we test it with this strange creature. Since the support vector machine creates a decision
boundary between these two classes (cat and dog) and chooses the extreme cases (support
vectors), it will look at the extreme cases of cat and dog and, on the basis of the support
vectors, classify the creature as a cat. Consider the below diagram:
Figure 1.0.2: Prediction using SVM
SVM algorithm can be used for Face detection, image classification, text categorization,
etc.
We always create a hyperplane that has the maximum margin, which means the maximum
distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.
Since this is a 2-D space, we can easily separate these two classes just by using a straight
line. But there can be multiple lines that separate these classes. Consider the below
image:
Figure 1.0.4: Hyperplane
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is
to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, we can separate it by using a straight line, but for non-linear
data we cannot draw a single straight line. Consider the below image:
Figure 1.0.6: Non-Linear SVM
So, to separate these data points, we need to add one more dimension. For linear data we
have used the two dimensions x and y, so for non-linear data we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as in the below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Figure 1.0.8: Drawing Hyperplane
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-y
plane. If we convert it back to 2-D space with z = 1, it will become as:
1.12.1 Selection
After calculating the fitness of every individual in the population, a selection process is used
to determine which individuals in the population will get to reproduce and create the
offspring that will form the next generation.
Types of selection methods available:
i. Roulette wheel selection
ii. Tournament selection
iii. Rank-based selection
So, now we can define a genetic algorithm as a heuristic search algorithm to solve
optimization problems. It is a subset of evolutionary algorithms, which is used in
computing. A genetic algorithm uses genetic and natural selection concepts to solve
optimization problems.
I. Initialization
The process of a genetic algorithm starts by generating a set of individuals, which is
called the population. Each individual is a solution to the given problem. An
individual is characterized by a set of parameters called genes. Genes are
combined into a string to form a chromosome, which represents the solution to the problem.
One of the most popular techniques for initialization is the use of random binary strings.
Figure 1.0.1: Genetic Components
II. Fitness Function
The fitness function is used to determine how fit an individual is, i.e., the ability of an
individual to compete with other individuals. In every iteration, individuals are evaluated
based on their fitness function. The fitness function provides a fitness score to each
individual, and this score determines the probability of being selected for
reproduction: the higher the fitness score, the greater the chance of getting selected for
reproduction.
III. Selection
The selection phase involves the selection of individuals for the reproduction of offspring.
All the selected individuals are arranged in pairs of two to enhance reproduction, and
these individuals transfer their genes to the next generation.
There are three types of Selection methods available, which are:
1. Roulette wheel selection
2. Tournament selection
3. Rank-based selection
IV. Reproduction
After the selection process, the creation of a child occurs in the reproduction step. In
this step, the genetic algorithm uses two variation operators that are applied to the
parent population. The two operators involved in the reproduction phase are given below:
Crossover: Crossover plays the most significant role in the reproduction phase of the
genetic algorithm. In this process, a crossover point is selected at random within the genes.
Then the crossover operator swaps the genetic information of two parents from the current
generation to produce a new individual representing the offspring. The genes of the
parents are exchanged among themselves until the crossover point is reached, and the
newly generated offspring are added to the population. This process is also called
recombination or crossover. Types of crossover methods available:
One-point crossover
Two-point crossover
Uniform crossover
Arithmetic crossover
1.12.2 Mutation
The mutation operator inserts random genes into the offspring (new child) to maintain
diversity in the population. It can be done by flipping some bits in the chromosome.
Mutation helps in solving the issue of premature convergence and enhances
diversification. The below image shows the mutation process:
Types of mutation methods available:
Flip-bit mutation
Gaussian mutation
Exchange/Swap mutation
V. Termination
After the reproduction phase, a stopping criterion is applied as the basis for termination.
The algorithm terminates after the threshold fitness solution is reached, and it identifies
the final solution as the best solution in the population.
General Workflow of a Simple Genetic Algorithm
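As a minimal Python sketch of this workflow (initialization, fitness evaluation, tournament selection, one-point crossover, flip-bit mutation, termination), consider the toy "one-max" problem of maximizing the number of 1-bits in a binary chromosome; the fitness function and all parameter values are illustrative assumptions:

import random

GENES, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.01

def fitness(chrom):
    return sum(chrom)  # one-max toy fitness: count of 1-bits

# Initialization: random binary strings
pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    def select():  # tournament selection with tournament size 3
        return max(random.sample(pop, 3), key=fitness)
    next_pop = []
    while len(next_pop) < POP_SIZE:
        p1, p2 = select(), select()
        point = random.randint(1, GENES - 1)   # one-point crossover
        child = p1[:point] + p2[point:]
        # flip-bit mutation
        child = [1 - g if random.random() < MUT_RATE else g for g in child]
        next_pop.append(child)
    pop = next_pop  # termination here is a fixed generation budget

print("best fitness:", fitness(max(pop, key=fitness)))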
Limitations of Genetic Algorithms
i. Genetic algorithms are not efficient algorithms for solving simple problems.
ii. It does not guarantee the quality of the final solution to a problem.
iii. Repetitive calculation of fitness values may generate some computational challenges.
Difference between Genetic Algorithms and Traditional Algorithms
i. A search space is the set of all possible solutions to the problem. In a traditional
algorithm, only one set of solutions is maintained, whereas in a genetic algorithm, several
sets of solutions in the search space can be used.
ii. Traditional algorithms need more information in order to perform a search, whereas genetic
algorithms need only one objective function to calculate the fitness of an individual.
iii. Traditional algorithms cannot work in parallel, whereas genetic algorithms can work
in parallel (calculating the fitness of the individuals is independent).
iv. One big difference is that rather than operating directly on candidate solutions,
genetic algorithms operate on their representations (or encodings), frequently
referred to as chromosomes.
v. Relatedly, one of the big differences between a traditional algorithm and a genetic
algorithm is that the latter does not directly operate on candidate solutions.
vi. Traditional algorithms can only generate one result in the end, whereas genetic
algorithms can generate multiple optimal results from different generations.
vii. A traditional algorithm is not very likely to generate optimal results, whereas genetic
algorithms, while not guaranteeing globally optimal results, have a great possibility of
finding an optimal result for a problem because they use genetic operators such as
crossover and mutation.
viii. Traditional algorithms are deterministic in nature, whereas genetic algorithms are
probabilistic and stochastic in nature.
Lecture: 8
i. Inadequate training data
o Noisy Data - Noise is responsible for inaccurate predictions that affect the decision as well
as the accuracy of classification tasks.
o Incorrect data - Incorrect data is likewise responsible for faulty programming and faulty
results in machine learning models; hence, it too may affect the accuracy of the results.
o Generalizing output data - Sometimes generalizing output data becomes complex, which
results in comparatively poor future actions.
ii. Poor quality of data
As discussed above, data plays a significant role in machine learning, and it
must be of good quality. Noisy, incomplete, inaccurate, and unclean data lead to less
accuracy in classification and low-quality results. Hence, data quality can also be
considered a major common problem while processing machine learning algorithms.
iii. Non-representative training data
To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases to which we need to generalize. The
training data must cover all the cases that have already occurred as well as those that are
occurring.
Further, if we use non-representative training data in the model, it results in less
accurate predictions. A machine learning model is said to be ideal if it predicts well for
generalized cases and provides accurate decisions. If there is too little training data, there
will be sampling noise in the model, called a non-representative training set; the model
will be biased toward one class or group, and it won't be accurate in its predictions.
Hence, we should use representative data in training to protect against bias and to
make accurate predictions without any drift.
Overfitting:
Overfitting is one of the most common issues faced by machine learning engineers
and data scientists. Whenever a machine learning model is trained on a huge amount
of data, it starts capturing the noise and inaccurate entries in the training data set, which
negatively affects the performance of the model. Let's understand this with a simple
example: suppose we have training data containing 1,000 mangoes, 1,000 apples,
1,000 bananas, and 5,000 papayas. There is then a considerable probability of
identifying an apple as a papaya, because we have a massive amount of biased data in
the training data set, and the prediction is negatively affected. One main reason behind
overfitting is the use of highly flexible non-linear methods in machine learning
algorithms, as they can build unrealistic data models; it can be reduced by using linear
and parametric algorithms in the machine learning models.
Methods to reduce overfitting include using more training data, reducing model
complexity (e.g., removing irrelevant features), early stopping, regularization, and
cross-validation.
However, we can overcome this by regularly updating and monitoring data according
to the expectations.
o Analyze data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.
o Use multi-pass annotation such as sentiment analysis, content moderation, and
intent recognition.
xi. Lack of Explainability
This basically means that the outputs cannot be easily comprehended, as the model is
programmed in specific ways to deliver outputs for certain conditions. Hence, a lack of
explainability is also found in machine learning algorithms, which reduces the
credibility of the algorithms.
Data Science vs. Machine Learning
Tools and Technologies
Data Science: Python, R, SQL, Tableau, Hadoop, etc.
Machine Learning: Python, R, TensorFlow, Scikit-Learn, PyTorch, etc.
Processes Involved
Data Science: Data cleaning, data analysis, data visualization, and model interpretation.
Machine Learning: Data preprocessing, model training, model testing, and deployment.
Applications
Data Science: Market analysis, data reporting, business analytics, predictive modeling.
Machine Learning: Predictive analytics, speech recognition, recommendation systems, self-driving cars.
Skills Required
Data Science: Statistical analysis, data visualization, big data platforms, domain-specific knowledge.
Machine Learning: Deep understanding of algorithms, neural networks, statistical modeling, and natural language processing.
End Goal
Data Science: To extract insights and knowledge from data in various formats.
Machine Learning: To enable machines to learn from data so they can provide accurate predictions and decisions.
Career Path
Data Science: Data Analyst, Data Scientist, Data Engineer, Business Analyst.
Machine Learning: Machine Learning Engineer, AI Engineer, Research Scientist, Data Scientist.
1.15 Important Questions (Previous Year Questions)
UNIT 2- Regression, Bayesian Network, SVM
REGRESSION
Lecture: 9
Figure 2.1: Linear Regression
In the above image, the dependent variable (salary) is on the Y-axis and the independent
variable (experience) is on the X-axis. The regression line can be written as:
y = a0 + a1x + ε
where a0 is the intercept of the line, a1 is the linear regression coefficient (the slope),
and ε is the random error term.
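A minimal Python sketch of fitting this line with scikit-learn (the salary/experience numbers below are made-up illustrative values, not real data):

import numpy as np
from sklearn.linear_model import LinearRegression

experience = np.array([[1], [2], [3], [4], [5]])        # independent variable x
salary = np.array([30000, 35000, 41000, 46000, 52000])  # dependent variable y

model = LinearRegression().fit(experience, salary)
print("a0 (intercept):", model.intercept_)
print("a1 (slope):", model.coef_[0])
print("predicted salary at 6 years:", model.predict([[6]])[0])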
2.2 Logistic Regression:
Logistic regression is one of the most popular machine learning algorithms that come
under the supervised learning technique.
It can be used for classification as well as for regression problems, but it is mainly used
for classification problems.
Logistic regression is used to predict a categorical dependent variable with the help
of independent variables.
The output of a logistic regression problem can only lie between 0 and 1.
Logistic regression can be used where the probabilities of two classes are
required, such as whether it will rain today or not: either 0 or 1, true or false, etc.
Logistic regression is based on the concept of maximum likelihood estimation.
According to this estimation, the observed data should be the most probable.
In logistic regression, we pass the weighted sum of inputs through an activation function
that maps values to between 0 and 1. This activation function is known as the sigmoid
function, and the curve obtained is called the sigmoid curve or S-curve. Consider the below
image:
o The equation for logistic regression is:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
where y is the predicted probability, the bi are the coefficients, and the xi are the
independent variables.
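A minimal Python sketch (made-up rain/no-rain data; the feature and all values are illustrative assumptions) showing how the model outputs probabilities between 0 and 1:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # maps any real value into (0, 1)

X = np.array([[20], [25], [30], [35], [40], [45]])  # e.g. a humidity-like feature
y = np.array([0, 0, 0, 1, 1, 1])                    # 0 = no rain, 1 = rain

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[33]]))  # [P(class 0), P(class 1)], both in (0, 1)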
2.3.1 Difference between Linear Regression and Logistic Regression:
In Linear regression, it is required that the relationship between the dependent variable
and the independent variable is linear. In Logistic regression, it is not required to have a
linear relationship between the dependent and independent variables.
Lecture: 10
o According to the product rule, we can express the probability of event X occurring
with known event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y) {equation 1}
o Further, the probability of event Y with known event X:
P(X ∩ Y) = P(Y|X) P(X) {equation 2}
Mathematically, Bayes' theorem is obtained by equating the right-hand sides of the two
equations and dividing by P(Y). We get:
P(X|Y) = P(Y|X) P(X) / P(Y)
The above equation is called Bayes' Rule or Bayes' Theorem.
o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated
probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence given that the
hypothesis is true.
o P(X) is called the prior probability, i.e., the probability of the hypothesis before
considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence
under any consideration.
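A minimal Python sketch of this rule (the helper name and the example numbers are illustrative assumptions), using the marginal-probability expansion of P(Y):

def bayes_posterior(p_y_given_x, p_x, p_y_given_not_x):
    # P(Y) expanded as P(Y|X)P(X) + P(Y|~X)P(~X)
    p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)
    return p_y_given_x * p_x / p_y  # P(X|Y) = P(Y|X)P(X) / P(Y)

# e.g. prior P(X) = 0.01, likelihood P(Y|X) = 0.9, P(Y|~X) = 0.05
print(bayes_posterior(0.9, 0.01, 0.05))  # ≈ 0.154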
ii. Sample Space
During an experiment, what we get as a result is called a possible outcome, and the set of
all possible outcomes of an experiment is known as the sample space. For example, if we
are rolling a die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is related to tossing a coin and recording its outcome,
then the sample space will be:
S2 = {Head, Tail}
iii. Event
An event is defined as a subset of the sample space in an experiment. Further, it is also
called a set of outcomes.
Assume in our experiment of rolling a die there are two events A and B such that:
A = the event that an even number is obtained = {2, 4, 6}
B = the event that a number greater than 4 is obtained = {5, 6}
A ∩ B = {6}
o Disjoint Events: If the intersection of events A and B is an empty (null) set, then
such events are known as disjoint or mutually exclusive events.
iv. Random Variable:
It is a real-valued function which maps the sample space of an experiment to the real line.
A random variable takes on some random values, each value having some probability.
However, it is neither random nor a variable; it behaves as a function which can be
discrete, continuous, or a combination of both.
v. Exhaustive Events:
As the name suggests, a set of events in which at least one event occurs at a time is called
an exhaustive set of events for an experiment.
Thus, two events A and B are said to be exhaustive if either A or B definitely occurs and
both are mutually exclusive; e.g., while tossing a coin, the result will be either a
Head or a Tail.
vi. Independent Events:
Two events are said to be independent when the occurrence of one event does not affect
the occurrence of the other event. In simple words, the probability of the outcome of one
event does not depend on the other.
Mathematically, two events A and B are said to be independent if:
P(A ∩ B) = P(AB) = P(A) * P(B)
vii. Conditional Probability:
Conditional probability is defined as the probability of an event A, given that another
event B has already occurred (i.e., A conditional on B). It is represented by P(A|B) and we
can define it as:
P(A|B) = P(A ∩ B) / P(B)
viii. Marginal Probability:
Marginal probability is defined as the probability of an event A occurring independent
of any other event B. Further, it is considered as the probability of evidence under any
consideration.
P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
Here ~B represents the event that B does not occur.
Concept learning can be viewed as the task of searching through a large space
of potential hypotheses for the hypothesis that best fits the training examples.
What is Concept Learning…?
“A task of acquiring potential hypothesis (solution) that best fits the given training
examples.”
Consider the example task of learning the target concept “days on which my friend
Prabhas enjoys his favorite water sport.”
Below Table describes a set of example days, each represented by a set of attributes. The
attribute EnjoySport indicates whether or not Prabhas enjoys his favorite water sport
on this day. The task is to learn to predict the value of EnjoySport for an arbitrary day,
based on the values of its other attributes.
What hypothesis representation shall we provide to the learner in this case?
Let us begin by considering a simple representation in which each hypothesis consists of
a conjunction of constraints on the instance attributes.
In particular, let each hypothesis be a vector of six constraints, specifying the values of
the six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
For each attribute, the hypothesis will either
• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "ø" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as
a positive example (h(x) = 1).
To illustrate, the hypothesis that Prabhas enjoys his favorite sport only on cold days with
high humidity (independent of the values of the other attributes) is represented by the
expression
(?, Cold, High, ?, ?, ?)
Most General and Specific Hypothesis
The most general hypothesis-that every day is a positive example-is represented by (?,
?, ?, ?, ?, ?) and the most specific possible hypothesis-that no day is a positive example-
is represented by (ø, ø, ø, ø, ø, ø).
Instance Space
Sky can take three possible values, and each of the other five attributes can take two, so
the instance space X contains 3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
Hypothesis Space
Similarly, there are 5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H
(each attribute may take its possible values plus "?" and "ø").
Notice, however, that every hypothesis containing one or more "ø" symbols represents
the empty set of instances; that is, it classifies every instance as negative.
Therefore, the number of semantically distinct hypotheses is only 1 + (4 · 3 · 3 · 3 · 3 · 3) =
973.
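These counts are easy to verify in a couple of lines of Python:

instance_space = 3 * 2 * 2 * 2 * 2 * 2     # Sky has 3 values, the rest 2 each
syntactic = 5 * 4 * 4 * 4 * 4 * 4          # each attribute: its values plus "?" and "ø"
semantic = 1 + 4 * 3 * 3 * 3 * 3 * 3       # all ø-hypotheses collapse into one
print(instance_space, syntactic, semantic)  # 96 5120 973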
Our EnjoySport example is a very simple learning task, with a relatively small,
finite hypothesis space.
2.6 Bayes Optimal Classifier and Naive Bayes Classifier
The Bayes optimal classifier is a probabilistic model that predicts the most likely
outcome for a new situation. In this section, we have a look at the Bayes optimal
classifier and the Naive Bayes classifier.
The Bayes theorem is a method for calculating a hypothesis's probability based on its prior
probability, the probabilities of observing specific data given the hypothesis, and the
observed data itself.
Lecture: 11
Directed Acyclic Graph
Each node corresponds to a random variable, and a variable can be continuous or
discrete.
Arcs or directed arrows represent the causal relationships or conditional probabilities
between random variables. These directed links or arrows connect pairs of nodes in
the graph.
A link represents that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.
In the above diagram, A, B, C, and D are random variables represented by the nodes of
the network graph.
If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of node B.
Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is
known as a directed acyclic graph, or DAG.
The Bayesian network has mainly two components:
i. Causal Component
ii. Actual numbers
Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
A Bayesian network is based on the joint probability distribution and conditional
probability. So let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different
combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written in terms of the joint probability distribution as follows:
= P[x1 | x2, x3, ..., xn] · P[x2, x3, ..., xn]
In general, for each variable Xi we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Lecture: 12
2.9.2 EM Algorithm
The EM algorithm underlies various unsupervised ML algorithms, such as the k-means
clustering algorithm. Being an iterative approach, it alternates between two
modes. In the first mode, we estimate the missing or latent variables; hence it is referred
to as the expectation/estimation step (E-step). The other mode optimizes the
parameters of the model so that it can explain the data more clearly; it is known as the
maximization step (M-step).
iii. Expectation step (E-step): It involves the estimation (guess) of all missing values in
the dataset, so that after completing this step there are no missing values.
iv. Maximization step (M-step): This step involves the use of the data estimated in the
E-step to update the parameters.
v. Repeat the E-step and M-step until the values converge.
The primary goal of the EM algorithm is to use the available observed data of the dataset
to estimate the missing data of the latent variables, and then to use that data to update the
values of the parameters in the M-step.
2.9.3 What is Convergence in the EM algorithm?
Convergence is the intuitive probabilistic notion that two random variables whose
probabilities differ by only a very small amount are said to have converged. In other
words, whenever the values of the given variables match each other, it is called
convergence.
2.9.4 Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps: the Initialization step, the Expectation
step, the Maximization step, and the Convergence step. These steps are explained as
follows:
vi. 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from
a specific model.
vii. 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or guess
the values of the missing or incomplete data using the observed data. Further, E-step
primarily updates the variables.
viii. 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily updates
the hypothesis.
ix. 4th step: The last step is to check if the values of latent variables are converging or not. If
it gets "yes", then stop the process; else, repeat the process from step 2 until the
convergence occurs.
2.9.5 Gaussian Mixture Model (GMM)
The Gaussian mixture model, or GMM, is defined as a mixture model that combines
several Gaussian probability distribution functions with unspecified parameters. GMM
therefore requires statistics such as the mean and standard deviation (the parameters) to
be estimated. It is used to estimate the parameters of the probability distributions that
best fit the density of a given training dataset. Although there are plenty of techniques
available to estimate the parameters of a Gaussian mixture model (GMM), maximum
likelihood estimation is one of the most popular among them.
Let's understand a case where we have a dataset with multiple data points generated
by two different processes. Both processes produce similar Gaussian probability
distributions, and the data are combined; hence it is very difficult to discriminate which
distribution a given point belongs to.
The process used to generate each data point represents a latent variable, i.e.,
unobservable data. In such cases, the Expectation-Maximization algorithm is one of the
best techniques for estimating the parameters of the Gaussian distributions. In the EM
algorithm, the E-step estimates the expected value for each latent variable, whereas the
M-step optimizes the parameters using maximum likelihood estimation (MLE).
This process is repeated until a good set of latent values and a maximum likelihood
that fits the data are achieved.
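A minimal Python sketch of this scenario (two made-up Gaussian processes; all parameter values are illustrative assumptions), using scikit-learn's GaussianMixture, which runs EM internally:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Latent variable: which of the two processes generated each point
data = np.concatenate([rng.normal(0, 1, 500),
                       rng.normal(5, 1.5, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(data)
print("estimated means:", gmm.means_.ravel())  # ≈ [0, 5]
print("converged:", gmm.converged_, "after", gmm.n_iter_, "EM iterations")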
x. The EM algorithm is applicable in data clustering in machine learning. It is often used in
computer vision and NLP (Natural language processing).
xi. It is used to estimate the value of the parameter in mixed models such as the
Gaussian Mixture Model and quantitative genetics.
xii. It is also used in psychometrics for estimating item parameters and latent abilities of item
response theory models.
xiii. It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
xiv. It is used to estimate the parameters of a Gaussian density function.
Advantages of EM algorithm
i. It is very easy to implement the two basic steps of the EM algorithm, the E-step and
the M-step, in various machine learning problems.
ii. It is mostly guaranteed that the likelihood will increase after each iteration.
iii. It often generates a solution for the M-step in the closed form.
Disadvantages of EM algorithm
i. The convergence of the EM algorithm is very slow.
ii. It may converge only to a local optimum.
iii. It takes both forward and backward probabilities into consideration, in contrast to
numerical optimization, which considers only forward probabilities.
Lecture: 13
2.10 Support Vector Machine Algorithm
Support Vector Machine, or SVM, is one of the most popular supervised learning
algorithms, used for classification as well as regression problems. Primarily, however,
it is used for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that we can easily put a new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the below diagram, in which two different categories are
classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs; if we want
a model that can accurately identify whether it is a cat or a dog, such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn the different features of cats and dogs, and then
we test it with this strange creature. Since the support vector machine creates a decision
boundary between these two classes (cat and dog) and chooses the extreme cases (support
vectors), it will look at the extreme cases of cat and dog and, on the basis of the support
vectors, classify the creature as a cat. Consider the below diagram:
Figure 2.12: Data Flow
SVM algorithm can be used for Face detection, image classification, text categorization,
etc.
Lecture: 14
2.11 Types of SVM
SVM can be of two types:
i. Linear SVM: Linear SVM is used for linearly separable data: if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
ii. Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a
dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called the Non-linear SVM classifier.
Major Kernel Functions in Support Vector Machine
What is Kernel Method?
A set of techniques known as kernel methods are used in machine learning to address
classification, regression, and other prediction issues. They are built around the idea of
kernels, which are functions that gauge how similar two data points are to one another
in a high-dimensional feature space.
The fundamental premise of kernel methods is to convert the input data into a high-
dimensional feature space, which makes it simpler to distinguish between classes or
generate predictions. Kernel methods employ a kernel function to implicitly map the data
into the feature space, as opposed to manually computing the feature space.
The most popular kind of kernel approach is the Support Vector Machine (SVM), a binary
classifier that determines the best hyperplane that most effectively divides the two
groups. In order to efficiently locate the ideal hyperplane, SVMs map the input into a
higher-dimensional space using a kernel function.
Other examples of kernel methods include kernel ridge regression, kernel PCA, and
Gaussian processes. Since they are strong, adaptable, and computationally efficient, kernel
approaches are frequently employed in machine learning. They are resilient to noise
and outliers and can handle sophisticated data structures like strings and graphs.
2.11.1 Kernel Method in SVMs
Support Vector Machines (SVMs) use kernel methods to transform the input data into
a higher-dimensional feature space, which makes it simpler to distinguish between classes
or generate predictions. Kernel approaches in SVMs work on the fundamental principle
of implicitly mapping input data into a higher-dimensional feature space without directly
computing the coordinates of the data points in that space.
The kernel function in SVMs is essential in determining the decision boundary that divides
the various classes. In order to calculate the degree of similarity between any two points
in the feature space, the kernel function computes their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial basis function
(RBF) kernel. The RBF kernel maps the input data into an infinite- dimensional feature
space using a Gaussian function. This kernel function is popular because it can capture
complex nonlinear relationships in the data.
Other types of kernel functions that can be used in SVMs include the polynomial kernel,
the sigmoid kernel, and the Laplacian kernel. The choice of kernel function depends on
the specific problem and the characteristics of the data.
Basically, kernel methods in SVMs are a powerful technique for solving classification and
regression problems, and they are widely used in machine learning because they can
handle complex data structures and are robust to noise and outliers.
v. Reproducing property: A kernel function satisfies the reproducing property if it
can be used to reconstruct the input data in the feature space.
vi. Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.
vii. Complexity: The complexity of a kernel function is an important consideration,
as more complex kernel functions may lead to overfitting and reduced
generalization performance.
viii. Basically, the choice of kernel function depends on the specific problem and the
characteristics of the data, and selecting an appropriate kernel function can
significantly impact the performance of machine learning algorithms.
2.11.2 Major Kernel Function in Support Vector Machine
In Support Vector Machines (SVMs), there are several types of kernel functions that can
be used to map the input data into a higher-dimensional feature space. The choice of kernel
function depends on the specific problem and the characteristics of the data.
Here are some most commonly used kernel functions in SVMs:
Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function,
and it defines the dot product between the input vectors in the original feature space.
The linear kernel can be defined as:
K(x, y) = x · y
Where x and y are the input feature vectors. The dot product of the input vectors is a
measure of their similarity or distance in the original feature space.
When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that
separates the different classes in the feature space. This linear boundary can be useful
when the data is already separable by a linear decision boundary or when dealing with
high-dimensional data, where the use of more complex kernel functions may lead to
overfitting.
Polynomial Kernel
A polynomial kernel is a particular kind of kernel function utilised in machine learning,
such as in SVMs (Support Vector Machines). It is a nonlinear kernel function that
employs polynomial functions to map the input data into a higher-dimensional feature
space.
One definition of the polynomial kernel is:
K(x, y) = (x · y + c)^d
where x and y are the input feature vectors, c is a constant term, and d is the degree
of the polynomial. The constant term is added to the dot product of the input vectors,
and the sum is raised to the degree of the polynomial.
The decision boundary of an SVM with a polynomial kernel might capture more intricate
correlations between the input characteristics because it is a nonlinear hyperplane.
The degree of nonlinearity in the decision boundary is determined by the degree of the
polynomial.
The polynomial kernel has the benefit of being able to detect both linear and nonlinear
correlations in the data. It can be difficult to select the proper degree of the polynomial,
though, as a larger degree can result in overfitting while a lower degree cannot adequately
represent the underlying relationships in the data.
In general, the polynomial kernel is an effective tool for converting the input data
into a higher-dimensional feature space in order to capture nonlinear correlations between
the input characteristics.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a
popular kernel function used in machine learning, particularly in SVMs (Support Vector
Machines). It is a nonlinear kernel function that maps the input data into a higher-
dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:
K(x, y) = exp(-gamma · ||x - y||²)
where x and y are the input feature vectors, gamma is a parameter that controls the width
of the Gaussian function, and ||x - y||² is the squared Euclidean distance between the input
vectors.
When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear hyper
plane that can capture complex nonlinear relationships between the input features.
The width of the Gaussian function, controlled by the gamma parameter, determines the
degree of nonlinearity in the decision boundary.
One advantage of the Gaussian kernel is its ability to capture complex relationships in
the data without the need for explicit feature engineering. However, the choice of the
gamma parameter can be challenging, as a smaller value may result in underfitting, while
a larger value may result in overfitting.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type
of kernel function used in machine learning, including in SVMs (Support Vector
Machines). It is a non-parametric kernel that can be used to measure the similarity or
distance between two input feature vectors.
The Laplacian kernel can be defined as:
K(x, y) = exp(-gamma · ||x - y||)
where x and y are the input feature vectors, gamma is a parameter that controls the width
of the Laplacian function, and ||x - y|| is the L1 norm, or Manhattan distance, between the
input vectors.
When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear
hyperplane that can capture complex relationships between the input features. The width
of the Laplacian function, controlled by the gamma parameter, determines the degree
of nonlinearity in the decision boundary.
One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight
on large distances between the input vectors than the Gaussian kernel. However, like the
Gaussian kernel, choosing the correct value of the gamma parameter can be challenging.
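A minimal Python sketch comparing these kernels on a non-linearly separable toy dataset (scikit-learn's SVC supports the linear, polynomial, and RBF kernels directly; the Laplacian kernel is not built in, and all parameter values below are illustrative assumptions):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 3}), ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, **params).fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))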
Lecture: 15
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green
or blue. Consider the below image:
Since this is a 2-D space, we can easily separate these two classes just by using a straight
line. But there can be multiple lines that separate these classes. Consider the below
image:
Figure 2.13: Hyperplane
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is
to maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
Non-Linear SVM:
If data is linearly arranged, we can separate it by using a straight line, but for non-linear
data we cannot draw a single straight line. Consider the below image:
Figure 2.15: Non-Linear SVM
So, to separate these data points, we need to add one more dimension. For linear data we
have used the two dimensions x and y, so for non-linear data we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as in the below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Figure 2.16: Best Hyperplane
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-y
plane. If we convert it back to 2-D space with z = 1, it will become as:
Lecture: 16
2.13 Properties of SVM
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets
only support vectors are used to specify the separating hyperplane
Ability to handle large feature spaces
Complexity does not depend on the dimensionality of the feature space
Overfitting can be controlled by soft margin approach
Nice math property: a simple convex optimization problem which
is guaranteed to converge to a single global solution
Feature Selection
iii. More features, more complexity
The more features are taken into consideration, the more dimensions come into play.
If the number of features is much greater than the number of samples, avoiding
over-fitting through the choice of kernel function and regularization term becomes
crucial.
2.14 Important Questions (Previous Year Question)
Q1: Explain Linear, Polynomial and Gaussian Kernel (Radial Basis Function) in detail.
Q2: Differentiate between Linear Regression and Logistic Regression.
Q3: What are the types of Logistics Regression?
Q4: Describe briefly Linear Regression and Logistic Regression.
Q5: What is the assumption in Naïve Bayesian Algorithm that makes it different from
Bayesian Theorem?
Q6: Discuss the various properties and issues of SVM.
Q7: Why is SVM an example of a large margin classifier? Discuss the different kernel
functions used in SVM.
Q8: Explain the EM algorithm with the necessary steps.
Q9: Write short note on “Bayesian Belief Networks”.
Q10: What is Bayesian Learning? Explain how the decision error for Bayesian
Classification is minimized.
Q11: Define Bayes Classifier. Explain how Classification is done using Bayes Classifier.
Q12: Discuss Bayes Classifier using some examples in detail.
Q13: Explain Naïve Bayes Classifier.
Q14: Describe the Usage, Advantages and Disadvantages of EM Algorithm.
Q15: How is the Bayesian Network powerful representation for uncertainty knowledge?
Explain with example.
Q16: Explain the role Prior Probability and Posterior Probability in Bayesian
Classification.
Q17: Explain the types and properties of Support Vector Machine.
Q18: What are the parameters used in Support Vector Classifier?
Q19: What are the Advantages and Disadvantages of Support Vector Machines?
Q20: Write a short note on Hyperplane (Decision Surface).
UNIT 3 – Decision Tree Learning
Figure 3.2: Decision Tree Example
The ILA (Inductive Learning Algorithm) uses the method of producing a general set of
rules instead of decision trees, which overcomes the above problems.
Basic Requirements to Apply Inductive Learning Algorithm
i. List the examples in the form of a table ‘T’ where each row corresponds to an example
and each column contains an attribute value.
ii. Create a set of m training examples, each example composed of k attributes and a
class attribute with n possible decisions.
iii. Create a rule set, R, having the initial value false.
iv. Initially, all rows in the table are unmarked.
Necessary Steps for Implementation
Step 1: Divide the table 'T' containing m examples into n sub-tables (t1, t2, ..., tn),
one table for each possible value of the class attribute. (Repeat steps 2-8 for each
sub-table.)
Step 2: Initialize the attribute combination count ‘ j ‘ = 1.
Step 3: For the sub-table on which work is going on, divide the attribute list into
distinct combinations, each combination with ‘j ‘ distinct attributes.
Step 4: For each combination of attributes, count the number of occurrences of
attribute values that appear under the same combination of attributes in unmarked
rows of the sub-table under consideration, and that at the same time do not appear
under the same combination of attributes in the other sub-tables. Call the first
combination with the maximum number of occurrences the max-combination 'MAX'.
Step 5: If ‘MAX’ == null, increase ‘ j ‘ by 1 and go to Step 3.
Step 6: Mark all rows of the sub-table being processed in which the values of 'MAX'
appear as classified.
Step 7: Add a rule (IF attribute = “XYZ” –> THEN decision is YES/ NO) to R whose
left-hand side will have attribute names of the ‘MAX’ with their values separated by
AND, and its right-hand side contains the decision attribute value associated with
the sub-table.
Step 8: If all rows are marked as classified, then move on to process another sub-table
and go to Step 2. Else, go to Step 4. If no sub-tables are available, exit with the set
of rules obtained till then.
Lecture: 19
Entropy vs. Information Gain
Entropy: Entropy is a measurement of the disorder or impurity of a set of occurrences. It
determines the usual amount of information needed to classify a sample taken from the
collection.
Information Gain: Information gain is a metric for the entropy reduction brought about by
segmenting a set of instances according to a feature. It gauges the amount of knowledge
a characteristic imparts to the class of an example.
Entropy: Entropy measures the disorder or impurity present in a collection of instances
and aims to be minimized by identifying the ideal division.
Information Gain: The objective of information gain is to maximize the utility of a feature
for categorization by choosing the split with the maximum information gain.
Entropy: Calculating probabilities and logarithms, which can be computationally costly,
is necessary to determine entropy.
Information Gain: Entropies and weighted averages must be calculated in order to compute
information gain, which can be computationally costly.
Entropy: If there are too many characteristics or the tree is too deep, entropy might result
in overfitting.
Information Gain: If the tree is too deep or there are too many irrelevant characteristics,
information gain may potentially result in overfitting.
Lecture: 20
This section explains the ID3 algorithm (one of the many algorithms used to build
decision trees) in detail, using a small fictitious COVID-19 dataset.
The picture above depicts a decision tree that is used to classify whether a person
is Fit or Unfit.
The decision nodes here are questions like 'Is the person less than 30 years of
age?', 'Does the person eat junk food?', etc., and the leaves are one of the two possible
outcomes, viz. Fit and Unfit.
Looking at the decision tree, we can make the following decisions:
if a person is less than 30 years of age and doesn't eat junk food then he is Fit; if a
person is less than 30 years of age and eats junk food then he is Unfit, and so on.
The initial node is called the root node (colored blue), the final nodes are called
the leaf nodes (colored green), and the rest of the nodes are
called intermediate or internal nodes.
The root and intermediate nodes represent the decisions, while the leaf nodes
represent the outcomes.
ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes(divides) features into two or more groups at
each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision
tree. In simple words, the top-down approach means that we start building the tree
from the top and the greedy approach means that at each iteration we select the best
feature at the present moment to create a node.
ID3 is generally used only for classification problems with nominal features.
Dataset description
In this section, we'll be using a sample dataset of COVID-19 infection. A preview of
the dataset is shown below.
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing Issues | Infected |
+----+-------+-------+------------------+----------+
| 1  | NO    | NO    | NO               | NO       |
+----+-------+-------+------------------+----------+
| 3  | YES   | YES   | NO               | NO       |
+----+-------+-------+------------------+----------+
| 6  | NO    | YES   | NO               | NO       |
+----+-------+-------+------------------+----------+
| 11 | NO    | YES   | NO               | NO       |
+----+-------+-------+------------------+----------+
| 13 | NO    | YES   | YES              | NO       |
+----+-------+-------+------------------+----------+
| 14 | YES   | YES   | NO               | NO       |
+----+-------+-------+------------------+----------+
(The remaining rows of the 14-row dataset are not shown here; overall, 8 of the 14 rows
have Infected = YES and 6 have Infected = NO, which is what the entropy calculations
below use.)
The columns are self-explanatory. Y and N stand for Yes and No respectively. The
values, or classes, in the Infected column (Y and N) represent Infected and Not Infected
respectively.
The columns used to make decision nodes, viz. 'Breathing Issues', 'Cough' and
'Fever', are called feature columns or just features, and the column used for leaf nodes,
i.e. 'Infected', is called the target column.
Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature at each step
while building a Decision tree.
Before you ask, the answer to the question: ‘How does ID3 select the best feature?’
is that ID3 uses Information Gain or just Gain to find the best feature.
Information Gain calculates the reduction in the entropy and measures how well a
given feature separates or classifies the target classes. The feature with the highest
Information Gain is selected as the best one.
In simple words, Entropy is the measure of disorder, and the Entropy of a dataset is
the measure of disorder in the target feature of the dataset.
In the case of binary classification (where the target column has only two types of
classes), entropy is 0 if all values in the target column are homogeneous (similar) and
1 if the target column has an equal number of values for both classes.
We denote our dataset as S; its entropy is calculated as:
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
where,
n is the total number of classes in the target column
(in our case n = 2 i.e YES and NO)
pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the
target column” to the “total number of rows” in the dataset.
Information Gain for a feature column A is calculated as:
IG(S, A) = Entropy(S) − ∑ᵥ (|Sᵥ| / |S|) * Entropy(Sᵥ)
where the sum runs over each value v of feature A, and Sᵥ is the subset of rows for which A has value v.
To calculate IG(S, Fever), we first look at the subset of rows where Fever is YES.
In the 8 rows with YES for Fever, there are 6 rows having target value YES and 2 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing Issues | Infected |
+-------+-------+------------------+----------+
| YES   | YES   | NO               | NO       |
| YES   | YES   | YES              | YES      |
| YES   | YES   | NO               | NO       |
| …     | …     | …                | …        |
+-------+-------+------------------+----------+
As shown below, in the 6 rows with NO, there are 2 rows having target
value YES and 4 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing Issues | Infected |
+-------+-------+------------------+----------+
| NO    | NO    | NO               | NO       |
| NO    | YES   | NO               | NO       |
| NO    | YES   | YES              | YES      |
| NO    | YES   | NO               | NO       |
| NO    | YES   | YES              | NO       |
| …     | …     | …                | …        |
+-------+-------+------------------+----------+
The block below demonstrates the calculation of Information Gain for Fever.
# total rows
|S| = 14
Entropy(S) = 0.99
IG(S, Fever) = Entropy(S) − (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) − (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)
∴ IG(S, Fever) = 0.99 − (8/14) * 0.81 − (6/14) * 0.91 = 0.13
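To make this calculation concrete, here is a minimal Python sketch that reproduces the numbers above. The class counts (8 YES / 6 NO overall, 6/2 in the Fever = YES subset, 2/4 in the Fever = NO subset) are read directly off the worked example.

import math

def entropy(counts):
    # Entropy of a class distribution given as a list of counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

H_S   = entropy([8, 6])   # Entropy(S)    ≈ 0.99
H_yes = entropy([6, 2])   # Entropy(Sʏᴇꜱ) ≈ 0.81
H_no  = entropy([2, 4])   # Entropy(Sɴᴏ)  ≈ 0.91
ig_fever = H_S - (8 / 14) * H_yes - (6 / 14) * H_no
print(round(ig_fever, 2))  # 0.13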
Next, we calculate the IG for the features “Cough” and “Breathing Issues”:
IG(S, Cough) = 0.04
Of the three features, Breathing Issues has the highest Information Gain, so it is selected as the best feature and becomes the root node of the tree.
Next, from the remaining two unused features, namely, Fever and Cough, we decide
which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will work with the subset
of the original data, i.e. the set of rows having YES as the value in the Breathing Issues
column. These 8 rows are shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing Issues | Infected |
+-------+-------+------------------+----------+
| YES   | NO    | YES              | YES      |
| NO    | YES   | YES              | NO       |
| …     | …     | …                | …        |
+-------+-------+------------------+----------+
Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ
(Set Breathing Issues Yes), which is shown above.
Note: For this IG calculation the Entropy is calculated from the subset Sʙʏ, not
from the original dataset S.
Fever turns out to have the higher Information Gain on Sʙʏ, so it becomes the node for
the left branch of Breathing Issues.
Next, we would find the feature with the maximum IG for the right branch of Breathing
Issues. But since there is only one unused feature left, Cough, we have no other choice but
to make it the right branch of the root node.
So our tree now looks like this:
There are no more unused features, so we stop here and jump to the final step of
creating the leaf nodes.
For the left leaf node of Fever, we look at the subset of rows from the original data set
that have both Breathing Issues and Fever values as YES.
Since all the values in the target column are YES, we label the left leaf node as YES,
but to make it more logical we label it Infected.
Similarly, for the right node of Fever we see the subset of rows from the original
data set that have Breathing Issues value as YES and Fever as NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing Issues | Infected |
+-------+-------+------------------+----------+
| NO    | YES   | YES              | NO       |
| NO    | YES   | YES              | NO       |
| …     | …     | …                | …        |
+-------+-------+------------------+----------+
Here not all but most of the values are NO, hence NO or Not Infected becomes
our right leaf node.
Our tree, now, looks like this:
We repeat the same process for the node Cough, however here both left and right
leaves turn out to be the same i.e. NO or Not Infected as shown below:
Figure 3.7: Example
Lecture: 21
3.9 k-NN Learning
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what’s
nearby. Imagine a streaming service wants to predict if a new user is likely to cancel
their subscription (churn) based on their age. It checks the ages of its existing
users and whether they churned or stayed. If most of the K users closest in age to the
new user canceled their subscription, KNN will predict the new user might churn too.
The key idea is that users with similar ages tend to have similar behaviors and KNN
uses this closeness to make decisions.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead, it stores the dataset and, at the time
of classification, it performs an action on the dataset.
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the
algorithm how many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you’re deciding which fruit a new item is based on its shape and size. With
k = 3, you compare it to the 3 most similar fruits you already know.
If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is
an apple, because most of its neighbours are apples.
Several methods can be used to select a good value of k:
Cross-Validation: A robust method for selecting the best k is to perform k-fold cross-
validation. This involves splitting the data into several folds, training the model on some
folds, testing it on the remaining one, and repeating this for each fold. The
value of k that results in the highest average validation accuracy is usually the best
choice.
Elbow Method: In the elbow method we plot the model’s error rate or accuracy for
different values of k. As we increase k, the error usually decreases initially; however,
after a certain point the error rate starts to change more slowly. The point where the
curve forms an “elbow” is considered the best k.
Odd Values for k: It is also recommended to choose an odd value for k, especially in
classification tasks, to avoid ties when deciding the majority class.
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then
used for the classification or regression task. To identify the nearest neighbours we use the
distance metrics below:
1. Euclidean Distance
This is the straight-line distance between two points.
distance(x, Xi) = √( ∑ (xj − Xij)² ) ; j = 1 to d
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal
and vertical lines (like a grid or city streets). It’s also called “taxicab distance”
because a taxi can only drive along the grid-like streets of a city.
d(x, y) = ∑ |xi − yi| ; i = 1 to n
3. Minkowski Distance
d(x, y) = ( ∑ |xi − yi|ᵖ )^(1/p) ; i = 1 to n
From the formula above we can say that when p = 2 then it is the same as the formula
for the Euclidean distance and when p = 1 then we obtain the formula for the
Manhattan distance.
So, you can think of Minkowski as a flexible distance formula that can look like
either Manhattan or Euclidean distance depending on the value of p.
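The following is a minimal Python sketch of the ideas above: a Minkowski distance function (p = 1 gives Manhattan, p = 2 gives Euclidean) and a majority-vote k-NN classifier. The fruit data is hypothetical, invented only for illustration.

import numpy as np
from collections import Counter

def minkowski(x, y, p=2):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
    return float(np.sum(np.abs(x - y) ** p) ** (1 / p))

def knn_predict(X_train, y_train, query, k=3, p=2):
    # Classify `query` by majority vote among its k nearest neighbours.
    distances = [minkowski(x, query, p) for x in X_train]
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical fruit data: [size in cm, roundness score].
X_train = np.array([[7.0, 0.9], [7.5, 0.95], [6.8, 0.88], [15.0, 0.3]])
y_train = ["apple", "apple", "apple", "banana"]
print(knn_predict(X_train, y_train, np.array([7.2, 0.9])))  # apple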
Lecture: 22
3.10 Locally Weighted Regression
Locally Weighted Linear Regression (LWLR) is a non-parametric, memory-
based algorithm designed to capture non-linear relationships in data. Unlike
traditional regression models that fit a single global line across the dataset, LWLR
creates localized models for subsets of data points near the query point. Each query
point has its own regression line based on weighted contributions from nearby data
points.
LWLR assigns weights to data points based on their proximity to the query point: points near the query receive large weights, while distant points receive weights close to zero.
This approach allows LWLR to adapt to local data structures, making it effective for
modeling non-linear relationships.
Global Linear Regression: Fits a single line to the entire dataset, assuming a uniform
relationship across all data points.
For instance, in predicting housing prices, LWLR can handle neighborhoods with
distinct pricing trends better than global linear regression, which might oversimplify
the relationship.
Let’s consider a scenario where we want to predict housing prices based on the size
of the house. In a dataset, the relationship between size and price might vary across
different neighborhoods due to local factors like amenities or location.
Global Linear Regression: Fits a single line to the entire dataset, assuming a uniform
relationship between size and price across all neighborhoods. This may lead to
inaccurate predictions in areas where the relationship deviates.
Locally Weighted Linear Regression: Focuses on the specific neighborhood by
giving more weight to houses closer in size to the query house. This results in a better
prediction tailored to the local trends.
Visualization
LWLR would fit smaller localized lines that closely follow the variations in data for
each neighborhood.
This adaptability allows LWLR to model more complex relationships, such as sharp
changes in housing prices in specific regions.
The process of Locally Weighted Linear Regression involves several key steps,
ensuring the model captures local patterns effectively.
Preprocess the data by handling missing values and normalizing features to ensure a
consistent scale, which improves the weighting process.
Kernel Function: Determines how weights are assigned to data points based on their
distance from the query point. Common choices include:
The Gaussian kernel: wᵢ = exp( −(xᵢ − x)² / (2τ²) )
Here, wᵢ is the weight for data point xᵢ, x is the query point, and τ (the bandwidth) controls the rate of weight decay.
Bandwidth (τ): A critical parameter that governs how localized the regression is:
Small τ: Focuses on nearby points, capturing finer details but risking overfitting.
Large τ: Includes more distant points, reducing variance but increasing bias.
3. Weight Calculation
For a given query point x, compute weights for all data points using the chosen kernel
function. Points closer to x will have higher weights.
4. Model Fitting
Using the computed weights, fit a weighted least squares regression to the data.
The goal is to minimize the weighted sum of squared errors:
J(θ) = ∑ᵢ wᵢ (yᵢ − θᵀxᵢ)²
5. Prediction
Once the localized model is fitted, use it to predict the target value for the query
point.
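Here is a minimal NumPy sketch of these five steps for one-dimensional data, assuming the Gaussian kernel above and the closed-form weighted least squares solution θ = (AᵀWA)⁻¹AᵀWy; the sine-shaped data is invented for illustration.

import numpy as np

def lwlr_predict(X, y, x_query, tau=0.5):
    # Locally weighted linear regression prediction at one query point.
    m = X.shape[0]
    A = np.column_stack([np.ones(m), X])               # design matrix with intercept
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))   # Gaussian weights (step 3)
    W = np.diag(w)
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # weighted least squares (step 4)
    return theta[0] + theta[1] * x_query               # prediction (step 5)

# Hypothetical non-linear data: y = sin(x) plus a little noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)
print(lwlr_predict(X, y, x_query=3.0))  # close to sin(3.0) ≈ 0.14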
Lecture: 23
3.11 Radial Basis Function Networks
Lecture: 24
3.12 Case-Based Learning
Case-based reasoning (CBR) is used in applications such as:
Problem resolution for customer service help desks, where cases describe product-
related diagnostic problems.
Engineering and law, where cases are either technical designs or legal rulings,
respectively.
Medical education, where patient case histories and treatments are used to help
diagnose and treat new patients.
Challenges in case-based reasoning include:
Finding a good similarity metric (e.g. for matching subgraphs) and suitable
methods for combining solutions.
Selecting salient features for indexing training cases and developing
efficient indexing techniques.
CBR becomes more intelligent as the number of stored cases grows, but there is a
trade-off between accuracy and efficiency: after a certain point, the system’s
efficiency will suffer as the time required to search for and process relevant cases
increases.
3.13 Important Questions (PYQs)
Q1: Explain ID3 Algorithm.
Q2: What is the limitation of Decision Tree?
Q3: Discuss why we use SVM Kernels and in which scenario which SVM kernel is used?
Q4: Discuss the various issues of Decision tree.
Q5: Explain instance based learning with representation?
Q6: How Locally Weighted Regression is different from Radial Basis function networks?
Q7: Explain KNN Algorithm with suitable example.
Q8: Differentiate between Lazy and Eager Learning
Q9: Illustrate the operation of the ID3 training example. Consider the information gain as
attribute measure.
Q10: What are the steps used for making Decision Tree?
Q11: Explain Attribute Selection Measures used in Decision Tree.
Q12: Explain Inductive Bias with Inductive System.
Q13: Explain Inductive Learning Algorithm. Which learning algorithms used in inductive bias?
Q14: What are the Performance Dimensions used for Instance based learning system?
Q15: Explain the Functions, Advantages and Disadvantages of Instance Based Learning.
Q16: Explain Locally Weighted Regression.
Q17: Explain the Architecture of Radial Basis Function Network.
Q18: What are the Functions, Advantages and Disadvantages of Case Based Learning System?
Q19: Describe Case Based Learning Cycle with Limitations, Benefits and Applications.
Q20: What are the Advantages and Disadvantages of KNN Algorithm?
UNIT 4- Artificial Neural Networks
Lecture: 25
4.1 The Perceptron Model
The Perceptron model is treated as one of the simplest and best-known types of artificial neural
networks. It is a supervised learning algorithm for binary classifiers. We can consider it
a single-layer neural network with four main parameters: input values, weights and bias,
net sum, and an activation function.
4.1.2 What is Binary classifier in Machine Learning?
In machine learning, binary classifiers are defined as functions that decide whether an
input, represented as a vector of numbers, belongs to some specific class or not.
Binary classifiers can be considered linear classifiers. In simple words, we can understand
a binary classifier as a classification algorithm whose prediction is based on a linear predictor
function combining a weight vector with the feature vector.
Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another
important parameter of the Perceptron’s components. Weight is directly proportional to the
strength of the associated input neuron in deciding the output. Further, bias can be
considered as the intercept term in a linear equation.
Activation Function:
These are the final and important components that help to determine whether the neuron will
fire or not. Activation Function can be considered primarily as a step function.
The data scientist chooses the activation function to suit the problem statement and the
desired outputs. The activation function may differ (e.g., sign, step, or sigmoid) between
perceptron models, depending on whether the learning process is slow or suffers from
vanishing or exploding gradients.
Figure 4.3: Structure of ANN
This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is
indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift
the activation function curve up or down.
Step-1
In the first step, multiply all input values with their corresponding weight values and then
add them to determine the weighted sum:
∑wi*xi
Add a special term called bias 'b' to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
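A minimal Python sketch of these two steps, together with the classic perceptron learning rule (weights are nudged on every misclassified example); the AND data is a standard linearly separable toy example.

import numpy as np

def step(z):
    # Step activation: fire (1) if the weighted sum exceeds 0, else 0.
    return 1 if z > 0 else 0

def train_perceptron(X, y, lr=0.1, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(np.dot(w, xi) + b)   # Y = f(∑ wi*xi + b)
            w += lr * (target - pred) * xi   # adjust weights on mistakes
            b += lr * (target - pred)        # adjust bias on mistakes
    return w, b

# Toy data: the logical AND function (linearly separable).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, xi) + b) for xi in X])  # [0, 0, 0, 1]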
4.3 Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types: the single-layer perceptron model and the multi-layer perceptron model.
4.3.1 Advantages of Multi-Layer Perceptron:
A multi-layered perceptron model can be used to solve complex non-linear problems.
It works well with both small and large input data.
It helps us to obtain quick predictions after the training.
It helps to obtain the same accuracy ratio with large as well as small data.
4.3.2 Disadvantages of Multi-Layer Perceptron:
In multi-layer perceptron, computations are difficult and time-consuming.
In a multi-layer Perceptron, it is difficult to predict how much each independent variable
affects the dependent variable.
The model functioning depends on the quality of the training.
Perceptron Function
Perceptron function ''f(x)'' can be achieved as output by multiplying the input 'x' with the
learned weight coefficient 'w'.
Mathematically, we can express it as follows:
f(x)=1; if w.x+b>0
otherwise, f(x)=0
'w' represents real-valued weights vector
'b' represents the bias
'x' represents a vector of input x values.
Characteristics of Perceptron
The perceptron model has the following characteristics.
Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
In Perceptron, the weight coefficient is automatically learned.
Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
The activation function applies a step rule to check whether the weight function is greater
than zero.
The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
If the added sum of all input values is more than the threshold value, it must have an output
signal; otherwise, no output will be shown.
Limitations of Perceptron Model
A perceptron model has limitations as follows:
The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer
function.
Perceptron can only be used to classify the linearly separable sets of input vectors. If input
vectors are non-linear, it is not easy to classify them properly.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps to interpret data
by building intuitive patterns and applying them in the future. Machine learning is a rapidly
growing technology of Artificial Intelligence that is continuously evolving and in the
developing phase; hence the future of perceptron technology will continue to support and
facilitate analytical behavior in machines that will, in turn, add to the efficiency of
computers.
The perceptron model is continuously becoming more advanced and working efficiently on
complex problems with the help of artificial neurons.
Lecture: 26
4.4 Gradient Descent in Machine Learning
Gradient Descent is known as one of the most commonly used optimization algorithms to
train machine learning models by means of minimizing errors between actual and expected
results. Further, gradient descent is also used to train Neural Networks.
In this lecture on Gradient Descent in Machine Learning, we will learn in detail about
gradient descent, the role of cost functions (which act as a barometer within machine
learning), types of gradient descent, learning rates, etc.
The best way to define the local minimum or local maximum of a function using gradient
descent is as follows:
If we move towards a negative gradient or away from the gradient of the function at the
current point, it will give the local minimum of that function.
Whenever we move towards a positive gradient or towards the gradient of the function at
the current point, we will get the local maximum of that function.
Figure 4.4: Gradient Descent
This entire procedure is known as Gradient Descent, which is also known as the steepest
descent. The main objective of using a gradient descent algorithm is to minimize the cost
function by iteration. To achieve this goal, it performs two steps iteratively:
Calculate the first-order derivative of the function to compute the gradient or slope of that
function.
Move in the direction opposite to the gradient (the direction in which the slope increases)
by alpha times the gradient, where alpha is defined as the learning rate. It is a tuning
parameter in the optimization process which helps to decide the length of the steps.
The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to reduce
the cost function.
Hypothesis: Y = mX + c
Parameters: m and c
Cost function: the error between predicted and actual values (e.g., mean squared error)
Goal: minimize the cost function by updating m and c
Here 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point(shown in above fig.) is used to evaluate the performance as it is considered
just as an arbitrary point. At this starting point, we will derive the first derivative or slope
and then use a tangent line to calculate the steepness of this slope. Further, this slope will
inform the updates to the parameters (weights and bias).
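As a concrete illustration of these updates, here is a minimal batch gradient descent sketch for the hypothesis Y = mX + c, minimizing the mean squared error; the data is generated from a known line purely for demonstration.

import numpy as np

def gradient_descent(X, y, lr=0.05, epochs=1000):
    m, c = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        y_pred = m * X + c
        # Partial derivatives of MSE = (1/n) * Σ (y − y_pred)²
        dm = (-2 / n) * np.sum(X * (y - y_pred))
        dc = (-2 / n) * np.sum(y - y_pred)
        m -= lr * dm   # step opposite to the gradient
        c -= lr * dc
    return m, c

# Data generated from y = 3x + 2 with a little noise.
rng = np.random.default_rng(1)
X = np.linspace(0, 5, 40)
y = 3 * X + 2 + 0.1 * rng.standard_normal(40)
print(gradient_descent(X, y))  # close to (3.0, 2.0)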
The slope is steeper at the starting (arbitrary) point; as new parameters are generated,
the steepness gradually reduces, until the curve approaches its lowest point, which is
called the point of convergence.
The main objective of gradient descent is to minimize the cost function, i.e. the error between
expected and actual values. To minimize the cost function, two components are required: the
direction in which to move (given by the gradient) and the learning rate (the size of each step).
Types of gradient descent:
Batch gradient descent (BGD) computes the error for each point in the training set and
updates the model only after evaluating all training examples. One such pass is known as a
training epoch. In simple words, it sums over all examples for each update. It is
computationally efficient, as all resources are used to process all training samples together.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training
example per iteration: the parameters are updated after each individual example rather than
after the whole dataset. As it requires only one training example at a time, it is easier to
fit in allocated memory. However, it loses some computational efficiency compared to batch
gradient descent because of its frequent updates, and the resulting gradient estimate is
noisy. This noise, though, can sometimes help in escaping local minima and finding the
global minimum.
In SGD, learning happens on every example, and this gives it a few advantages over other
forms of gradient descent.
Mini-batch gradient descent is a combination of batch gradient descent and stochastic
gradient descent. It divides the training dataset into small batches and performs an update
after each batch. Splitting the training dataset into smaller batches strikes a balance
between the computational efficiency of batch gradient descent and the speed of
stochastic gradient descent. Hence, we achieve a gradient descent with
higher computational efficiency and a less noisy gradient.
Although Gradient Descent is one of the most popular methods for optimization
problems, it still has some challenges, as follows:
For convex problems, gradient descent can find the global minimum easily, while for non-
convex problems, it is sometimes difficult to find the global minimum, where the machine
learning models achieve the best results.
Whenever the slope of the cost function is zero or close to zero, the model stops
learning. Apart from the global minimum, such near-zero slopes also occur at saddle
points and local minima. A local minimum has a shape similar to the global minimum,
where the slope of the cost function increases on both sides of the current point.
In contrast, at a saddle point the negative gradient occurs only on one side of the point,
which is a local maximum on one side and a local minimum on the other. A saddle point
takes its name from the shape of a horse's saddle.
A local minimum is so named because the value of the loss function is minimal at that point
within a local region. In contrast, the global minimum is so named because the value of
the loss function is minimal there globally, across the entire domain of the loss function.
In a deep neural network, if the model is trained with gradient descent and backpropagation,
two more issues can occur besides local minima and saddle points.
Vanishing Gradient occurs when the gradient is smaller than expected. During
backpropagation, the gradient becomes progressively smaller, so the earlier layers learn
more slowly than the later layers of the network. When this happens, the weight updates of
the earlier layers become insignificant and training stalls.
Exploding gradient is the opposite of the vanishing gradient: it occurs when the gradient
is too large and creates an unstable model. In this scenario the model weights grow too
large and may end up represented as NaN. This problem can be mitigated by techniques such
as gradient clipping or weight regularization.
Here the nodes marked as “1” are known as bias units. The leftmost layer or Layer
1 is the input layer, the middle layer or Layer 2 is the hidden layer and the rightmost
layer or Layer 3 is the output layer. We can say that the above diagram has 3 input
units (excluding the bias unit), 1 output unit, and 4 hidden units (the bias unit is not
included).
Suppose we have n inputs (x1, x2, …, xn) and a bias unit. Let the weights applied be
w1, w2, …, wn. Then the summation r is found by performing the dot product of the inputs
and weights and adding the bias unit:
r = w1*x1 + w2*x2 + … + wn*xn + b
On feeding r into the activation function F(r) we find the output of the hidden-layer neuron.
For the first neuron of hidden layer h1:
h11 = F(r)
For all the other hidden layers repeat the same procedure, and keep repeating the process
until the last weight set is reached.
Lecture: 28
Figure 2 depicts the network components which affect a particular weight change.
Notice that all the necessary components are locally related to the weight being
updated. This is one feature of backpropagation that seems biologically plausible.
However, brain connections appear to be unidirectional and not bidirectional as
would be required to implement backpropagation.
4.6.1 Notation
For the purpose of this derivation, we will use the following notation:
Figure 2: The change to a hidden-to-output weight depends on error (depicted as a lined
pattern) at the output node and activation (depicted as a solid pattern) at the hidden node,
while the change to an input-to-hidden weight depends on error at the hidden node (which in
turn depends on error at all the output nodes) and activation at the input node.
4.7 Review of Calculus Rules
Let’s consider each of these partial derivatives in turn. Note that only one term of the E
summation will have a non-zero derivative: the one associated with the particular weight we
are considering.
We’d like to be able to rewrite this result in terms of the activation function. Notice
that:
Using this fact, we can rewrite the result of the partial derivative as:
4.7.5 Weight change rule for a hidden to output weight
Now substituting these results back into our original equation, we have:
Now we have to determine the appropriate weight change for an input to hidden weight.
This is more complicated because it depends on the error at all of the nodes this weighted
connection can lead to.
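Since the algebra is easier to follow alongside working code, here is a minimal NumPy sketch of one backpropagation step for a 2-2-1 sigmoid network with squared-error loss. The input, target, and learning rate are hypothetical; the two update lines mirror the two weight-change rules discussed above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input-to-hidden weights
W2 = rng.normal(size=(1, 2))   # hidden-to-output weights
x = np.array([0.5, 0.9])       # hypothetical input
t = np.array([1.0])            # hypothetical target
lr = 0.5

# Forward pass
h = sigmoid(W1 @ x)            # hidden activations
o = sigmoid(W2 @ h)            # output activation

# Backward pass; for the sigmoid, f'(net) = f(net) * (1 − f(net))
delta_o = (o - t) * o * (1 - o)           # error term at the output node
delta_h = (W2.T @ delta_o) * h * (1 - h)  # error propagated back to the hidden nodes

W2 -= lr * np.outer(delta_o, h)  # hidden-to-output change: output error × hidden activation
W1 -= lr * np.outer(delta_h, x)  # input-to-hidden change: hidden error × input activation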
Lecture: 29
4.8 Generalization
Generalization is a fundamental concept in machine learning (ML) and artificial
intelligence (AI). It refers to a model's capacity to function well with fresh, previously
unknown data that was not part of the training dataset. Generalization rules in AI enable
models to make correct predictions and judgments based on the information gathered
from training data. These criteria ensure that models learn the underlying patterns and
relationships in the data rather than memorizing individual samples. By focusing on
generalization, AI models can apply what they've learnt to a variety of settings, increasing
their efficacy and reliability.
4.8.1 Difference Between Memorization and Generalization
When a model learns training data so well that it performs very well on it but is unable to
apply this knowledge to fresh data, this is known as memorization. On the other hand, a
well-generalizing model can deduce and forecast results for data points it hasn't seen in
training.
4.8.2 Generalization vs. Overfitting
When a model learns the noise and details in the training set to such an extent that it
becomes unreliable on new data, this is known as overfitting. Since the objective
of generalization is to develop models that continue to perform well on both seen and
unseen data, this is a crucial problem.
4.8.3 Theoretical Foundations of Generalization
4.9 Self-Organizing Map (SOM)
The architecture of the Self-Organizing Map with two clusters and n input features of any
sample is given below:
Let’s say an input data of size (m, n) where m is the number of training examples and n
is the number of features in each example. First, it initializes the weights of size (n, C)
where C is the number of clusters. Then iterating over the input data, for each training
example, it updates the winning vector (the weight vector with the shortest distance, e.g.
Euclidean distance, from the training example). The weight updation rule is given by:
wᵢⱼ(new) = wᵢⱼ(old) + α(t) * (xᵢᵏ − wᵢⱼ(old))
where alpha is a learning rate at time t, j denotes the winning vector, i denotes the
ith feature of training example and k denotes the kth training example from the input data.
After training the SOM network, trained weights are used for clustering new examples.
A new example falls in the cluster of winning vectors.
4.9.2 Algorithm
Training:
i. Step 1: Initialize the weights wᵢⱼ; random values may be assumed. Initialize the
learning rate α.
ii. Step 2: For each training example x, calculate the squared Euclidean distance to each
weight vector: D(j) = ∑ᵢ (xᵢ − wᵢⱼ)².
iii. Step 3: Find the index J for which D(j) is minimum; that is the winning index.
iv. Step 4: For each unit j within a specific neighborhood of J, and for all i, calculate the
new weight:
wᵢⱼ(new) = wᵢⱼ(old) + α(t) * (xᵢ − wᵢⱼ(old))
v. Step 5: Update the learning rate, e.g. α(t+1) = 0.5 * α(t).
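A simplified NumPy sketch of this algorithm, assuming a degenerate neighborhood that updates only the winning vector (a full SOM would also update the winner's neighbors); the two-group data is invented for illustration.

import numpy as np

def train_som(data, n_clusters=2, alpha=0.5, epochs=50):
    rng = np.random.default_rng(0)
    weights = rng.random((n_clusters, data.shape[1]))  # Step 1: random weights
    for _ in range(epochs):
        for x in data:
            D = np.sum((weights - x) ** 2, axis=1)     # Step 2: D(j) = Σ (xi − wij)²
            J = int(np.argmin(D))                      # Step 3: winning index
            weights[J] += alpha * (x - weights[J])     # Step 4: weight update rule
        alpha *= 0.5                                   # Step 5: decay the learning rate
    return weights

# Hypothetical 2-D data forming two obvious groups.
data = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
weights = train_som(data)
new_x = np.array([0.15, 0.15])
print(int(np.argmin(np.sum((weights - new_x) ** 2, axis=1))))  # cluster of the new example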
Lecture: 30
4.10 Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN), also known as ConvNet, is a specialized type of deep
learning algorithm mainly designed for tasks that necessitate object recognition, including image
classification, detection, and segmentation. CNNs are employed in a variety of practical
scenarios, such as autonomous vehicles, security camera systems, and others.
CNNs are distinguished from classic machine learning algorithms such as SVMs and decision
trees by their ability to autonomously extract features at a large scale, bypassing the need for
manual feature engineering and thereby enhancing efficiency.
Beyond image classification tasks, CNNs are versatile and can be applied to a range of other
domains, such as natural language processing, time series analysis, and speech recognition.
4.10.2 Inspiration Behind CNN and Parallels With The Human Visual System
Convolutional neural networks were inspired by the layered architecture of the human visual
cortex, and below are some key similarities and differences:
Figure 4.11: Biologically inspired CNN
i. Hierarchical architecture: Both CNNs and the visual cortex have a hierarchical
structure, with simple features extracted in early layers and more complex features built
up in deeper layers. This allows increasingly sophisticated representations of visual
inputs.
ii. Local connectivity: Neurons in the visual cortex only connect to a local region of the
input, not the entire visual field. Similarly, the neurons in a CNN layer are only connected
to a local region of the input volume through the convolution operation. This local
connectivity enables efficiency.
iii. Translation invariance: Visual cortex neurons can detect features regardless of their
location in the visual field. Pooling layers in a CNN provide a degree of translation
invariance by summarizing local features.
iv. Multiple feature maps: At each stage of visual processing, there are many different
feature maps extracted. CNNs mimic this through multiple filter maps in each convolution
layer.
v. Non-linearity: Neurons in the visual cortex exhibit non-linear response properties. CNNs
achieve non-linearity through activation functions like ReLU applied after each
convolution.
vi. CNNs mimic the human visual system but are simpler, lacking its complex feedback
mechanisms and relying on supervised learning rather than unsupervised, driving
advances in computer vision despite these differences.
4.10.3 Key Components of a CNN
The convolutional neural network is made of four main parts.
They help the CNNs mimic how the human brain operates to recognize patterns and features in
images:
Convolutional layers
Rectified Linear Unit (ReLU for short)
Pooling layers
Fully connected layers
This section dives into each of these components through the example of classifying a handwritten digit.
Lecture: 31
4.11 Convolutional layers
This is the first building block of a CNN. As the name suggests, the main mathematical task
performed is called convolution, which is the application of a sliding window function to a matrix
of pixels representing an image. The sliding function applied to the matrix is called a kernel or
filter, and the two terms can be used interchangeably.
In the convolution layer, several filters of equal size are applied, and each filter is used to
recognize a specific pattern from the image, such as the curving of the digits, the edges, the whole
shape of the digits, and more.
Put simply, in the convolution layer, we use small grids (called filters or kernels) that move over
the image. Each small grid is like a mini magnifying glass that looks for specific patterns in the
photo, like lines, curves, or shapes. As it moves across the photo, it creates a new grid that
highlights where it found these patterns.
For example, one filter might be good at finding straight lines, another might find curves, and so
on. By using several different filters, the CNN can get a good idea of all the different patterns
that make up the image.
Let’s consider this 32x32 grayscale image of a handwritten digit. The values in the matrix are
given for illustration purposes.
Figure 4.13: Illustration of the input image and its pixel representation
Also, let’s consider the kernel used for the convolution: a matrix with a dimension of 3x3.
The weight of each element of the kernel is represented in the grid; zero weights are shown
in the black cells and ones in the white cells.
4.11.1 Do we have to manually find these weights?
In real life, the weights of the kernels are determined during the training process of the neural
network.
Using these two matrices, we can perform the convolution operation by applying the dot product,
which works as follows:
i. Apply the kernel matrix from the top-left corner to the right.
ii. Perform element-wise multiplication.
iii. Sum the values of the products.
iv. The resulting value corresponds to the first value (top-left corner) in the convoluted matrix.
v. Move the kernel by the stride of the sliding window.
vi. Repeat steps 1 to 5 until the image matrix is fully covered.
vii. The dimension of the convolved matrix depends on the stride of the sliding window: the
larger the stride, the smaller the dimension.
Figure 4.14: Application of the convolution task using a stride of 1 with 3x3 kernel
Another name associated with the kernel in the literature is feature detector because the weights
can be fine-tuned to detect specific features in the input image.
For instance:
i. A kernel that averages neighboring pixels can be used to blur the input image.
ii. A kernel that subtracts neighboring pixels can be used to perform edge detection.
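The steps above can be written directly as a small NumPy function; this is a minimal sketch with no padding, and the 5x5 image and edge-detecting kernel are invented for illustration.

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image, multiply element-wise, and sum.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# A 5x5 "image" with a vertical edge, and a 3x3 vertical-edge kernel.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
kernel = np.array([[1, 0, -1]] * 3, dtype=float)
print(convolve2d(image, kernel))  # large-magnitude responses mark the edge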
The more convolution layers the network has, the better it is at detecting more abstract
features.
Figure 4.15: Application of max pooling with a stride of 2 using a 2x2 filter
Also, the dimension of the feature map becomes smaller as the pooling function is applied.
The last pooling layer flattens its feature map so that it can be processed by the fully connected
layer.
4.11.4 Fully connected layers
These layers are in the last layer of the convolutional neural network, and their inputs correspond
to the flattened one-dimensional matrix generated by the last pooling layer. ReLU activation
functions are applied to them for non-linearity.
Finally, a softmax prediction layer is used to generate probability values for each of the possible
output labels, and the final label predicted is the one with the highest probability score.
Overfitting can be observed when the performance on training data is much higher than the
performance on validation or testing data; a graphical illustration is given below:
Deep learning models, especially Convolutional Neural Networks (CNNs), are particularly
susceptible to overfitting due to their capacity for high complexity and their ability to learn
detailed patterns in large-scale data.
Lecture: 32
I. Dropout: This consists of randomly dropping some neurons during the training process,
which forces the remaining neurons to learn new features from the input data.
II. Batch normalization: Overfitting is reduced to some extent by normalizing the input of each
layer, adjusting and scaling the activations. This approach is also used to speed up and
stabilize the training process.
III. Pooling Layers: This can be used to reduce the spatial dimensions of the input image to
provide the model with an abstracted form of representation, hence reducing the chance of
overfitting.
IV. Early stopping: This consists of consistently monitoring the model’s performance on
validation data during the training process and stopping the training whenever the validation
error does not improve anymore.
V. Noise injection: This process consists of adding noise to the inputs or the outputs of hidden
layers during the training to make the model more robust and prevent it from a weak
generalization.
VI. L1 and L2 regularization: Both L1 and L2 add a penalty to the loss function
based on the size of the weights. More specifically, L1 encourages the weights to be sparse,
leading to better feature selection. On the other hand, L2 (also called weight decay)
encourages the weights to be small, preventing them from having too much influence on the
predictions.
VII. Data augmentation: This is the process of artificially increasing the size and diversity of the
training dataset by applying random transformations like rotation, scaling, flipping, or
cropping to the input images.
Some practical applications of CNNs:
Image classification: Convolutional neural networks are used for image categorization, where
images are assigned to predefined categories. One use of such a scenario is automatic photo
organization on social media platforms.
Object detection: CNNs are able to identify and locate multiple objects within an image. This
capability is crucial in scenarios such as shelf scanning in retail to identify out-of-stock items.
Facial recognition: this is also one of the main application areas of CNNs. For instance,
this technology can be embedded into security systems for efficient access control based on
facial features.
4.12 Important Questions (PYQs)
Q1: Explain different layers of CNN (Convolutional network) with suitable examples.
Q2: What is Self-Organizing Map (SOM)? Explain the stages and steps in SOM Algorithm.
Q4: What are Neural Networks? What are the types of Neural Networks?
Q8: Discuss the role of Activation function in neural networks. Also discuss various types
of activation functions with formulas and diagrams.
Q9: Describe Artificial Neural Networks (ANN) with different Layers and its
characteristics.
Q10: What are the Advantages and Disadvantages of ANN? Explain the application areas
of ANN?
Q12: Explain different types of Gradient Descent with advantages and disadvantages.
Q17: Discuss selection of various parameters in Back propagation Neural Network (BPN)
and its effects.
UNIT 5 - Reinforcement Learning
REINFORCEMENT LEARNING
Lecture: 33
Some key terms used in reinforcement learning:
i. Agent (): An entity that can perceive and explore the environment and act upon it.
ii. Environment (): A situation in which an agent is present or surrounded by. In RL, we
assume a stochastic environment, which means it is random in nature.
iii. Action (): Actions are the moves taken by an agent within the environment.
iv. State (): State is a situation returned by the environment after each action taken by the
agent.
v. Reward (): A feedback returned to the agent from the environment to evaluate the action
of the agent.
vi. Policy (): Policy is a strategy applied by the agent for the next action based on the current
state.
vii. Value (): It is the expected long-term return with the discount factor, as opposed to the
short-term reward.
viii. Q-value (): It is mostly similar to the value, but it takes one additional parameter as a
current action (a).
The agent takes the next action and changes states according to the feedback of the previous
action.
The environment is stochastic, and the agent needs to explore it in order to obtain the maximum
positive rewards.
5.4 Approaches to implement Reinforcement Learning
There are mainly three ways to implement reinforcement-learning in ML, which are:
5.4.1 Value-based:
The value-based approach tries to find the optimal value function, which gives the
maximum value at a state under any policy. The agent therefore expects the long-term
return at any state s under policy π.
5.4.2 Policy-based:
The policy-based approach tries to find the optimal policy for the maximum future rewards without
using the value function. In this approach, the agent tries to apply such a policy that the action
performed at each step helps to maximize the future reward. The policy-based approach has
mainly two types of policy:
i. Deterministic: The same action is produced by the policy (π) at any state.
ii. Stochastic: In this policy, probability determines the produced action.
5.4.3 Model-based:
In the model-based approach, a virtual model is created for the environment, and the agent
explores that environment to learn it. There is no particular solution or algorithm for this
approach because the model representation is different for each environment.
There are four main elements of reinforcement learning:
Policy
Reward Signal
Value Function
Model of the environment
I. Policy: A policy can be defined as a way how an agent behaves at a given time. It maps
the perceived states of the environment to the actions taken on those states. A policy is
the core element of the RL as it alone can define the behavior of the agent. In some cases,
it may be a simple function or a lookup table, whereas, for other cases, it may involve
general computation such as a search process. The policy could be deterministic or stochastic.
II. Reward Signal: The goal of reinforcement learning is defined by the reward signal. At
each state, the environment sends an immediate signal to the learning agent, and this
signal is known as a reward signal. These rewards are given according to the good and
bad actions taken by the agent. The agent's main objective is to maximize the total number
of rewards for good actions. The reward signal can change the policy, such as if an action
selected by the agent leads to low reward, then the policy may change to select other
actions in the future.
III. Value Function: The value function gives information about how good the situation and
action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good state and
action for the future. The value function depends on the reward as, without reward, there
could be no value. The goal of estimating values is to achieve more rewards.
IV. Model: The last element of reinforcement learning is the model, which mimics the
behavior of the environment. With the help of the model, one can make inferences about
how the environment will behave: if a state and an action are given, the model
can predict the next state and reward. The model is used for planning; it provides a way
to decide a course of action by considering all future situations before actually
experiencing them. The approaches for solving RL problems with the help of a model are
termed model-based approaches, while an approach without a model is called a
model-free approach.
Supervised Learning (for comparison):
i. In supervised learning, the decision is made on the initial input, i.e. the input given at the start.
ii. Supervised learning decisions are independent of each other, so labels are given to each decision.
iii. Example: Object recognition.
I. Passive reinforcement learning:
i. In passive learning, the agent's policy π is fixed: in state s, it always executes the action
π(s).
ii. Its goal is simply to learn how good the policy is, that is, to learn the utility function Uπ(s).
iii. Fig. 5.7.1 shows a policy for the world and the corresponding utilities.
iv. In Fig. 5.7.1(a) the policy happens to be optimal with rewards of R(s) = - 0.04 in the non-
terminal states and no discounting.
v. The passive learning agent does not know the transition model T(s, a, s'), which specifies the
probability of reaching state s' from state s after doing action a; nor does it know the reward
function R(s), which specifies the reward for each state.
vi. The agent executes a set of trials in the environment using its policy π.
vii. In each trial, the agent starts in state (1, 1) and experiences a sequence of state transitions
until it reaches one of the terminal states, (4, 2) or (4, 3).
viii. Its percepts supply both the current state and the reward received in that state. Typical
trials might look like this:
ix. (1,1)-0.04 (1,2)-0.04 (1,3)-0.04 (1,2)-0.04 (1,3)-0.04 (2,3)-0.04 (3,3)-0.04 (4,3)+1
x. (1,1)-0.04 (1,2)-0.04 (1,3)-0.04 (2,3)-0.04 (3,3)-0.04 (3,2)-0.04 (3,3)-0.04 (4,3)+1
xi. (1,1)-0.04 (1,2)-0.04 (3,1)-0.04 (3,2)-0.04 (4,2)+1
Fig. 5.7.1(a): A policy π for the 4x3 world (grid figure: arrows show the policy; the terminal states carry rewards +1 and −1)
xii. Each state percept is subscripted with the reward received. The objective is to use the
information about rewards to learn the expected utility Uπ(s) associated with each non-
terminal state s.
xiii. The utility is defined as the expected sum of (discounted) rewards obtained if policy π
is followed:
Uπ(s) = E[ ∑ₜ γᵗ R(sₜ) ], t = 0 to ∞, with s₀ = s
II. Active reinforcement learning:
b. The RL agent's objective is to maximize the total reward it receives in the long run.
c. It defines good and bad events.
d. It cannot be altered by the agent but may inform change of policy.
e. It can be probabilistic (expected reward).
3. Value function (V):
a. It defines the total amount of reward an agent can expect to accumulate over the future, starting
from that state.
b. A state may yield a low reward but have a high value (or the opposite). For example, immediate
pain/pleasure vs. long term happiness.
4. Transition model (M):
a. It defines the transitions in the environment: an action a taken in state s will lead to state s'.
b. It can be probabilistic.
v. Classification: A classification task is about predicting the category of data (discrete
variables). Common examples are predicting whether or not an email is spam, whether a
person is suffering from a particular disease, or whether a transaction is fraudulent.
vi. Clustering: Clustering tasks are all about finding natural groupings of data and a label associated
with each of these groupings (clusters).
vii. Some common examples include customer segmentation and product feature identification for
a product roadmap.
viii. Multivariate querying: Multivariate querying is about querying or finding similar objects.
ix. Density estimation: Density estimation problems are related with finding likelihood or frequency
of objects.
x. Dimension reduction: Dimension reduction is the process of reducing the number of random
variables under consideration, and can be divided into feature selection and feature extraction.
xi. Model/algorithm selection: Many times, there are multiple models trained using
different algorithms. One important task is to select the most optimal models for deploying
them to production.
xii. Testing and matching: Testing and matching tasks relates to comparing data sets.
Lecture: 34
5.10.3 Approaches used to implement reinforcement learning algorithm.
There are three approaches used to implement a reinforcement learning algorithm:
1. Value-Based:
a. In a value-based reinforcement learning method, we try to maximize a value function V(s).
In this method, the agent expects a long-term return of the current states under policy π.
2. Policy-based:
In a policy-based RL method, we try to come up with such a policy that the action performed in
every state helps you to gain maximum reward in the future.
Two types of policy-based methods are:
i. Deterministic: For any state, the same action is produced by the policy
ii. Stochastic: Every action has a certain probability, which is determined by the stochastic
policy:
π(a|s) = P[Aₜ = a | Sₜ = s]
3. Model-Based:
a. In this Reinforcement Learning method, we need to create a virtual model for each environment.
b. The agent learns to perform in that specific environment.
Lecture: 35
a) Reinforcement learning is defined by a specific type of problem, and all its solutions are classed as
reinforcement learning algorithms.
b) In the problem, an agent is supposed to decide the best action to select based on its current state.
c) When this step is repeated, the problem is known as a Markov Decision Process.
d) A Markov Decision Process (MDP) model contains:
i) A State is a set of tokens that represent every state that the agent can be in.
ii) A Model (sometimes called Transition Model) gives an action's effect in a state. In particular,
T(S, a, S') defines a transition T where being in state S and taking an action 'a' takes us to state
S' (S and S' may be same).
iii) An Action A is set of all possible actions. A(s) defines the set of actions that can be taken being
in state S.
iv) A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S, a) indicates the reward for being in a state S and taking an action 'a'. R(S, a, S')
indicates the reward for being in a state S, taking an action 'a' and ending up in a state S'.
v) A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to a. It
indicates the action 'a' to be taken while in state S.
We cannot apply the reinforcement learning model in every situation. The following are
conditions when we should not use reinforcement learning:
(1) When we have enough data to solve the problem with a supervised learning method.
(2) When the action space is large, since reinforcement learning then becomes computation-
heavy and time-consuming.
Challenges we will face while doing reinforcement learning:
1. Feature/reward design which should be very involved.
2. Parameters may affect the speed of learning.
3. Realistic environments can have partial observability.
4. Too much reinforcement may lead to an overload of states which can diminish
the results.
5. Realistic environments can be non-stationary.
5.12 Q-learning
v. On the other hand, an on-policy learner learns the value of the policy being carried out by
the agent, including the exploration steps and it will find a policy that is optimal, taking into
account the exploration inherent in the policy.
Step 1: Initialize the Q-table: First the Q-table has to be built. There are n columns, where n =
number of actions, and m rows, where m = number of states.
In our example n = {Go left, Go right, Go up, Go down} and m = {Start, Idle, Correct path, Wrong
path, End}. First, let's initialize every value at 0.
Step 2: Choose an action.
Step 3: Perform an action: The combination of steps 2 and 3 is performed for an undefined
amount of time; these steps run until training is stopped, or until the training loop ends as
defined in the code.
a. First, an action (a) in the state (s) is chosen based on the Q-table. Note that, when the episode
initially starts, every Q-value is 0.
b. Then, update the Q-value for being at the start and moving right using the Bellman equation:
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ · max over a' of Q(s', a') − Q(s, a) ]
Step 4: Measure reward: Now we have taken an action and observed an outcome and reward.
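The following is a minimal tabular Q-learning sketch tying these steps together on a hypothetical 5-state corridor (actions: 0 = left, 1 = right; reward +1 only on reaching the last state); all constants are illustrative.

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))        # Step 1: initialize the Q-table to 0
alpha, gamma, epsilon = 0.1, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != 4:                           # episode ends at the terminal state
        # Step 2: choose an action (epsilon-greedy exploration)
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        # Step 3: perform the action
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0     # Step 4: measure the reward
        # Bellman update
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # states 0..3 learn action 1 (move right)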
Lecture: 36
5.14.2 Pseudo code for deep Q-learning.
Start with Q₀(s, a) for all s, a.
Get initial state s.
For k = 1, 2, … till convergence:
    Sample action a, get next state s'.
    If s' is terminal:
        target = R(s, a, s')
        Sample new initial state s'
    else:
        target = R(s, a, s') + γ · max over a' of Qₖ(s', a')
    θₖ₊₁ ← θₖ − α ∇θ E s'∼P(s'|s,a) [ (Qθ(s, a) − target)² ]
    s ← s'
Lecture: 37
i. Genetic algorithms are computerized search and optimization algorithms based on the
mechanics of natural genetics and natural selection.
ii. These algorithms mimic the principles of natural genetics and natural selection to construct
search and optimization procedures.
iii. Genetic algorithms convert the design space into genetic space. Design space is a set of feasible
solutions.
iv. Genetic algorithms work with a coding of variables.
v. The advantage of working with a coding of variables space is that coding discretizes the search
space even though the function may be continuous.
vi. Search space is the space for all possible feasible solutions of particular problem.
vii. Following are the benefits of Genetic algorithm:
a. They are robust.
b. They provide optimization over large space state.
c. They do not break on slight change in input or presence of noise.
viii. Following are the application of Genetic algorithm:
a. Recurrent neural network
b. Mutation testing
c. Code breaking
d. Filtering and signal processing
e. Learning fuzzy rule base
Lecture: 38
3. Selection:
a. The idea of selection phase is to select the fittest individuals and let them pass their genes to the
next generation.
b. Two pairs of individuals (parents) are selected based on their fitness scores.
c. Individuals with high fitness have more chance to be selected for reproduction.
4. Crossover:
a. Crossover is the most significant phase in a genetic algorithm.
b. For each pair of parents to be mated, a crossover point is chosen at random from within the genes.
c. For example, consider the crossover point to be 3.
d. Offspring are created by exchanging the genes of parents among themselves until the crossover
point is reached.
e. The new offspring are added to the population.
5. Mutation:
a. When new offspring are formed, some of their genes can be subjected to a mutation with a low
random probability.
b. This implies that some of the bits in the bit string can be flipped.
c. Mutation occurs to maintain diversity within the population and prevent premature convergence.
6. Termination:
a. The algorithm terminates if the population has converged (does not produce offspring which are
significantly different from the previous generation).
b. Then it is said that the genetic algorithm has provided a set of solutions to our problem.
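A minimal Python sketch of the core of the crossover phase: single-point crossover on bit-string chromosomes (the parents and the random seed are illustrative).

import random

def single_point_crossover(parent1, parent2):
    # Exchange the parents' genes beyond a randomly chosen crossover point.
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

random.seed(0)
p1, p2 = "101100", "010011"
print(single_point_crossover(p1, p2))  # two offspring mixing both parents' genes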
Lecture: 39
5.17 Mutation
The mutation operator is a unary operator: it needs only one parent to work on. It does so
by selecting a few genes from the selected chromosome and applying the desired algorithm.
Five Mutation Algorithms for string manipulation –
I. Bit Flip Mutation
II. Random Resetting Mutation
III. Swap Mutation
IV. Scramble Mutation
V. Inversion Mutation
Bit flip mutation is mainly used for bit-string manipulation, while the others can be used for
any kind of string. Here our chromosome is represented as an array, and each index
represents one gene. Strings can be represented as an array of characters, which in turn is
an array of ASCII or numeric values.
1) Bit Flip Mutation:
In bit flip mutation, we select one or more genes (array indices) and flip their values, i.e.
we change 1s to 0s and vice versa.
2) Random Resetting Mutation:
In random resetting, a randomly chosen gene is assigned a random value from the set of
permissible values.
3) Swap Mutation:
In swap mutation we select two genes from our chromosome and interchange their values.
4) Scramble Mutation:
In scramble mutation we select a subset of our genes and scramble their values. The
selected genes need not be contiguous.
5) Inversion Mutation:
In inversion mutation we select a subset of our genes and reverse their order. The genes
have to be contiguous in this case. Illustrative implementations of these operators follow.
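The following Python sketch implements the five operators above on chromosomes represented as lists (one gene per index); the subset sizes chosen inside scramble are an illustrative assumption.

import random

def bit_flip(chrom):
    # Flip one randomly chosen bit (1 -> 0, 0 -> 1).
    c = chrom[:]
    i = random.randrange(len(c))
    c[i] = 1 - c[i]
    return c

def random_reset(chrom, alphabet):
    # Assign a random permissible value (from `alphabet`) to a randomly chosen gene.
    c = chrom[:]
    c[random.randrange(len(c))] = random.choice(alphabet)
    return c

def swap(chrom):
    # Interchange the values of two randomly chosen genes.
    c = chrom[:]
    i, j = random.sample(range(len(c)), 2)
    c[i], c[j] = c[j], c[i]
    return c

def scramble(chrom):
    # Shuffle the values of a random, not necessarily contiguous, gene subset.
    c = chrom[:]
    idx = random.sample(range(len(c)), k=max(2, len(c) // 3))
    vals = [c[i] for i in idx]
    random.shuffle(vals)
    for i, v in zip(idx, vals):
        c[i] = v
    return c

def inversion(chrom):
    # Reverse the order of a contiguous subset of genes.
    c = chrom[:]
    i, j = sorted(random.sample(range(len(c)), 2))
    c[i:j + 1] = list(reversed(c[i:j + 1]))
    return c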
5.19 Types of encoding in Genetic Algorithms
Genetic representations:
I. Encoding:
a. Encoding is the process of representing individual genes.
b. The process can be performed using bits, numbers, trees, arrays, lists or any other
objects.
c. The encoding depends mainly on the problem being solved.
1. Binary encoding:
a. Binary encoding is the most commonly used method of genetic representation,
because the first work on GAs used this type of encoding.
b. In binary encoding, every chromosome is a string of bits, 0 or 1.
c. Chromosome A: 101100101100101011100101
d. Chromosome B: 111111100000110000011111
e. Binary encoding gives many possible chromosomes even with a small number of alleles.
4. Value encoding:
a. Direct value encoding can be used in problems where complicated values, such as
real numbers, are used.
b. In value encoding, every chromosome is a string of some values.
c. Values can be anything connected to the problem: real numbers, characters, or more
complicated objects.
Chromosome A: 1.2324 5.3243 0.4556 2.3293 2.4545
Chromosome B: ABDJEIFJDHDIERJFDLDFLFEGT
Chromosome C: (back), (back), (right), (forward), (left)
5. Tree encoding:
a. Tree encoding is used for evolving programs or expressions, i.e. for genetic
programming.
b. In tree encoding, every chromosome is a tree of some objects, such as functions
or commands in a programming language.
c. The programming language LISP is often used for this, because programs in it are
represented in this form and can be easily parsed as a tree, so crossover
and mutation can be done relatively easily. A small illustration of these encodings follows.
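For concreteness, the encodings above can be written as Python data structures; the first three chromosomes are the examples from the text, while the expression tree is an illustrative assumption.

binary_chromosome = "101100101100101011100101"                   # binary encoding (bit string)
value_chromosome = [1.2324, 5.3243, 0.4556, 2.3293, 2.4545]      # value encoding (real numbers)
move_chromosome = ["back", "back", "right", "forward", "left"]   # value encoding (commands)
tree_chromosome = ("+", "x", ("*", 5, "y"))                      # tree encoding of the expression x + 5*y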
ii. The worst will have fitness 1, the next 2, ..., and the best will have fitness N (N is the
number of chromosomes in the population).
iii. The method can lead to slow convergence because the best chromosomes do not differ
much from the others. A code sketch of rank selection appears after this list.
e. Steady-state selection:
i. The main idea of this selection is that a large part of the chromosomes should survive to the next
generation.
ii. The GA then works in the following way:
1. In every generation a few chromosomes are selected for creating new offspring.
2. Then some chromosomes are removed and the new offspring are placed in their place.
3. The rest of the population survives to the new generation.
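The sketch below illustrates rank selection and one steady-state generation step as described above; the fitness function, the number of parents and the single-replacement policy are illustrative assumptions.

import random

def rank_select(population, fitness, k=2):
    # Rank chromosomes worst-to-best; selection weight equals rank (1 .. N).
    ranked = sorted(population, key=fitness)
    return random.choices(ranked, weights=range(1, len(ranked) + 1), k=k)

def steady_state_step(population, fitness, make_offspring):
    # A few parents create offspring; the worst chromosome is replaced,
    # and the rest of the population survives unchanged.
    p1, p2 = rank_select(population, fitness, k=2)
    child = make_offspring(p1, p2)
    survivors = sorted(population, key=fitness)[1:]   # drop the single worst
    return survivors + [child]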
Lecture: 40
1. Optimization: Genetic algorithms are most commonly used in optimization problems, where
we have to maximize or minimize a given objective function value under a given set of
constraints.
2. Economics: GAs are also used to characterize various economic models, such as the cobweb
model, game-theoretic equilibrium resolution, and asset pricing.
3. Neural networks: GAs are also used to train neural networks, particularly recurrent neural
networks.
4. Parallelization: GAs also have very good parallel capabilities, prove to be very effective
in solving certain problems, and provide a good area for research.
5. Image processing: GAs are used for various digital image processing (DIP) tasks, such as
dense pixel matching.
6. Machine learning: Genetics-based machine learning (GBML) is a niche area in machine
learning.
7. Robot trajectory generation: GAs have been used to plan the path that a robot arm takes in
moving from one point to another.
Lecture: 41
Q7: Explain the various types of reinforcement learning techniques with suitable examples.
Q9: What are the different types and elements of reinforcement learning?
Q13: Describe the Q-learning algorithm process and the steps involved in a deep Q-learning network.
Q14: Explain the different phases of a genetic algorithm, with advantages and disadvantages.
Q17: Explain the different selection methods in genetic algorithms used to select a population
for the next generation.
Meerut Institute of Engineering & Technology, Meerut
NH-58, Bypass Road, Baghpat Crossing, Meerut 250 005, U.P., INDIA