
Machine Learning Techniques
(BCAI-601, BCDS-062)
B.Tech - 3rd Year

Meerut Institute of Engineering & Technology, Meerut


MEERUT INSTITUTE OF ENGINEERING & TECHNOLOGY, MEERUT

Course Content
for
Machine Learning Techniques
(BCAI601, BCDS-062)
B.Tech 3rd Year
CSE(AI) / CSE(AI & ML) / CSE(DS) / CSE(IoT)

DR. A.P.J. ABDUL KALAM TECHNICAL UNIVERSITY, LUCKNOW
Even Semester: 2024-25

Vision of Institute

To be an outstanding institution in the country imparting technical education, providing need-based, value-based and career-based programmes and producing self-reliant, self-sufficient technocrats capable of meeting new challenges.

Mission of Institute

The mission of the institute is to educate young aspirants in various technical fields to fulfill the global requirement of human resources by providing sustainable quality education, training and an invigorating environment, besides molding them into skilled, competent and socially responsible citizens who will lead the building of a powerful nation.

EVALUATION SCHEME

Sl. | Subject Code(s)                  | Subject                                                         | L T P | CT | TA | Total | PS | TE | PE | Total | Credit
1   | BCS601                           | Software Engineering                                            | 3 1 0 | 20 | 10 | 30    |    | 70 |    | 100   | 4
2   | BCAI601                          | Machine Learning Techniques                                     | 3 1 0 | 20 | 10 | 30    |    | 70 |    | 100   | 4
3   | BCS603                           | Computer Networks                                               | 3 1 0 | 20 | 10 | 30    |    | 70 |    | 100   | 4
4   | BCAI061/BCDS061/BCAM061/BCAM062  | Departmental Elective-III                                       | 3 0 0 | 20 | 10 | 30    |    | 70 |    | 100   | 3
5   |                                  | Open Elective-I                                                 | 3 0 0 | 20 | 10 | 30    |    | 70 |    | 100   | 3
6   | BCS651                           | Software Engineering Lab                                        | 0 0 2 |    |    |       | 50 |    | 50 | 100   | 1
7   | BCAI651                          | Machine Learning Lab                                            | 0 0 2 |    |    |       | 50 |    | 50 | 100   | 1
8   | BCS653                           | Computer Networks Lab                                           | 0 0 2 |    |    |       | 50 |    | 50 | 100   | 1
9   | BNC601/BNC602                    | Constitution of India / Essence of Indian Traditional Knowledge | 2 0 0 | 20 | 10 | 30    |    | 70 |    |       |
    |                                  | Total                                                           |       |    |    |       |    |    |    | 800   | 21

* The Mini Project or Internship (4 weeks) will be done during the summer break after VI Semester and will be assessed during VII Semester.
* It is desirable that students do their Summer Internship or Mini Project in their specialization area in line with the B.Tech. program.

SEMESTER - VI

Departmental Elective-I
1. BCAI051 - Mathematical Foundation of AI, ML and Data Science
2. BCS058 - Data Warehouse & Data Mining
3. BCS052 - Data Analytics
4. BCS054 - Object Oriented System Design with C++

Departmental Elective-II
1. BCAM051 - Cloud Computing
2. BCAI052 - Natural Language Processing
3. BCS056 - Application of Soft Computing
4. BCS057 - Image Processing

Departmental Elective-III
1. BCAI061 - Cyber Forensic Analytics
2. BCDS061 - Image Analytics
3. BCAM061 - Social Media Analytics and Data Analysis
4. BCAM062 - Stream Processing and Analytics

BCAI601: MACHINE LEARNING TECHNIQUES

Course Outcomes (CO) and Bloom's Knowledge Levels (KL)

At the end of the course, the student will be able to:

CO1 | To understand the need for machine learning for various problem solving | K1, K2
CO2 | To understand a wide variety of learning algorithms and how to evaluate models generated from data | K1, K3
CO3 | To understand the latest trends in machine learning | K2, K3
CO4 | To design appropriate machine learning algorithms and apply the algorithms to a real-world problem | K4, K6
CO5 | To optimize the models learned and report on the expected accuracy that can be achieved by applying the models | K4, K5

L-T-P: 3-0-0

DETAILED SYLLABUS
Unit | Topic | Proposed Lectures

I | INTRODUCTION - Learning, Types of Learning, Well-defined learning problems, Designing a Learning System, History of ML, Introduction of Machine Learning Approaches - (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian networks, Support Vector Machine, Genetic Algorithm), Issues in Machine Learning and Data Science vs. Machine Learning | 08

II | REGRESSION: Linear Regression and Logistic Regression. BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes classifier, Bayesian belief networks, EM algorithm. SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel - (Linear kernel, polynomial kernel and Gaussian kernel), Hyperplane - (Decision surface), Properties of SVM, and Issues in SVM | 08

III | DECISION TREE LEARNING - Decision tree learning algorithm, Inductive bias, Inductive inference with decision trees, Entropy and information theory, Information gain, ID-3 Algorithm, Issues in Decision tree learning. INSTANCE-BASED LEARNING - k-Nearest Neighbour Learning, Locally Weighted Regression, Radial basis function networks, Case-based learning | 08

IV | ARTIFICIAL NEURAL NETWORKS - Perceptrons, Multilayer perceptron, Gradient descent and the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm, Generalization, Unsupervised Learning - SOM Algorithm and its variants. DEEP LEARNING - Introduction, concept of convolutional neural network, Types of layers - (Convolutional Layers, Activation function, pooling, fully connected), Concept of Convolution (1D and 2D) layers, Training of network, Case study of CNN, e.g., on Diabetic Retinopathy, Building a smart speaker, Self-driving car, etc. | 08

V | REINFORCEMENT LEARNING - Introduction to Reinforcement Learning, Learning Task, Example of Reinforcement Learning in Practice, Learning Models for Reinforcement - (Markov Decision process, Q Learning - Q Learning function, Q Learning Algorithm), Application of Reinforcement Learning, Introduction to Deep Q Learning. GENETIC ALGORITHMS: Introduction, Components, GA cycle of reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Applications | 08

Text Books and References:

1. Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (India) Private Limited, 2013.
2. Ethem Alpaydin, "Introduction to Machine Learning" (Adaptive Computation and Machine Learning), MIT Press.
3. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
4. C. Bishop, "Pattern Recognition and Machine Learning", Berlin: Springer-Verlag.
5. M. Gopal, "Applied Machine Learning", McGraw Hill Education.
Meerut Institute of Engineering & Technology, Meerut
Lesson Plan / Teaching Plan / Lecture Plan with Progress:
B. Tech – VI Semester: 2024-25

Course Name MLT


Code BCAI601
Faculty

Topics/lectures are arranged in the same sequence as they are to be taught in class. Data related to "Date" is maintained in the hard copy.

S. No. | Lecture No. | CO  | Topic Description | Pedagogy | Reference & Teaching Material
1      | 1           | CO1 | Learning, Types of Learning, Well-defined learning problems | White Board | 1, 2
2      | 2           | CO1 | Designing a Learning System | White Board | -
3      | 3           | CO1 | History of ML | White Board | 1, 2
4      | 4           | CO1 | Introduction of Machine Learning Approaches - Artificial Neural Network | PPT, White Board | 1
5      | 5           | CO1 | Clustering, Reinforcement Learning | White Board | 1
6      | 6           | CO1 | Decision Tree Learning, Bayesian networks | PPT, White Board | 1, 2
7      | 7           | CO1 | Support Vector Machine, Genetic Algorithm | PPT, White Board | 1
8      | 8           | CO1 | Issues in Machine Learning and Data Science vs. Machine Learning | PPT, White Board | 1, 2
9      | 9           | CO2 | Linear Regression and Logistic Regression | PPT, White Board | 1
10     | 10          | CO2 | Bayes theorem, Concept learning, Bayes Optimal Classifier | PPT, White Board | 1
11     | 11          | CO2 | Naïve Bayes classifier, Bayesian belief networks | PPT, White Board | 1, 2
12     | 12          | CO2 | EM algorithm | PPT, White Board | 1, 2, 3
13     | 13          | CO2 | SUPPORT VECTOR MACHINE: Introduction | PPT, White Board | 1, 2
14     | 14          | CO2 | Types of support vector kernel - (Linear kernel, polynomial kernel and Gaussian kernel) | PPT, White Board | 1, 2
15     | 15          | CO2 | Hyperplane - (Decision surface) | PPT, White Board | 1, 2
16     | 16          | CO2 | Properties of SVM, and Issues in SVM | PPT, White Board | 1, 2, 3
17     | 17          | CO3 | Decision tree learning algorithm | PPT, White Board | 1, 2, 3
18     | 18          | CO3 | Inductive bias, Inductive inference with decision trees | PPT, White Board | 1, 2, 3
19     | 19          | CO3 | Entropy and information theory, Information gain | PPT, White Board | 1, 2
20     | 20          | CO3 | ID-3 Algorithm, Issues in Decision tree learning | PPT, White Board | 1, 2, 3
21     | 21          | CO3 | k-Nearest Neighbour Learning | PPT, White Board | 1, 2, 3
22     | 22          | CO3 | Locally Weighted Regression | PPT, White Board | 1, 2
23     | 23          | CO3 | Radial basis function networks | PPT, White Board | 1, 2, 3
24     | 24          | CO3 | Case-based learning | PPT, White Board | 1, 2, 3
25     | 25          | CO4 | Perceptrons, Multilayer perceptron | PPT, White Board | 1
26     | 26          | CO4 | Gradient descent and the Delta rule | PPT, White Board | 1, 2
27     | 27          | CO4 | Multilayer networks | PPT, White Board | 1, 2, 3
28     | 28          | CO4 | Derivation of Backpropagation Algorithm | PPT, White Board | 1, 2
29     | 29          | CO4 | Generalization, Unsupervised Learning - SOM Algorithm and its variants | PPT, White Board | 1, 2
30     | 30          | CO4 | Introduction, concept of convolutional neural network, Types of layers - (Convolutional Layers, Activation function, pooling, fully connected) | PPT, White Board | 1, 2, 3
31     | 31          | CO4 | Concept of Convolution (1D and 2D) layers, Training of network | PPT, White Board | 1, 2
32     | 32          | CO4 | Case study of CNN, e.g., on Diabetic Retinopathy, Building a smart speaker, Self-driving car, etc. | PPT, White Board | 1, 2
33     | 33          | CO5 | Introduction to Reinforcement Learning, Learning Task | PPT, White Board | 1, 2
34     | 34          | CO5 | Example of Reinforcement Learning in Practice | PPT, White Board | 1, 2, 3
35     | 35          | CO5 | Learning Models for Reinforcement - (Markov Decision process, Q Learning - Q Learning function, Q Learning Algorithm) | PPT, White Board | 1
36     | 36          | CO5 | Application of Reinforcement Learning, Introduction to Deep Q Learning | PPT, White Board | 1
37     | 37          | CO5 | GENETIC ALGORITHMS: Introduction, Components | PPT, White Board | 1, 2
38     | 38          | CO5 | GA cycle of reproduction, Crossover | PPT, White Board | 1, 2
39     | 39          | CO5 | Mutation, Genetic Programming | PPT, White Board | 1, 2
40     | 40          | CO5 | Models of Evolution and Learning, Applications | PPT, White Board | 1, 2
41     | 41          | CO5 | Industrial Applications & Case Studies | PPT, White Board | 1, 2

Table of Contents

Table of Contents ........................................................................................................................................ 1


UNIT 1 – Introduction to ML ...................................................................................................................... 7
1.1 Machine learning ................................................................................................................................... 7
1.2 What is Machine Learning? ................................................................................................................... 7
1.3 Features of Machine learning ................................................................................................................ 7
1.4 Types of Machine Learning ................................................................................................................... 8
1.4.1 Supervised Machine Learning ......................................................................................................... 8
1.4.2 Unsupervised Machine Learning ............................................................................................ 10
1.4.3 Semi-Supervised Learning ............................................................................................. 12
1.4.4 Reinforcement Learning ................................................................................................ 13
1.4.5 Categories of Reinforcement Learning......................................................................... 14
1.5 Designing a Learning System in Machine Learning ........................................................................... 15
1.5.1 Following are the qualities that you need to keep in mind while Designing a learning
system:.................................................................................................................................................... 16
1.6 History of Machine Learning ............................................................................................................... 18
1.6.1 The early history of Machine Learning (Pre-1940): .................................................................... 18
1.6.2 Computer machinery and intelligence: ......................................................................................... 18
1.7 Artificial Neural Network .................................................................................................................... 21
1.7.1 What is Artificial Neural Network?.............................................................................................. 21
1.7.2 Relationship between Biological and artificial neural network: .................................................. 21
1.7.3 The Architecture of an artificial neural network:.......................................................................... 22
1.7.4 How do artificial neural networks work?...................................................................................... 25
1.8 Clustering ............................................................................................................................................. 27
1.8.1 Types of Clustering Methods ........................................................................................................ 28
1.8.2 Clustering Algorithms ................................................................................................................... 31
1.8.3 Applications of Clustering ............................................................................................................. 31
1.9 Decision Tree Classification Algorithm ............................................................................................... 33
1.9.1 Why use Decision Trees? .............................................................................................................. 34
1.9.2 Decision Tree Terminologies ........................................................................................................ 34
1.9.3 How does the Decision Tree algorithm Work? ............................................................................ 34
1.9.4 Attribute Selection Measures ........................................................................................................ 35
1.9.5 Pruning: Getting an Optimal Decision tree ................................................................................... 37
1.10 Bayesian Belief Network in artificial intelligence ............................................................................. 38
1.10.1 Joint probability distribution: ...................................................................................................... 39
1.10.2 Explanation of Bayesian network: .............................................................................................. 40
1.10.3 The semantics of Bayesian Network: ......................................................................................... 43
1.11 Support Vector Machine Algorithm ................................................................................................... 44
1.11.1 Types of SVM ............................................................................................................................. 45
1.11.2 Hyperplane and Support Vectors in the SVM algorithm: .......................................................... 45
1.11.3 How does SVM works? .............................................................................................................. 46
1.12 Genetic Algorithm in Machine Learning ........................................................................................... 49
1.12.1 Selection ...................................................................................................................................... 51
1.12.2 How Genetic Algorithm Work?.................................................................................................. 51
1.12.3 Mutation ...................................................................................................................... 53
1.13 Issues in Machine Learning ................................................................................................................ 56
1.13.1 Common issues in Machine Learning ........................................................................................ 56
1.13.2 Methods to remove Data Bias: ................................................................................................... 59
1.14 Difference between Data Science and Machine Learning: ............................................................... 60
1.15 Important Question (Previous Year Questions) ................................................................................. 62
UNIT 2- Regression, Bayesian Network, SVM......................................................................................... 63
2.1 Linear Regression........................................................................................................................ 63
2.2 Logistic Regression: ................................................................................................................. 65
2.3 Linear Regression vs Logistic Regression ............................................................................................ 66
2.3.1 Difference between Linear Regression and Logistic Regression: ............................................... 67
2.4 Bayes Theorem in Machine learning .................................................................................................... 69
2.4.1 Introduction to Bayes Theorem in Machine Learning .................................................................... 69
2.4.2 Prerequisites for Bayes Theorem .................................................................................................. 70
2.4.3 How to apply Bayes Theorem or Bayes rule in Machine Learning? ............................ 74
2.5 Concept Learning in Machine Learning .............................................................................................. 74
2.5.1 A CONCEPT LEARNING TASK – Search ................................................................................. 77
2.5.2 General-to-Specific Ordering of Hypotheses ............................................................................... 79
2.6 Bayes Optimal Classifier and Naive Bayes Classifier ........................................................................ 80
2.7 What is Naïve Bayes Classifier in Machine Learning ......................................................................... 81
2.7.1 Advantages of Naïve Bayes Classifier in Machine Learning: ...................................................... 81
2.7.2 Disadvantages of Naïve Bayes Classifier in Machine Learning: ................................................. 81
2.8 Bayesian Belief Network in artificial intelligence ............................................................................... 81

2.9 EM Algorithm in Machine Learning.................................................................................................... 84
2.9.1 What is an EM algorithm? ............................................................................................................ 84
2.9.2 EM Algorithm ............................................................................................................................... 85
2.9.3 What is Convergence in the EM algorithm? ................................................................ 86
2.9.4 Steps in EM Algorithm ................................................................................................. 86
2.9.5 Gaussian Mixture Model (GMM) ................................................................................. 87
2.9.6 Applications of EM algorithm ...................................................................................... 87
2.10 Support Vector Machine Algorithm ................................................................................................... 89
2.11 Types of SVM .................................................................................................................................... 91
2.11.1 Kernel Method in SVMs ............................................................................................................. 92
2.11.2 Major Kernel Function in Support Vector Machine ................................................................... 93
2.12 Hyperplane and Support Vectors in the SVM algorithm: ................................................................. 96
2.12.1 Support Vectors: .......................................................................................................................... 96
2.13 Properties of SVM ........................................................................................................................... 100
2.13.1 The Disadvantages of Support Vector Machine (SVM) are: ................................................... 100
2.14 Important Questions (Previous Year Question) ............................................................................... 102
UNIT 3 – Decision Tree Learning ........................................................................................................... 103
3.1 Decision Tree Classification Algorithm ............................................................................................ 103
3.1.1 Why use Decision Trees? ............................................................................................................ 104
3.1.2 How does the Decision Tree algorithm Work? ........................................................................... 105
3.1.3 Attribute Selection Measures ...................................................................................................... 106
3.2 Inductive Bias .................................................................................................................................... 108
3.3 Inductive inference with decision tree ............................................................................................... 108
3.4 What is Inductive Learning Algorithm? ............................................................................................ 108
3.5 Entropy and Information Gain ........................................................................................................... 110
3.6 What is Information Gain?................................................................................................................. 110
3.7 Key Differences between Entropy and Information Gain ................................................................. 111
3.8 ID3 Algorithm .................................................................................................................................... 113
3.9 k-NN Learning ................................................................................................................................... 125
3.10 Locally Weighted Regression .......................................................................................................... 128
3.11 Radial Basis Function Networks ...................................................................................................... 132
3.11.1 What are Radial Basis Functions? ............................................................................................. 132
3.11.2 How Do RBF Networks Work? ................................................................................................ 132

3.11.3 Key Characteristics of RBFs ..................................................................................................... 132
3.11.4 Architecture of RBF Networks ................................................................................................. 133
3.11.5 Training Process of radial basis function neural network ......................................................... 133
3.12 Case-Based Learning ....................................................................................................................... 135
3.12.1 Challenges with CBR ................................................................................................................ 135
3.13 Important Questions (PYQs)............................................................................................................ 136
UNIT 4- Artificial Neural Networks........................................................................................................ 137
4.1 Perceptron in Machine Learning ........................................................................................................ 137
4.1.1 What is the Perceptron model in Machine Learning? ................................................................. 137
4.1.2 What is Binary classifier in Machine Learning? ......................................................................... 138
4.1.3 Basic Components of Perceptron ................................................................................................ 138
4.2 How does Perceptron work? .............................................................................................................. 139
4.3 Types of Perceptron Models .............................................................................................................. 141
4.3.1 Advantages of Multi-Layer Perceptron: ...................................................................................... 142
4.3.2 Disadvantages of Multi-Layer Perceptron: ................................................................................. 142
4.4 Gradient Descent in Machine Learning ............................................................................................. 144
4.4.1 What is Gradient Descent or Steepest Descent? ......................................................................... 144
4.4.2 What is Cost-function? ................................................................................................................ 145
4.4.3 How does Gradient Descent work? ............................................................................................. 146
4.4.4 Direction & Learning Rate .......................................................................................................... 147
4.4.5 Learning Rate: ............................................................................................................................. 147
4.4.6 Types of Gradient Descent ............................................................................................ 147
4.5 Multilayer Networks .......................................................................................................................... 151
4.5.1 Formula for Multi-Layered Neural Network .............................................................................. 152
4.6 Derivation of Backpropagation ......................................................................................................... 153
4.6.1 Notation ....................................................................................................................................... 154
4.7 Review of Calculus Rules .................................................................................................................. 155
4.7.1 Gradient Descent on Error........................................................................................................... 155
4.7.2 Derivative of the error with respect to the activation .................................................................. 156
4.7.3 Derivative of the activation with respect to the net input ........................................................... 156
4.7.4 Derivative of the net input with respect to a weight ................................................................... 156
4.7.5 Weight change rule for a hidden to output weight ...................................................................... 157
4.7.6 Weight change rule for an input to hidden weight ...................................................................... 157

4.8 Generalization .................................................................................................................................... 162
4.8.1 Difference Between Memorization and Generalization .............................................................. 162
4.8.2 Generalization vs. Overfitting ..................................................................................................... 162
4.8.3 Theoretical Foundations of Generalization ................................................................................. 162
4.9 Self Organizing Maps ........................................................................................................................ 162
4.9.1 How do SOM works? .................................................................................................................. 163
4.9.2 Algorithm .................................................................................................................................... 164
4.10 Convolutional Neural Network (CNN) ............................................................................................ 165
4.10.1 The importance of CNNs .......................................................................................................... 165
4.10.2 Inspiration Behind CNN and Parallels With The Human Visual System ................................. 165
4.10.3 Key Components of a CNN ...................................................................................................... 167
4.11 Convolution layers ........................................................................................................................... 168
4.11.1 Do we have to manually find these weights? ............................................................................ 169
4.11.2 Activation function .................................................................................................................... 170
4.11.3 Pooling layer ............................................................................................................. 170
4.11.4 Fully connected layers ............................................................................................................... 171
4.11.5 Overfitting and Regularization in CNNs ................................................................................... 171
4.11.6 Seven strategies to mitigate overfitting in CNNs ...................................................................... 172
4.11.7 Practical Applications of CNNs ................................................................................................ 173
4.12 Important Questions (PYQs)............................................................................................................ 174
UNIT 5 - Reinforcement Learning .......................................................................................................... 175
5.1 What is Reinforcement Learning? ..................................................................................................... 175
5.2 Terms used in Reinforcement Learning ............................................................................................. 175
5.3 Key Features of Reinforcement Learning .......................................................................................... 176
5.4 Approaches to implement Reinforcement Learning .......................................................................... 176
5.4.1 Value-based: ................................................................................................................................ 176
5.4.2 Policy-based: ............................................................................................................................... 177
5.5 Elements of Reinforcement Learning ................................................................................................ 177
5.6 Reinforcement learning. ..................................................................................................................... 178
5.7 Differentiate between reinforcement and supervised learning. ......................................................... 178
5.8 Types of reinforcement learning: ....................................................................................................... 180
5.9 Different machine learning task. ........................................................................................................ 181
5.10 Reinforcement learning with the help of an example. ..................................................................... 183

5.10.1 Working of Reinforcement Learning: ....................................................................................... 183
5.10.2 Terms used in reinforcement learning method. ......................................................................... 183
5.10.3 Approaches used to implement reinforcement learning algorithm. .......................................... 184
5.11 Learning models of reinforcement learning ................................................................................. 185
5.11. 1 Challenges of reinforcement learning ...................................................................................... 185
5.12 Q-learning .................................................................................................................................... 185
5.12.1 Q- Learning algorithm ......................................................................................................... 186
5.13 Application of reinforcement learning ............................................................................................. 187
5.14 Describe deep Q-learning. ............................................................................................................... 187
5.14.1 Steps involved in reinforcement learning using deep Q-learning networks: ............................ 187
5.14.2 Pseudo code for deep Q-learning. ............................................................................................. 188
5.15 Genetic algorithm......................................................................................................................... 189
5.15.1 Procedure of Genetic algorithm: ............................................................................................... 189
5.15.2 Advantages of genetic algorithm: ............................................................................................. 189
5.15.3 Disadvantages of Genetic algorithm: ........................................................................................ 189
5.16 Cycle of genetic algorithm ........................................................................................................... 190
5.17 Mutation ....................................................................................................................................... 191
5.18 Genetic Programming ..................................................................................................................... 192
5.18.1 Key Components of Genetic Programming .............................................................................. 192
5.19 Types of encoding in Genetic Algorithm......................................................................................... 193
5.20 The various methods of selecting .................................................................................................... 194
5.21 Roulette-wheel based on fitness v/s Roulette-wheel based on rank ................................................ 196
5.21.1 Roulette-wheel based on fitness ................................................................................................ 196
5.21.2 Roulette-wheel based on Rank .................................................................................................. 196
5.22 Applications of genetic algorithms .................................................................................................. 196
5.23 Industrial Application ...................................................................................................................... 197
5.23.1 Optimization of travelling salesman problem using genetic algorithm .................................... 197
5.23.2 Convergence of genetic algorithm .......................................................................... 197
5.24 Important Questions ......................................................................................................................... 198

UNIT 1 – Introduction to ML

1. INTRODUCTION

Learning, Types of Learning, Well-defined Learning Problems, Designing a Learning System

WHY:   a. To understand the basics of Machine Learning and the types of Learning.
       b. To understand the History of Machine Learning.

WHAT:  a. Implement various algorithms of Supervised, Unsupervised and Reinforcement Machine Learning.

WHERE: a. In the selection of Datasets for various Machine Learning Problems.
       b. Applications of Clustering and Classification.

Lecture: 1
1.1 Machine Learning

Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on developing systems that learn, or improve performance, based on the data they ingest. Artificial intelligence is a broad term that refers to systems or machines that resemble human intelligence. Machine learning and AI are frequently discussed together, and the terms are occasionally used interchangeably, although they do not mean the same thing. A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.

1.2 What is Machine Learning?


Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one could ever come across. As the name suggests, it gives the computer the quality that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.

1.3 Features of Machine learning


 Machine learning is a data-driven technology. Organizations generate large amounts of data daily, so by discovering notable relationships in the data, they can make better decisions.
 A machine can learn from past data on its own and improve automatically.
 From a given dataset, it detects various patterns in the data.
 For big organizations, branding is important, and it becomes easier to target a relatable customer base.
 It is similar to data mining because it also deals with huge amounts of data.
1.4 Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data. Data is fed to
these algorithms to train them, and on the basis of training, they build the model & perform a
specific task.
These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
Based on the methods and the way of learning, machine learning is mainly divided into four types, which are:
1 Supervised Machine Learning
2 Unsupervised Machine Learning
3 Semi-Supervised Machine Learning
4 Reinforcement Learning

Figure 1.1: Types of ML

In this topic, we will provide a detailed description of the types of Machine Learning along
with their respective algorithms.

1.4.1 Supervised Machine Learning


As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we can say that first we train the machine with the input and corresponding output, and then we ask the machine to predict the output using the test dataset.

8
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images, using features such as the shape and size of the tail, the shape of the eyes, colour, and height (dogs are taller, cats are smaller). After training, we input the picture of a cat and ask the machine to identify the object and predict the output. Since the machine is well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, tail, and so on, and find that it is a cat. So, it will put it in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable(x)
with the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems, which are given
below:
o Classification
o Regression

• Classification
Classification algorithms are used to solve classification problems in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are Spam Detection, Email filtering, etc.
Some popular classification algorithms are given below:
• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm

• Regression
Regression algorithms are used to solve regression problems, in which the output variable is continuous and there is a relationship between the input and output variables. These are used to predict continuous output values, such as market trends, weather prediction, etc.
Some popular Regression algorithms are given below:

• Simple Linear Regression Algorithm


• Multivariate Regression Algorithm
• Decision Tree Algorithm

• Lasso Regression
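
To make the supervised workflow concrete (train on labelled data, then predict on a held-out test set), here is a minimal sketch in Python; the scikit-learn library, the Iris dataset, and the decision tree classifier are illustrative assumptions, not prescribed by the syllabus:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled dataset: X holds input features, y holds the known output labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # learn the mapping from input x to output y
y_pred = model.predict(X_test)       # predict outputs for unseen test inputs
print("Accuracy:", accuracy_score(y_test, y_pred))

The same fit/predict pattern applies to the regression algorithms listed above; only the model class and the evaluation metric change.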

 Advantages:
1. Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior experience.
 Disadvantages:
1. These algorithms are not able to solve complex tasks.
2. The model may predict the wrong output if the test data is different from the training data.
3. It requires a lot of computational time to train the algorithm.

 Applications of Supervised Learning


Some common applications of Supervised Learning are given below:
i. Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
ii. Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
iii. Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
iv. Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
v. Speech Recognition - Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using the
same, such as voice-activated passwords, voice commands, etc.

1.4.2 Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output
without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
So, the machine will discover its own patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.

 Categories of Unsupervised Machine Learning


Unsupervised Learning can be further classified into two types, which are given below:
i. Clustering
ii. Association

 Clustering
The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the
most similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by their
purchasing behaviour.
Some of the popular clustering algorithms are given below:
o K-Means Clustering algorithm
o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
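
To make the clustering idea concrete, the following small sketch groups unlabelled points into two clusters with K-Means; scikit-learn is assumed to be available, and the six 2-D points are made-up data:

import numpy as np
from sklearn.cluster import KMeans

# Six unlabelled 2-D points: three near x=1 and three near x=10.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each point, e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)  # coordinates of the two cluster centres

Note that no labels were given; the algorithm infers the two groups purely from the similarity of the points.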

Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm
is to find the dependency of one data item on another data item and map those variables
accordingly so that it can generate maximum profit. This algorithm is mainly
applied in Market Basket analysis, Web usage mining, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-
growth algorithm.
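
As a small illustration of frequent-itemset mining in the Apriori style, the sketch below uses the third-party mlxtend library (an assumption; it must be installed separately) on four made-up market-basket transactions:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Made-up market-basket transactions.
transactions = [["milk", "bread", "butter"],
                ["milk", "bread"],
                ["bread", "butter"],
                ["milk", "butter"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Itemsets appearing in at least 50% of the transactions.
frequent = apriori(df, min_support=0.5, use_colnames=True)
print(frequent)

From these frequent itemsets, association rules (e.g. "bread implies butter") can then be derived by comparing support and confidence values.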
Advantages and Disadvantages of Unsupervised Learning Algorithms

Advantages:
o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output beforehand.
o Working with unsupervised learning is more difficult, as it works with an unlabelled dataset that does not map to a known output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright
in document network analysis of text data for scholarly articles.
o Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-
commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning,
which can identify unusual data points within the dataset. It is used to discover fraudulent
transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting information of each
user located at a particular location.
1.4.3 Semi-Supervised Learning
Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised (with labelled training data) and unsupervised (with no labelled training data) learning algorithms, and it uses a combination of labelled and unlabelled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabelled data. Labels are costly to obtain, so for practical purposes an organization may have only a few of them. This approach differs from supervised and unsupervised learning, which are based on the presence or absence of labels.
The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms.
The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, which then helps to label the unlabelled data. This is done because labelled data is comparatively more expensive to acquire than unlabelled data.

We can imagine these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. Further, if that student analyses the same concept on their own without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student first studies a concept under the guidance of an instructor at college and then revises it on their own.
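
The following sketch shows the idea with scikit-learn's SelfTrainingClassifier, which wraps a supervised base model and iteratively labels the unlabelled samples (marked with -1); hiding 70% of the Iris labels is an illustrative assumption:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1   # -1 marks a sample as unlabelled

# The wrapped SVC must expose class probabilities, hence probability=True.
model = SelfTrainingClassifier(SVC(probability=True, gamma="auto"))
model.fit(X, y_partial)                  # trains on labelled + unlabelled data together
print(model.score(X, y))                 # accuracy against the full ground truth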

Advantages and disadvantages of Semi-supervised Learning


Advantages:
o It is simple and easy to understand the algorithm.
o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:
o Iteration results may not be stable.
o We cannot apply these algorithms to network-level data.
o Accuracy is low.

1.4.4 Reinforcement Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial: taking actions, learning from experiences, and improving its performance.
The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data as in supervised learning; agents learn from their experiences only.
The reinforcement learning process is similar to that of a human being; for example, a child learns various things through experiences in his day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
1.4.5 Categories of Reinforcement Learning
Reinforcement learning is mainly categorized into two types of methods/algorithms:
o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing
the tendency that the required behaviour would occur again by adding something. It
enhances the strength of the behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to the positive RL. It increases the tendency that the specific behaviour would
occur again by avoiding the negative condition.

 Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Some popular game-playing systems that use RL algorithms are AlphaGo and AlphaGo Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL to automatically learn to schedule computer resources across waiting jobs in order to minimize average job slowdown.

o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial
and manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology .

o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the
help of Reinforcement Learning by Salesforce company.
I. Advantages
o It helps in solving complex real-world problems that are difficult to solve with general techniques.
o The learning model of RL is similar to the learning of human beings; hence, highly accurate results can be obtained.
o It helps in achieving long-term results.
II. Disadvantage
o RL algorithms are not preferred for simple problems.
o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can
weaken the results.
o The curse of dimensionality limits reinforcement learning for real physical systems.
Lecture: 2

1.5 Designing a Learning System in Machine Learning


o Designing a learning system in machine learning requires careful consideration of
several key factors, including the type of data being used, the desired outcome, and the
available resources. In this article, we will explore the key steps involved in designing
a learning system in machine learning and discuss some best practices to keep in mind .
o The first step in designing a learning system in machine learning is to identify the type
of data that will be used. This can include structured data, such as numerical and
categorical data, as well as unstructured data, such as text and images. The type of data
will determine the type of machine learning algorithms that can be used and the
preprocessing steps required.
o Once the data has been identified, the next step is to determine the desired outcome
of the learning system. This can include classifying data, making predictions, or
identifying patterns in the data. The desired outcome will determine the type of machine
learning algorithm that should be used, as well as the evaluation metrics that will be
used to measure the performance of the learning system.
o Next, the resources available for the learning system must be considered. This includes the
amount of data available, the computational power available, and the amount of time
available to train the model. These resources will determine the complexity of the machine
learning algorithm that can be used and the amount of data that can be used for training.
o Once the data, desired outcome, and resources have been identified, it is time to select a
machine-learning algorithm and begin the training process. Decision trees, SVMs, and
neural networks are examples of common algorithms. It is crucial to assess the
effectiveness of the learning system using the right assessment measures, such as recall,
accuracy, and precision.
o After the learning system is trained, it is important to fine-tune the model by adjusting the
parameters and hyperparameters. This can be done using techniques such as cross-
validation and grid search. The final model should be tested on a hold-out test set to
evaluate its performance on unseen data.
o When constructing a machine learning system, there are some other recommended
practices to bear in mind in addition to these essential steps. A crucial factor to
take into account is making sure that the training data are representative of the data that will
be encountered in the real world. To help with this, the data may be divided into training,
validation, and test sets.
o Another best practice is to use appropriate regularization techniques to prevent overfitting.
This can include techniques such as L1 and L2 regularization and dropout. It is also
important to use feature scaling and normalization to ensure that the data is in a format
that is suitable for the machine learning algorithm being used.
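As a concrete illustration of the workflow described above, the following is a minimal
scikit-learn sketch that splits the data, scales the features inside a pipeline, tunes a
hyperparameter with cross-validated grid search, and evaluates on a held-out test set. The
synthetic dataset, the SVC model, and the parameter grid are illustrative placeholders,
not prescriptions from this text.

# Minimal sketch: split -> scale -> grid search with cross-validation -> test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a test set to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling and the model live in one pipeline, so scaling is fit only on
# the training folds during cross-validation (avoids data leakage).
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Grid search over a small illustrative hyperparameter grid.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Test accuracy:", accuracy_score(y_test, grid.predict(X_test)))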

1.5.1 Following are the qualities that you need to keep in mind while
designing a learning system:
i. Reliability
The system must be capable of carrying out the proper task at the appropriate degree of
performance in a given setting. Testing the dependability of ML systems that learn
from data is challenging because a system's failure need not result in an error; instead, it
could simply produce garbage results, meaning that some results were produced even
though the system had not been trained with the corresponding ground truth.
When a typical system fails, you receive an error message, such as "the crew is addressing
a technical issue and will return soon."
When a machine learning (ML) system fails, it usually does so silently. For
instance, when translating from English to Hindi or vice versa, even if the model has not
seen all of the words, it may nevertheless produce a translation that is illogical.

ii. Scalability
There should be practical methods for coping with the system's expansion as it changes (in
terms of data amount, traffic volume, or complexity). Because certain essential applications
might lose millions of dollars or their credibility with just one hour of outage or
failure, there should be an automated provision to grow computing and storage capacity.
For instance, if a feature on an e-commerce website fails to function as planned on a busy
day, it might result in a loss of millions of dollars in sales.

iii. Maintainability
The performance of the model may fluctuate as a result of changes in data distribution
over time. In the ML system, there should be a provision to first determine whether there
is any model drift or data drift and, once a major drift is noticed, to re-train/refresh
and deploy new ML models without interfering with the ML system's present
functioning (a minimal drift-check sketch follows this list).
iv. Adaptability
The availability of fresh data with increased features or changes in business objectives,
such as conversion rate vs. customer engagement time for e-commerce, are the other
changes that occur most frequently in machine learning (ML) systems. As a result, the
system has to be adaptable to fast upgrades without causing any service disruptions.
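
As a hedged sketch of the drift check mentioned under Maintainability: one simple way to
flag data drift for a single numeric feature is to compare its training-time distribution
with its live distribution using a two-sample Kolmogorov-Smirnov test. The arrays and the
0.05 threshold below are illustrative assumptions, not a prescribed procedure.

# Minimal data-drift check for one numeric feature using SciPy.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # feature at training time
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # shifted live feature

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:  # the two distributions differ significantly
    print("Drift detected: consider re-training/refreshing the model.")
else:
    print("No significant drift detected.")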

 Data
i. Feature expectations are recorded in a schema: the ranges of the feature values are
carefully captured to avoid any unanticipated value, which can result in a garbage
answer. For example, human age and height have expected value ranges and can't be
implausibly large, like an age of 150+ or a height of 10 feet.
ii. All features are advantageous; features introduced to the system should be valuable in
some way, such as being a predictor or an identifier, as each feature has a handling
cost.
iii. No feature should cost more than it is worth; each new feature should be evaluated in
terms of cost vs. benefits in order to eliminate those that would be difficult to implement
or manage.
iv. The data pipeline has the necessary privacy protections in place; for instance, personally
identifiable information (PII) should be managed carefully because any leaking of sensitive
information may have legal repercussions.
v. If any new external component has an influence on the system, it will be easier to introduce
new features to boost system performance.
vi. All input feature code, including one-hot encoding/binning features and the handling
of unseen levels in one-hot encoded features, must be checked in order to avoid any
intermediate values from departing from the desired range.
 Model
1. Model specifications are evaluated and submitted; for quicker re-training, correct
versioning of the model learning code is required.
2. Correlation between offline and online metrics: model metrics (log loss, MAPE, MSE) should
be strongly associated with the application's goal, such as revenue/cost/time.
3. Hyperparameters like learning rates, the number of layers, the size of the layers, the
maximum depth, and regularization coefficients must be modified for the use case because
the selection of hyperparameter values can significantly affect the accuracy of predictions.
4. To support the most recent model in production, it is important to understand how
frequently to retrain models depending on changes in data distribution. The influence of
model staleness should be known.
5. Simple linear models with high-level characteristics are a good starting point for functional
testing and doing cost-benefit analyses when compared to more complex models.
However, a simpler model is not always better.
6. Model performance must be assessed using adequately representative data to ensure
that model quality is satisfactory on significant data slices.

Lecture: 3
1.6 History of Machine Learning
A few decades ago (about 40-50 years), machine learning was science fiction, but today
it is part of our daily life. Machine learning makes our day-to-day life easier, from
self-driving cars to Amazon's virtual assistant "Alexa". However, the idea behind machine
learning is quite old and has a long history. Below are some milestones that have
occurred in the history of machine learning:
1.6.1 The early history of Machine Learning (Pre-1940):
 1834: In 1834, Charles Babbage, the father of the computer, conceived a device
that could be programmed with punch cards. However, the machine was never
built, but all modern computers rely on its logical structure.
 1936: In 1936, Alan Turing gave a theory of how a machine can determine
and execute a set of instructions.
 The era of stored program computers:
 1945: In 1945, "ENIAC", the first electronic general-purpose computer, was
built. After that, stored-program computers such as EDSAC in 1949 and EDVAC
in 1951 were invented.
 1943: In 1943, a human neural network was modeled with an electrical circuit. In
1950, scientists started applying the idea to practice and analyzed how human
neurons might work.
1.6.2 Computing machinery and intelligence:
 1950: In 1950, Alan Turing published a seminal paper, "Computing Machinery and
Intelligence," on the topic of artificial intelligence. In his paper, he asked, "Can
machines think?"

Machine intelligence in Games:


 1952: Arthur Samuel, a pioneer of machine learning, created a program
that helped an IBM computer play a game of checkers. It performed better the more
it played.
 1959: In 1959, the term "Machine Learning" was first coined by Arthur Samuel.
The first "AI" winter:
 The period from 1974 to 1980 was a tough time for AI and ML researchers,
and this period is called the AI winter.
 During this period, machine translation failed, and people lost interest
in AI, which led to reduced government funding for
research.

Machine Learning from theory to reality
 1959: In 1959, the first neural network was applied to a real-world problem to
remove echoes over phone lines using an adaptive filter.
 1985: In 1985, Terry Sejnowski and Charles Rosenberg invented a neural network
NETtalk, which was able to teach itself how to correctly pronounce 20,000 words
in one week.
 1997: IBM's Deep Blue intelligent computer won a chess match against the
chess champion Garry Kasparov, and it became the first computer to beat a
human chess expert.
Machine Learning in the 21st century
 2006:
Geoffrey Hinton and his group presented the idea of deep learning
using deep belief networks.
The Elastic Compute Cloud (EC2) was launched by Amazon to provide scalable
computing resources that made it easier to create and implement machine
learning models.
 2007:
The Netflix Prize competition began, tasking participants with improving the
accuracy of Netflix's recommendation algorithm.
Reinforcement learning made notable progress when a group of researchers used it
to train a computer to play backgammon at a high level.
 2008:
Google released the Google Prediction API, a cloud-based service
that allowed developers to integrate machine learning into their applications.
Restricted Boltzmann Machines (RBMs), a kind of generative neural network,
gained attention for their ability to model complex data
distributions.
 2009:
Deep learning gained ground as researchers demonstrated its effectiveness in various
tasks, including speech recognition and image classification.
The term "Big Data" gained popularity, highlighting the challenges
and opportunities associated with handling huge datasets.
 2010:
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was
introduced, driving advancements in computer vision and prompting the development of
deep convolutional neural networks (CNNs).
 2011:
IBM's Watson defeated human champions on Jeopardy!, demonstrating the
potential of question-answering systems and natural language processing.
 2012:
AlexNet, a deep CNN created by Alex Krizhevsky, won the ILSVRC,
significantly improving image classification accuracy and establishing deep
learning as a dominant approach in computer vision.
Google's Brain project, led by Andrew Ng and Jeff Dean, used
deep learning to train a neural network to recognize cats
from unlabeled YouTube videos.
 2013:
Ian Goodfellow introduced generative adversarial networks (GANs), which made it
possible to create realistic synthetic data.
Google later acquired the startup DeepMind Technologies, which focused on deep
learning and artificial intelligence.
 2014:
Facebook presented the DeepFace system, which achieved near-human
accuracy in facial recognition.
DeepMind's AlphaGo program went on to defeat a world-champion
Go player, demonstrating the potential of reinforcement learning in challenging
games.
 2015:
Microsoft released the Cognitive Toolkit (CNTK), an open-
source deep learning library.
The performance of sequence-to-sequence models in tasks like machine
translation was enhanced by the introduction of attention mechanisms.
 2016:
The goal of explainable AI, which focuses on making machine learning models
easier to understand, began to receive attention.
Google's DeepMind created AlphaGo Zero, which achieved superhuman Go-
playing ability without human gameplay data, using only reinforcement learning.
 2017:
Transfer learning gained prominence, allowing pretrained models to be
reused for different tasks with limited data.
Better synthesis and generation of complex data were made possible by the
introduction of generative models like variational autoencoders (VAEs) and
Wasserstein GANs.
These are only some of the notable advancements and milestones in machine learning
during this period. The field continued to advance rapidly beyond 2017, with new
breakthroughs, techniques, and applications emerging.

Lecture: 4

1.7 Artificial Neural Network


The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An Artificial neural network is usually a
computational network based on biological neural networks that construct the structure
of the human brain. Similar to a human brain has neurons interconnected to each other,
artificial neural networks also have neurons that are linked to each other in various layers
of the networks. These neurons are known as nodes.
1.7.1 What is Artificial Neural Network?
The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are known
as nodes.

Figure 1.0.1: Architecture of an ANN

Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks, cell
nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
1.7.2 Relationship between Biological and artificial neural network:

Table 1.1: BNN v/s ANN

Biological Neural Network          Artificial Neural Network

Dendrites                          Inputs

Cell nucleus                       Nodes

Synapse                            Weights

Axon                               Output

An Artificial Neural Network is an attempt, in the field of Artificial Intelligence, to
mimic the network of neurons that makes up a human brain so that computers will have an
option to understand things and make decisions in a human-like manner. The artificial
neural network is designed by programming computers to behave simply like
interconnected brain cells.
There are around 100 billion neurons in the human brain. Each neuron has an association
point somewhere in the range of 1,000 to 100,000. In the human brain, data is stored
in such a manner as to be distributed, and we can extract more than one piece of this data
when necessary from our memory in parallel. We can say that the human brain is made
up of incredibly amazing parallel processors.
We can understand the artificial neural network with the example of a
digital logic gate that takes an input and gives an output. Consider an "OR" gate, which takes
two inputs. If one or both of the inputs are "On," then we get "On" as output. If both
inputs are "Off," then we get "Off" as output. Here the output depends upon the input. Our
brain does not perform the same task: the output-to-input relationship keeps changing
because the neurons in our brain are "learning."
1.7.3 The Architecture of an artificial neural network:
To understand the architecture of an artificial neural network, we first have
to understand what a neural network consists of. A neural network
consists of a large number of artificial neurons, termed units, arranged in
a sequence of layers. Let us look at the various types of layers available in an artificial
neural network.

Figure 1.0.2: Layered Representation of an ANN

Artificial Neural Network primarily consists of three layers:


a. Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
b. Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the
calculations to find hidden features and patterns.
c. Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes input, computes the weighted sum of the inputs, and
includes a bias. This computation is represented in the form of a transfer function.

The weighted total is then passed as an input to an activation function to produce the
output. Activation functions choose whether a node should fire or not. Only those nodes
that fire make it to the output layer. There are distinctive activation functions available
that can be applied depending on the sort of task we are performing.

 Advantages of Artificial Neural Network (ANN)

I. Parallel processing capability:
Artificial neural networks have a numerical value that can perform more than one task
simultaneously.
II. Storing data on the entire network:
Unlike traditional programming, where data is stored in a database, the data used by an
ANN is stored on the whole network. The disappearance of a couple of pieces of data
in one place doesn't prevent the network from working.
III. Capability to work with incomplete knowledge:
After training, an ANN may produce output even with inadequate
data. The loss of performance here relies upon the significance of the missing data.
IV. Having a memory distribution:
For an ANN to be able to adapt, it is important to determine the examples and to
train the network according to the desired output by demonstrating these
examples to the network. The success of the network is directly proportional to
the chosen instances, and if the event can't be shown to the network in all its aspects,
it can produce false output.
V. Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating output,
and this feature makes the network fault-tolerant.
 Disadvantages of Artificial Neural Network:
I. Assurance of proper network structure:
There is no particular guideline for determining the structure of artificial neural
networks. The appropriate network structure is accomplished through experience,
trial, and error.
II. Unrecognized behavior of the network:
It is the most significant issue with ANNs. When an ANN produces a solution, it does
not provide insight concerning why and how, which decreases trust in the network.
III. Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per their
structure. Therefore, the realization of the network is hardware-dependent.
IV. Difficulty of showing the issue to the network:
ANNs can work only with numerical data. Problems must be converted into numerical
values before being introduced to the ANN. The representation mechanism chosen
here will directly impact the performance of the network; it relies on the user's abilities.
V. The duration of the network is unknown:
The network is reduced to a specific value of the error, and this value does not give us
optimum results.

1.7.4 How do artificial neural networks work?

Figure 1.0.3: ANN Working

An artificial neural network can best be represented as a weighted directed graph, where the
artificial neurons form the nodes. The associations between the neurons' outputs and neuron
inputs can be viewed as directed edges with weights. The artificial neural network
receives the input signal from an external source in the form of a pattern or image in the
form of a vector. These inputs are then mathematically denoted by the notation x(n) for
every n-th input.
Afterward, each input is multiplied by its corresponding weight (these weights
are the details utilized by the artificial neural network to solve a specific problem). In
general terms, these weights represent the strength of the interconnection
between neurons inside the artificial neural network. All the weighted inputs are
summed inside the computing unit.
If the weighted sum is equal to zero, a bias is added to make the output non-zero, or
otherwise to scale up the system's response. The bias has a fixed input of 1 with
its own weight. Here the total of weighted inputs can be in the range of 0 to
positive infinity. To keep the response within the limits of the desired value, a certain
maximum value is benchmarked, and the total of weighted inputs is passed through
the activation function.
The activation function refers to the set of transfer functions used to achieve the
desired output. There are different kinds of activation functions, primarily either
linear or non-linear sets of functions. Some of the commonly used activation
functions are the binary, linear, and tan hyperbolic sigmoidal activation functions.
Let us take a look at each of them in detail:
Binary:

In a binary activation function, the output is either a one or a zero. To accomplish this,
a threshold value is set up. If the net weighted input of the neuron is more than the
threshold, then the final output of the activation function is returned as one; otherwise
the output is returned as zero.
Sigmoidal Hyperbolic:

The Sigmoidal Hyperbola function is generally seen as an "S" shaped curve. Here the
tan hyperbolic function is used to approximate output from the actual net input. The
function is defined as:
F(x) = 1 / (1 + exp(-βx))

Where β is considered the steepness parameter.
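
The computation just described can be sketched in a few lines of NumPy: a weighted sum
of the inputs plus a bias, passed through a binary-threshold activation and the sigmoidal
activation defined above. The input values, weights, bias, threshold, and steepness value
are illustrative, not taken from the text.

# One artificial neuron: weighted sum + bias, then an activation function.
import numpy as np

x = np.array([0.5, 0.2, 0.8])    # inputs x(n)
w = np.array([0.4, 0.7, 0.1])    # connection weights
b = 0.3                          # bias: fixed input of 1 with its own weight

net = np.dot(x, w) + b           # weighted sum of inputs plus bias

def binary(net, threshold=0.5):
    # Binary activation: fire (1) only if the net input exceeds the threshold.
    return 1 if net > threshold else 0

def sigmoid(net, beta=1.0):
    # Sigmoidal activation F(x) = 1 / (1 + exp(-beta * x)).
    return 1.0 / (1.0 + np.exp(-beta * net))

print("net input:", net)         # 0.5*0.4 + 0.2*0.7 + 0.8*0.1 + 0.3 = 0.72
print("binary output:", binary(net))
print("sigmoid output:", sigmoid(net))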


Types of Artificial Neural Network:
There are various types of Artificial Neural Networks (ANN). Depending upon how
human brain neurons and networks function, an artificial neural network performs
tasks in a similar way. The majority of artificial neural networks have some similarities
with their more complex biological counterpart and are very effective at their expected
tasks, for example, segmentation or classification.
 Feedback ANN:

In this type of ANN, the output returns into the network to accomplish the best-
evolved results internally. Feedback networks feed information back into themselves and
are well suited to solving optimization problems. Internal system error corrections utilize
feedback ANNs.
 Feed-Forward ANN:

A feed-forward network is a basic neural network comprising an input layer, an

output layer, and at least one hidden layer of neurons. By assessing its output through
review of its input, the strength of the network can be judged based on the group
behavior of the associated neurons, and the output is decided. The primary advantage
of this network is that it figures out how to evaluate and recognize input patterns.

Lecture: 5

1.8 Clustering
Clustering or cluster analysis is a machine learning technique, which groups the unlabeled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a
group that has less or no similarities with another group."
It does it by finding some similar patterns in the unlabeled dataset such as shape, size,
color, behavior, etc., and divides them as per the presence and absence of those similar
patterns.
It is an unsupervised learning method, hence no supervision is provided to the
algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is provided with a
cluster-ID. ML system can use this id to simplify the processing of large and complex
datasets.
The clustering technique is commonly used for statistical data analysis.
Note: Clustering is somewhere similar to the classification algorithm, but the difference
is the type of dataset that we are using. In classification, we work with the labeled data
set, whereas in clustering, we work with the unlabeled dataset.
Example: Let's understand the clustering technique with the real-world example of
Mall: When we visit any shopping mall, we can observe that the things with similar
usage are grouped together. Such as the t-shirts are grouped in one section, and trousers
are at other sections, similarly, at vegetable sections, apples, bananas, Mangoes, etc.,
are grouped in separate sections, so that we can easily find out the things. The
clustering technique also works in the same way. Other examples of clustering are
grouping documents according to the topic.

The clustering technique can be widely used in various tasks. Some most common uses
of this technique are:
I. Market Segmentation
II. Statistical data analysis
III. Social network analysis
IV. Image segmentation
V. Anomaly detection, etc.

The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.

Figure 1.0.1: Clustering Algorithm

1.8.1 Types of Clustering Methods

The clustering methods are broadly divided into hard clustering (each data point belongs to
only one group) and soft clustering (a data point can belong to more than one group). But
various other approaches to clustering also exist. Below are the main clustering
methods used in machine learning:
I. Partitioning Clustering
II. Density-Based Clustering
III. Distribution Model-Based Clustering
IV. Hierarchical Clustering
V. Fuzzy Clustering

i. Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is the
K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the
number of pre-defined groups. The cluster center is created in such a way that the distance
between the data points of one cluster and its centroid is minimal compared to other cluster centroids.

Figure 1.0.2: Partitioning Clustering

ii. Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped distributions are formed as long as the dense regions can be connected.
The algorithm does this by identifying different clusters in the dataset and connecting the
areas of high density into clusters. The dense areas in the data space are separated from
each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.

Figure 1.0.3: DB- Clustering

iii. Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is done
by assuming some distribution, most commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that
uses Gaussian Mixture Models (GMM).

Figure 1.0.4: DMB- Clustering

iv. Hierarchical Clustering


Hierarchical clustering can be used as an alternative to partitioning clustering, as there
is no requirement to pre-specify the number of clusters to be created. In this technique,
the dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations, or any number of clusters, can be selected by cutting
the tree at the correct level. The most common example of this method is the
Agglomerative Hierarchical algorithm.

Figure 1.0.5: Hierarchical Clustering

v. Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more
than one group or cluster. Each data point has a set of membership coefficients, which
depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an
example of this type of clustering; it is sometimes also known as the Fuzzy k-means
algorithm.
1.8.2 Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. There are
many types of clustering algorithms published, but only a few are commonly used. The
choice of clustering algorithm depends on the kind of data we are using. For example,
some algorithms need a guess for the number of clusters in the given dataset, whereas
others need to find the minimum distance between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in
machine learning:
i. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast with fewer
computations required, with the linear complexity of O(n).
ii. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model, that works on updating
the candidates for centroid to be the center of the points within a given region.
iii. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications
with Noise. It is an example of a density-based model similar to the mean-shift, but with
some remarkable advantages. In this algorithm, the areas of high density are separated by
the areas of low density. Because of this, the clusters can be found in any arbitrary shape.
iv. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative to the k-means algorithm, or for those cases where K-means can fail.
In GMM, it is assumed that the data points are Gaussian distributed.
v. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can be
represented as a tree-structure.
vi. Affinity Propagation: It is different from other clustering algorithms as it does not require
the number of clusters to be specified. In this algorithm, each data point sends messages
between pairs of data points until convergence. It has O(N²T) time complexity, which is
the main drawback of this algorithm.
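
As a minimal illustration of the K-Means algorithm from the list above, the scikit-learn
sketch below clusters synthetic blob data; the dataset and the choice of k = 3 are
illustrative assumptions.

# K-Means on synthetic data: k must be specified in advance.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # a cluster-ID is assigned to each point

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First 10 cluster IDs:", labels[:10])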

1.8.3 Applications of Clustering


Below are some commonly known applications of clustering technique in Machine
Learning:

i. In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.
ii. In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
iii. Customer Segmentation: It is used in market research to segment the customers based
on their choice and preferences.
iv. In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
v. In Land Use: The clustering technique is used to identify areas of similar land use in
the GIS database. This is very useful for determining the purpose for which a particular
piece of land is most suitable.

Lecture: 6

1.9 Decision Tree Classification Algorithm


 Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
 The decisions or the test are performed on the basis of features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
 A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Figure 1.0.1: Decision Tree

1.9.1 Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it
is easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
1.9.2 Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

1.9.3 How does the Decision Tree algorithm Work?


In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and jumps
to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. So, to solve this problem, the decision tree starts
with the root node (Salary attribute by ASM). The root node splits further into the next
decision node (distance from the office) and one leaf node based on the corresponding
labels. The next decision node further gets split into one decision node (Cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:

Figure 1.0.2: Example of Decision Tree

1.9.4 Attribute Selection Measures


While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. With this measurement,
we can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
i. Information Gain
ii. Gini Index

I. Information Gain:

Information gain is the measurement of changes in entropy after the segmentation of a


dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision
tree.
A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated
using the below formula:

Information Gain = Entropy(S) − [(Weighted Avg.) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where:
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
II. Gini Index:

Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini
index.
It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
Gini index can be calculated using the below formula:

Gini Index = 1 − Σⱼ (Pⱼ)²
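
The formulas above can be checked with a small plain-Python sketch that computes
entropy, the Gini index, and information gain for a binary (yes/no) target; the toy label
counts are illustrative.

# Entropy, Gini index, and information gain for a binary target.
import math

def entropy(labels):
    # Entropy(S) = -P(yes)*log2 P(yes) - P(no)*log2 P(no)
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gini(labels):
    # Gini Index = 1 - sum_j (P_j)^2
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def information_gain(parent, subsets):
    # IG = Entropy(S) - weighted average entropy of the subsets.
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5                      # 9 yes, 5 no
split = [["yes"] * 6 + ["no"] * 2,                     # a candidate split
         ["yes"] * 3 + ["no"] * 3]
print("Entropy(S):      ", round(entropy(parent), 3))  # ~0.940
print("Gini(S):         ", round(gini(parent), 3))     # ~0.459
print("Information gain:", round(information_gain(parent, split), 3))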

1.9.5 Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture
all the important features of the dataset. A technique that decreases the size of the
learning tree without reducing accuracy is known as pruning. There are mainly two types
of tree pruning technology used:
I. Cost Complexity Pruning
II. Reduced Error Pruning.
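
As a hedged illustration of cost complexity pruning, scikit-learn's DecisionTreeClassifier
exposes a ccp_alpha parameter: larger values prune more of the tree. The dataset and the
alpha value below are illustrative choices, not part of the course text.

# Cost complexity pruning: compare leaf counts before and after pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print("Unpruned leaves:", unpruned.get_n_leaves())  # larger tree
print("Pruned leaves:  ", pruned.get_n_leaves())    # smaller tree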

 Advantages of the Decision Tree

 It is simple to understand as it follows the same process which a human follows while
making any decision in real life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.

 Disadvantages of the Decision Tree

 The decision tree contains lots of layers, which makes it complex.


 It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
 For more class labels, the computational complexity of the decision tree may increase.

1.10 Bayesian Belief Network in artificial intelligence
A Bayesian belief network is a key computer technology for dealing with probabilistic
events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian
model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and
it consists of two parts:
 Directed Acyclic Graph
 Table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:

Figure 1.0.1: Directed Acyclic Graph

 Each node corresponds to the random variables, and a variable can be continuous or
discrete.
 Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables. These directed links or arrows connect the pair of nodes in
the graph.
These links represent that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.
 In the above diagram, A, B, C, and D are random variables represented by the nodes of the
network graph.
 If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
 Node C is independent of node A.

Note: A Bayesian network graph does not contain any cycles. Hence, it is
known as a directed acyclic graph or DAG.
The Bayesian network has mainly two components:

I. Causal Component
II. Actual numbers

Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
A Bayesian network is based on joint probability distribution and conditional
probability. So let's first understand the joint probability distribution:

1.10.1 Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probabilities of the different
combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written as follows in terms of the joint
probability distribution:
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
In general, for each variable Xi, we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))

1.10.2 Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed
acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
reliably responds to a burglary but also responds to minor earthquakes. Harry
has two neighbors, David and Sophia, who have taken the responsibility to inform Harry at
work when they hear the alarm. David always calls Harry when he hears the alarm, but
sometimes he gets confused with the phone ringing and calls then too. On the other
hand, Sophia likes to listen to loud music, so sometimes she misses the alarm.
Here we would like to compute the probability of Burglary Alarm.
 Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and David and Sophia have both called Harry.
 Solution:
The Bayesian network for the above problem is given below. The network structure
shows that burglary and earthquake are the parent nodes of the alarm and directly
affect the probability of the alarm going off, while David's and Sophia's calls depend only
on the alarm probability.
The network represents our assumptions: the neighbors do not directly perceive the
burglary, do not notice the minor earthquake, and do not confer before calling.
The conditional distributions for each node are given as conditional probabilities table
or CPT.
Each row in the CPT must sum to 1 because all the entries in the table represent an
exhaustive set of cases for the variable.
In a CPT, a boolean variable with k boolean parents contains 2^k rows of probabilities.
Hence, if there are two parents, then the CPT will contain 4 probability values.
 List of all events occurring in this network:

 Burglary (B)
 Earthquake(E)
 Alarm(A)
 David Calls(D)
 Sophia calls(S)

We can write the events of the problem statement in the form of probability P[D, S, A,
B, E], and rewrite this probability statement using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]

= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A, B, E] · P[A, B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]

= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]

Figure 1.0.2: Directed Acyclic Graph

Let's take the observed probabilities for the burglary and earthquake components:
P(B = True) = 0.002, the probability of a burglary.
P(B = False) = 0.998, the probability of no burglary.
P(E = True) = 0.001, the probability of a minor earthquake.
P(E = False) = 0.999, the probability that an earthquake did not occur.
We can provide the conditional probabilities as per the tables below:
 Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and earthquake:

Table 1.2: CPT for Alarm A

B          E          P(A=True)     P(A=False)

True       True       0.94          0.06

True       False      0.95          0.05

False      True       0.31          0.69

False      False      0.001         0.999

 Conditional probability table for David Calls:

The conditional probability that David will call depends on the probability of the
Alarm.

Table 1.3: CPT for David Calls

A          P(D=True)     P(D=False)

True       0.91          0.09

False      0.05          0.95

 Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node "Alarm."

Table 1.4: CPT for Sophia Calls

A          P(S=True)     P(S=False)

True       0.75          0.25

False      0.02          0.98

From the formula of joint distribution, we can write the problem statement in the
form of probability distribution:
P(S, D, A, ¬B, ¬E) = P(S | A) · P(D | A) · P(A | ¬B ∧ ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
Hence, a Bayesian network can answer any query about the domain by using Joint
distribution.
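
The calculation above can be reproduced with a small Python sketch that encodes the
CPTs from the tables as dictionaries and multiplies out the joint probability; the values
follow the tables in the text.

# Joint probability P(S, D, A, not B, not E) from the CPTs above.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,    # P(A=True | B, E)
       (False, True): 0.31, (False, False): 0.001}
P_D = {True: 0.91, False: 0.05}                    # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                    # P(S=True | A)

joint = (P_S[True] * P_D[True] * P_A[(False, False)]
         * P_B[False] * P_E[False])
print(round(joint, 8))                             # ~0.00068045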
1.10.3 The semantics of Bayesian Network:
There are two ways to understand the semantics of the Bayesian network, which is
given below:
1. To understand the network as the representation of the Joint probability
distribution.
It is helpful to understand how to construct the network.
2. To understand the network as an encoding of a collection of conditional
independence statements.
It is helpful in designing inference procedure.

Lecture: 7

1.11 Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the below diagram in which there are two different
categories that are classified using a decision boundary or hyperplane:

Figure 1.0.1: SVM

Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we want
a model that can accurately identify whether it is a cat or dog, so such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn about different features of cats and dogs, and then
we test it with this strange creature. The SVM creates a decision boundary between
these two classes (cat and dog) and chooses extreme cases (support vectors), so it will
consider the extreme cases of cats and dogs. On the basis of the support vectors, it will
classify the new animal as a cat. Consider the below diagram:

Figure 1.0.2: Prediction using SVM

SVM algorithm can be used for Face detection, image classification, text categorization,
etc.

1.11.1 Types of SVM


SVM can be of two types:
1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line, then such data
is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

1.11.2 Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset, which
means that if there are 2 features (as shown in the image), then the hyperplane will be a
straight line. And if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum
distance between the data points of the two classes.
 Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the
position of the hyperplane are termed as Support Vector. Since these vectors support the
hyperplane, hence called a Support vector.

1.11.3 How does SVM works?


 Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green
or blue. Consider the below image:

Figure 1.0.3: Linear SVM

So as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the below
image:

Figure 1.0.4: Hyperplane

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called as margin. And the goal of SVM is
to maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.

Figure 1.0.5: Construction of Hyperplane

 Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

Figure 1.0.6: Non-Linear SVM

So, to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

Figure 1.0.7: Adding Dimension

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Figure 1.0.8: Drawing Hyperplane

Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we
convert it to 2-D space with z = 1, then it will become:

Figure 1.0.9: Best Hyperplane

Hence, we get a circumference of radius 1 in case of non-linear data.
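
As a minimal scikit-learn sketch of the idea above, the code below contrasts a linear SVM
with an RBF-kernel SVM on data arranged in concentric circles, where no single straight
line can separate the classes; the kernel plays the role of the added dimension
z = x² + y². The dataset parameters are illustrative.

# Linear vs. kernel SVM on concentric-circle data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print("Linear SVM accuracy:", linear_svm.score(X, y))  # poor: no separating line
print("RBF SVM accuracy:   ", rbf_svm.score(X, y))     # near 1.0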

1.12 Genetic Algorithm in Machine Learning


A genetic algorithm is an adaptive heuristic search algorithm inspired by "Darwin's theory
of evolution in Nature." It is used to solve optimization problems in machine learning. It
is one of the important algorithms as it helps solve complex problems that would take
a long time to solve.
Genetic Algorithms are being widely used in different real-world applications, for
example, Designing electronic circuits, code-breaking, image processing, and artificial
creativity.
In this topic, we will explain Genetic algorithm in detail, including basic terminologies
used in Genetic algorithm, how it works, advantages and limitations of genetic
algorithm, etc.
Before understanding the genetic algorithm, let's first
understand the basic terminology used to describe it:
 Population: Population is the subset of all possible or probable solutions, which can
solve the given problem.
 Chromosomes: A chromosome is one of the solutions in the population for the
given problem, and a collection of genes generates a chromosome.
 Gene: A gene is an element of the chromosome; a chromosome is divided into
different genes.
 Allele: Allele is the value provided to the gene within a particular chromosome.
 Fitness Function: The fitness function is used to determine the individual's fitness
level in the population. It means the ability of an individual to compete with other
individuals. In every iteration, individuals are evaluated based on their fitness
function.
 Genetic Operators: In a genetic algorithm, the best individual mate to regenerate
offspring better than parents. Here genetic operators play a role in changing the
genetic composition of the next generation.

1.12.1 Selection
After calculating the fitness of every individual in the population, a selection process is used
to determine which of the individuals in the population will get to reproduce and
produce the offspring that will form the next generation.
Types of selection methods available:
i. Roulette wheel selection
ii. Tournament selection
iii. Rank-based selection
So, now we can define a genetic algorithm as a heuristic search algorithm to solve
optimization problems. It is a subset of evolutionary algorithms, which is used in
computing. A genetic algorithm uses genetic and natural selection concepts to solve
optimization problems.

1.12.2 How Genetic Algorithm Work?


The genetic algorithm works on the evolutionary generational cycle to generate high-
quality solutions. These algorithms use different operations that either enhance or
replace the population to give an improved fit solution.
It basically involves five phases to solve the complex optimization problems, which are
given as below:
i. Initialization
ii. Fitness Assignment
iii. Selection
iv. Reproduction
v. Termination

I. Initialization

The process of a genetic algorithm starts by generating the set of individuals, which is
called population. Here each individual is the solution for the given problem. An
individual contains or is characterized by a set of parameters called Genes. Genes are
combined into a string and generate chromosomes, which is the solution to the problem.
One of the most popular techniques for initialization is the use of random binary strings.

Figure 1.0.1: Genetic Components

II. Fitness Assignment

The fitness function is used to determine how fit an individual is, i.e., the ability of an
individual to compete with other individuals. In every iteration, individuals are evaluated
based on their fitness function. The fitness function provides a fitness score to each
individual. This score further determines the probability of being selected for
reproduction. The higher the fitness score, the greater the chances of getting selected for
reproduction.
III. Selection

The selection phase involves the selection of individuals for the reproduction of offspring.
All the selected individuals are then arranged in a pair of two to increase reproduction.
Then these individuals transfer their genes to the next generation.
There are three types of Selection methods available, which are:
1. Roulette wheel selection
2. Tournament selection
3. Rank-based selection

IV. Reproduction

After the selection process, the creation of a child occurs in the reproduction step. In
this step, the genetic algorithm uses two variation operators that are applied to the
parent population. The two operators involved in the reproduction phase are given below:
 Crossover: The crossover plays a most significant role in the reproduction phase of the
genetic algorithm. In this process, a crossover point is selected

at random within the genes. Then the crossover operator swaps genetic information of two
parents from the current generation to produce a new individual representing the offspring.

Figure 1.0.2: Crossover

The genes of the parents are exchanged among themselves until the crossover point is met.
These newly generated offspring are added to the population. This process is also called
recombination. Types of crossover styles available:
 One-point crossover
 Two-point crossover
 Uniform crossover

 Mutation: The mutation operator inserts random genes into the offspring (new child) to
maintain the diversity in the population. It can be done by flipping some bits in the
chromosome. Mutation helps in solving the issue of premature convergence and enhances
diversification. The below image shows the mutation process:
Types of mutation styles available,
 Flip bit mutation
 Gaussian mutation
 Exchange/Swap mutation

Figure 1.0.3: Mutation

V. Termination
After the reproduction phase, a stopping criterion is applied as a base for termination.
The algorithm terminates after the threshold fitness solution is reached. It will identify
the final solution as the best solution in the population.
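
The five phases above can be put together in a compact, self-contained Python sketch on a
toy problem: evolving a binary string with as many 1s as possible (the classic "OneMax"
problem). The population size, mutation rate, and generation limit are illustrative
choices; tournament selection and one-point crossover stand in for the selection and
crossover styles listed earlier.

# A toy genetic algorithm: initialization, fitness, selection,
# reproduction (crossover + mutation), and termination.
import random

GENES, POP, GENERATIONS = 20, 30, 50
MUTATION_RATE = 0.01

def fitness(chromosome):                 # fitness assignment
    return sum(chromosome)

def tournament(population):              # tournament selection (size 3)
    return max(random.sample(population, 3), key=fitness)

def crossover(p1, p2):                   # one-point crossover
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(chromosome):                  # flip-bit mutation
    return [1 - g if random.random() < MUTATION_RATE else g
            for g in chromosome]

# Initialization: a population of random binary strings.
population = [[random.randint(0, 1) for _ in range(GENES)]
              for _ in range(POP)]

for generation in range(GENERATIONS):    # termination: generation limit...
    if max(fitness(c) for c in population) == GENES:
        break                            # ...or threshold fitness reached
    population = [mutate(crossover(tournament(population),
                                   tournament(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print("Best solution:", best, "fitness:", fitness(best))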
 General Workflow of a Simple Genetic Algorithm

Figure 1.0.4: Flow Chart

 Advantages of Genetic Algorithm


i. Genetic algorithms have excellent parallel capabilities.
ii. It helps in optimizing various problems such as discrete functions, multi-objective
problems, and continuous functions.
iii. It provides a solution to a problem that improves over time.
iv. A genetic algorithm does not need derivative information.

 Limitations of Genetic Algorithms

i. Genetic algorithms are not efficient algorithms for solving simple problems.
ii. It does not guarantee the quality of the final solution to a problem.
iii. Repetitive calculation of fitness values may generate some computational challenges.
Difference between Genetic Algorithms and Traditional Algorithms

i. A search space is the set of all possible solutions to the problem. In the traditional
algorithm, only one set of solutions is maintained, whereas, in a genetic algorithm, several
sets of solutions in search space can be used.
ii. Traditional algorithms need more information in order to perform a search, whereas genetic
algorithms need only one objective function to calculate the fitness of an individual.
iii. Traditional algorithms cannot work in parallel, whereas genetic algorithms can (the
fitness calculations of the individuals are independent of one another).
iv. One big difference is that rather than operating directly on candidate solutions,
genetic algorithms operate on their representations (or encodings), frequently
referred to as chromosomes.
v. In other words, one of the big differences between a traditional algorithm and a genetic
algorithm is that the latter does not directly operate on candidate solutions.
vi. Traditional Algorithms can only generate one result in the end, whereas Genetic
Algorithms can generate multiple optimal results from different generations.
vii. A traditional algorithm is not very likely to generate optimal results, whereas genetic
algorithms, while not guaranteeing a globally optimal result, offer a great possibility
of obtaining an optimal result for a problem because they use genetic operators such as
crossover and mutation.
viii. Traditional algorithms are deterministic in nature, whereas Genetic algorithms are
probabilistic and stochastic in nature.

Lecture: 8

1.13 Issues in Machine Learning


"Machine Learning" is one of the most popular technology among all data scientists and
machine learning enthusiasts. It is the most effective Artificial Intelligence technology that
helps create automated learning systems to take future decisions without being constantly
programmed. It can be considered an algorithm that automatically constructs various
computer software using past experience and training data. It can be seen in every
industry, such as healthcare, education, finance, automobile, marketing, shipping,
infrastructure, automation, etc. Almost all big companies like Amazon, Facebook, Google,
Adobe, etc., are using various machine learning techniques to grow their businesses. But
everything in this world has bright as well as dark sides. Similarly, Machine Learning
offers great opportunities, but some issues need to be solved.
This article will discuss some major practical issues and their business implementation,
and how we can overcome them. So let's start with a quick introduction to Machine
Learning.
1.13.1 Common issues in Machine Learning
Although machine learning is being used in every industry and helps organizations make
more informed and data-driven choices that are more effective than classical
methodologies, it still has many problems that cannot be ignored. Here are some
common issues in Machine Learning that professionals face while building ML skills
and creating an application from scratch.
i. Inadequate Training Data
The major issue that arises while using machine learning algorithms is the lack of
quality as well as quantity of data. Although data plays a vital role in the processing of
machine learning algorithms, many data scientists claim that inadequate data, noisy
data, and unclean data severely hamper machine learning algorithms. For example, a simple
task may require thousands of sample data points, while an advanced task such as speech
or image recognition may need millions of sample data examples. Further, data quality
is also important for the algorithms to work ideally, yet a lack of data quality is
often found in Machine Learning applications. Data quality can be affected by factors
such as the following:

o Noisy data - responsible for inaccurate predictions that affect the decision as well
as the accuracy of classification tasks.
o Incorrect data - also responsible for faulty programming and faulty results obtained
from machine learning models; hence, incorrect data may also affect the accuracy of the
results.
o Generalizing output data - sometimes generalizing output data becomes complex,
which results in comparatively poor future actions.

ii. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it
must be of good quality as well. Noisy data, incomplete data, inaccurate data, and unclean
data lead to less accuracy in classification and low-quality results. Hence, data quality
can also be considered as a major common problem while processing machine learning
algorithms.
iii. Non-representative training data

To make sure our training model generalizes well, we have to ensure that the sample
training data is representative of the new cases to which we need to generalize. The
training data must cover all cases that have already occurred as well as those that may occur.
Further, if we use non-representative training data in the model, it results in less
accurate predictions. A machine learning model is said to be ideal if it predicts well for
generalized cases and provides accurate decisions. If there is too little training data, there
will be sampling noise in the model; this is called a non-representative training set, it
won't be accurate in its predictions, and the model will be biased toward one class or
a group.
Hence, we should use representative data in training to protect against bias and to
make accurate predictions without any drift.

iv. Overfitting and Underfitting

 Overfitting:
Overfitting is one of the most common issues faced by Machine Learning engineers
and data scientists. Whenever a machine learning model is trained with a huge amount
of data, it starts capturing noise and inaccurate values from the training data set, which
negatively affects the performance of the model. Let's understand this with a simple
example where we have a few training data sets such as 1000 mangoes, 1000 apples,
1000 bananas, and 5000 papayas. Then there is a considerable probability of identifying
an apple as a papaya, because we have a massive amount of biased data in the training
data set; hence the prediction gets negatively affected. The main reason behind overfitting,
on this view, is using non-linear methods in machine learning algorithms, as they build
non-realistic data models. We can overcome overfitting by using linear and parametric
algorithms in the machine learning models.
Methods to reduce overfitting:

 Increase training data in a dataset.


 Reduce model complexity by simplifying the model, i.e., selecting one with fewer
parameters
 Ridge Regularization and Lasso Regularization (illustrated in the sketch after this list)
 Early stopping during the training phase
 Reduce the noise
 Reduce the number of attributes in training data.
 Constraining the model.
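As a brief, hedged illustration of the regularization and model-simplicity remedies above (assuming scikit-learn is available; the synthetic dataset and alpha values are arbitrary choices, not prescribed by the syllabus):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Small, noisy dataset where an unregularized model tends to overfit.
X, y = make_regression(n_samples=60, n_features=40, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.5)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "train:", round(model.score(X_tr, y_tr), 3),
          "test:", round(model.score(X_te, y_te), 3))

The regularized models typically score a little worse on the training split but generalize better to the test split, which is exactly the overfitting trade-off described above.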
 Underfitting:
Underfitting is just the opposite of overfitting. Whenever a machine learning model is
trained with too little data, it produces incomplete and inaccurate results and destroys
the accuracy of the machine learning model.
Underfitting occurs when our model is too simple to capture the underlying structure of the
data, just like an undersized pant. This generally happens when we have limited data
in the data set and we try to build a linear model with non-linear data. In such scenarios,
the rules of the machine learning model become too simple to be applied to the data set,
and the model starts making wrong predictions as well.
Methods to reduce Underfitting:

 Increase model complexity


 Remove noise from the data
 Trained on increased and better features
 Reduce the constraints
 Increase the number of epochs to get better results.

v. Monitoring and maintenance


As we know, generalized output data is mandatory for any machine learning model;
hence, regular monitoring and maintenance become compulsory. Different
results for different actions require changes to the data; hence editing of code, as well as
the resources for monitoring it, also becomes necessary.
vi. Getting bad recommendations
A machine learning model operates within a specific context, which can result in bad
recommendations and concept drift in the model. Let's understand this with an example:
at a specific time a customer is looking for some gadgets, but the customer's requirements
change over time while the machine learning model keeps showing the same recommendations,
even though the customer's expectations have changed. This phenomenon is called
Data Drift. It generally occurs when new data is introduced or the interpretation of the
data changes.

58
However, we can overcome this by regularly updating and monitoring data according
to the expectations.

vii. Lack of skilled resources


Although Machine Learning and Artificial Intelligence are continuously growing in the
market, these industries are still younger than others. The absence of skilled
manpower is also an issue. Hence, we need manpower with in-depth knowledge of
mathematics, science, and technology for developing and managing the scientific
substance of machine learning.
viii. Customer Segmentation
Customer segmentation is also an important issue while developing a machine learning
algorithm: it is difficult to identify the customers who act on the recommendations shown
by the model and those who do not even check them. Hence, an algorithm is necessary to
recognize customer behaviour and trigger relevant recommendations for the user based on
past experience.
ix. Process Complexity of Machine Learning
The machine learning process is very complex, which is another major issue faced by
machine learning engineers and data scientists. Machine Learning and Artificial
Intelligence are very new technologies, still in an experimental phase and
continuously changing over time. The majority of the work involves hit-and-trial
experiments; hence the probability of error is higher than expected. Further, the process
also includes analyzing the data, removing data bias, training the data, applying complex
mathematical calculations, etc., making the procedure more complicated and quite tedious.
x. Data Bias
Data bias is also a big challenge in Machine Learning. Bias errors exist when
certain elements of the dataset are heavily weighted or given more importance than others.
Biased data leads to inaccurate results, skewed outcomes, and other analytical errors.
However, we can resolve this error by determining where the data is actually biased in the
dataset and then taking the necessary steps to reduce the bias.
1.13.2 Methods to remove Data Bias:
o Research more for customer segmentation.
o Be aware of your general use cases and potential outliers.
o Combine inputs from multiple sources to ensure data diversity.
o Include bias testing in the development process.

o Analyze data regularly and keep tracking errors to resolve them easily.
o Review the collected and annotated data.
o Use multi-pass annotation such as sentiment analysis, content moderation, and
intent recognition.
xi. Lack of Explainability
This basically means that the outputs cannot be easily comprehended, as the model is
programmed in specific ways to deliver outputs for certain conditions. Hence, a lack of
explainability is also found in machine learning algorithms, which reduces the
credibility of the algorithms.

xii. Slow implementations and results


This issue is also very commonly seen in machine learning models. Machine
learning models can be highly efficient in producing accurate results, but they are
time-consuming. Slow programs, excessive requirements, and overloaded data take more
time than expected to provide accurate results. This demands continuous maintenance and
monitoring of the model to deliver accurate results.
xiii. Irrelevant features
Although machine learning models are intended to give the best possible outcome, if we
feed garbage data as input, then the result will also be garbage. Hence, we should use
relevant features in our training sample. A machine learning model is said to be good if
the training data has a good set of features, with few to no irrelevant features.

1.14 Difference between Data Science and Machine Learning:

Table 1.5: Difference between Data Science and Machine Learning

Parameter | Data Science | Machine Learning
Definition | A multidisciplinary field focused on extracting knowledge and insights from data. | A subset of AI and data science focusing on building systems that learn from data and improve from experience.
Objective | To analyze and interpret complex data to aid decision-making and strategic planning. | To develop algorithms that can learn from and make predictions or decisions based on data.
Scope | Broader, encompassing various techniques for data analysis, including machine learning. | More focused, primarily on developing and tuning algorithms that can learn and make predictions.
Tools and Technologies | Python, R, SQL, Tableau, Hadoop, etc. | Python, R, TensorFlow, Scikit-Learn, PyTorch, etc.
Processes Involved | Data cleaning, data analysis, data visualization, and interpretation. | Data preprocessing, model training, model testing, and model deployment.
Applications | Market analysis, data reporting, business analytics, predictive modeling. | Predictive analytics, speech recognition, recommendation systems, self-driving cars.
Skills Required | Statistical analysis, data visualization, big data platforms, domain-specific knowledge. | Deep understanding of algorithms, neural networks, statistical modeling, and natural language processing.
End Goal | To extract insights and knowledge from data in various formats. | To enable machines to learn from data so they can provide accurate predictions and decisions.
Career Path | Data Analyst, Data Scientist, Data Engineer, Business Analyst. | Machine Learning Engineer, AI Engineer, Research Scientist, Data Scientist.

1.15 Important Question (Previous Year Questions)

Q1: Explain briefly “History of Machine Learning”.


Q2: Write down the differences between Machine Learning and Data Science.
Q3: Describe how to design a learning system with examples?
Q4: Explain the concept of Machine Learning. Define the term learning. What are the types of
Learning?
Q5: Compare classification and clustering in machine learning along with suitable real-life
applications.
Q6: What is a “Well-Posed Learning Problem”?
Q7: Explain reinforcement learning with a suitable example.
Q8: Differentiate data science and machine learning.
Q9: Explain the issues related with machine learning.
Q10: Discuss Supervised and Unsupervised Learning.
Q11: Write short note on “Well defined Learning System” with examples.
Q12: Describe well defined Learning Problem role’s in Machine Learning.
Q13: What are the Advantages, Disadvantages and Applications of Machine Learning?
Q14: Write short note on “Artificial Neural Networks”.
Q15: Write short note on “Clustering” with its applications.
Q16: Differentiate between Clustering and Classification.
Q17: What are the various Clustering Techniques?
Q18: Explain Decision Trees with advantages and Disadvantages.
Q19: Write short note on “Support Vector Machine”.
Q20: What are the classes of problems in Machine Learning?

UNIT 2- Regression, Bayesian Network, SVM

2. REGRESSION, BAYESIAN LEARNING AND SUPPORT VECTOR MACHINE

REGRESSION

WHY    a. To understand the linear regression and logistic regression machine learning
approaches for separating data categorically and numerically.

WHAT   a. Practice with the data of various problems and analyse the needs of learning
algorithms.
       b. Implement various machine learning methods on various problems.

WHERE  a. Used to analyse and evaluate the data in machine learning problems.

Lecture: 9

2.1 Linear Regression


Linear Regression is one of the simplest machine learning algorithms; it comes under
the Supervised Learning technique and is used for solving regression problems.
It is used for predicting a continuous dependent variable with the help of independent
variables.
The goal of linear regression is to find the best-fit line that can accurately predict the
output for the continuous dependent variable.
If a single independent variable is used for prediction, it is called Simple Linear
Regression, and if more than one independent variable is used, such regression is
called Multiple Linear Regression.
By finding the best-fit line, the algorithm establishes the relationship between the dependent
variable and the independent variables, and the relationship should be of a linear nature.
The output of linear regression should only be continuous values such as price, age,
salary, etc. The relationship between the dependent variable and the independent variable
can be shown in the image below:

Figure 2.1: Linear Regression

In the above image the dependent variable is on the Y-axis (salary) and the independent
variable is on the X-axis (experience). The regression line can be written as:

y = a0 + a1x + ε

where a0 and a1 are the coefficients and ε is the error term.
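A minimal fitting sketch with scikit-learn follows; the experience/salary numbers are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (in thousands).
X = np.array([[1], [2], [3], [4], [5]])      # independent variable
y = np.array([30, 35, 42, 48, 55])           # continuous dependent variable

model = LinearRegression().fit(X, y)
print("a1 (slope):", model.coef_[0])
print("a0 (intercept):", model.intercept_)
print("Predicted salary for 6 years:", model.predict([[6]])[0])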

2.2 Logistic Regression:
Logistic regression is one of the most popular machine learning algorithms; it comes
under the Supervised Learning technique.
It can be used for classification as well as for regression problems, but it is mainly used
for classification problems.
Logistic regression is used to predict a categorical dependent variable with the help
of independent variables.
The output of a logistic regression problem can only be between 0 and 1.
Logistic regression can be used where probabilities between two classes are
required, such as whether it will rain today or not: either 0 or 1, true or false, etc.
Logistic regression is based on the concept of Maximum Likelihood Estimation.
According to this estimation, the observed data should be most probable.
In logistic regression, we pass the weighted sum of inputs through an activation function
that can map values in between 0 and 1. Such activation function is known as sigmoid
function and the curve obtained is called as sigmoid curve or S-curve. Consider the below
image:

Figure 2.2: Logistic Regression

o The equation for logistic regression is:

log[y / (1 - y)] = a0 + a1x1 + a2x2 + ... + anxn
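A minimal classification sketch with scikit-learn follows; the hours-studied versus pass/fail numbers are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0).
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("P(pass | 2.5 hours):", clf.predict_proba([[2.5]])[0][1])  # value in (0, 1)
print("Predicted class:", clf.predict([[2.5]])[0])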

2.3 Linear Regression vs Logistic Regression


Linear Regression and Logistic Regression are two famous machine learning
algorithms that come under the supervised learning technique. Since both
algorithms are supervised in nature, they use labeled datasets to make the
predictions. The main difference between them is how they are used: Linear Regression
is used for solving regression problems, whereas Logistic Regression is used for solving
classification problems. The description of both algorithms is given below, along with
a difference table.

Figure 2.4: Linear v/s Logistic Regression

2.3.1 Difference between Linear Regression and Logistic Regression:

Table 2.1: Linear v/s Logistic Regression

Linear Regression | Logistic Regression
Linear regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
Linear regression is used for solving regression problems. | Logistic regression is used for solving classification problems.
In linear regression, we predict the value of continuous variables. | In logistic regression, we predict the values of categorical variables.
In linear regression, we find the best-fit line, by which we can easily predict the output. | In logistic regression, we find the S-curve, by which we can classify the samples.
The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
The output of linear regression must be a continuous value, such as price, age, etc. | The output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
In linear regression, it is required that the relationship between the dependent variable and the independent variable be linear. | In logistic regression, it is not required to have a linear relationship between the dependent and independent variables.
In linear regression, there may be collinearity between the independent variables. | In logistic regression, there should not be collinearity between the independent variables.

Lecture: 10

2.4 Bayes Theorem in Machine learning


Machine Learning is one of the most rapidly emerging technologies of Artificial Intelligence. We
are living in the 21st century, which is completely driven by new technologies and gadgets,
some of which are yet to be used while a few are at their full potential. Similarly, Machine
Learning is also a technology that is still in its developing phase. There are lots of
concepts that make machine learning a better technology, such as supervised learning,
unsupervised learning, reinforcement learning, perceptron models, neural networks, etc.
In this section, "Bayes Theorem in Machine Learning", we will discuss another most
important concept of Machine Learning, i.e., the Bayes theorem. Before starting
this topic, you should build an essential understanding of the theorem: what exactly
the Bayes theorem is, why it is used in Machine Learning, examples of the Bayes theorem in
Machine Learning, and much more. So, let's start with a brief introduction to the Bayes theorem.

2.4.1 Introduction to Bayes Theorem in Machine Learning


Bayes theorem was given by an English statistician, philosopher, and Presbyterian minister
named Thomas Bayes in the 18th century. Bayes presented his thoughts in decision
theory, which is extensively used in important mathematical concepts such as probability. Bayes
theorem is also widely used in Machine Learning, where we need to predict classes
precisely and accurately. An important concept based on Bayes theorem, called the Bayesian
method, is used to calculate conditional probability in Machine Learning applications that
include classification tasks. Further, a simplified version of Bayes theorem (Naïve Bayes
classification) is also used to reduce computation time and the average cost of projects.
Bayes theorem is also known by other names such as Bayes rule or
Bayes law. Bayes theorem helps to determine the probability of an event with uncertain
knowledge. It is used to calculate the probability of one event occurring while another
has already occurred. It is the best method to relate conditional probability and marginal
probability.
In simple words, we can say that Bayes theorem helps contribute more accurate results.
Bayes theorem is used to estimate the precision of values and provides a method for
calculating conditional probability. Although it is apparently a simple calculation,
it is used to easily calculate the conditional probability of events where intuition often
fails. Some data scientists assume that Bayes theorem is most widely used in the
financial industries, but that is not the whole picture: beyond finance, Bayes theorem is also
extensively applied in health and medicine, research and the survey industry, the aeronautical
sector, etc.
 Bayes Theorem
Bayes theorem is one of the most popular machine learning concepts; it helps to
calculate the probability of one event occurring, with uncertain knowledge, while another
has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of
event X with known event Y:
o According to the product rule, we can express the probability of event X with
known event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y)    {equation 1}
o Further, the probability of event Y with known event X is:
P(X ∩ Y) = P(Y|X) P(X)    {equation 2}
Mathematically, Bayes theorem can be expressed by equating the right-hand sides of both
equations. We will get:

P(X|Y) = [P(Y|X) P(X)] / P(Y)

Figure 2.5: Bayes Theorem

Here, X and Y are two events; the formula holds whenever P(Y) is non-zero.
The above equation is called Bayes Rule or Bayes Theorem.
o P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated
probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence given that the
hypothesis is true.
o P(X) is called the prior probability, i.e., the probability of the hypothesis before
considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence
under any consideration.

Hence, Bayes Theorem can be written as:


posterior = likelihood * prior / evidence
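As a quick worked illustration of this formula, here is a tiny Python sketch; the spam-filter numbers are invented purely for the example:

def bayes_posterior(likelihood, prior, evidence):
    # posterior = likelihood * prior / evidence
    return likelihood * prior / evidence

# Made-up numbers: P(spam) = 0.01, P("offer" | spam) = 0.9, P("offer") = 0.05
print(bayes_posterior(0.9, 0.01, 0.05))   # P(spam | "offer") = 0.18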

2.4.2 Prerequisites for Bayes Theorem


While studying the Bayes theorem, we need to understand few important concepts. These
are as follows:
i. Experiment
An experiment is defined as the planned operation carried out under controlled condition
such as tossing a coin, drawing a card and rolling a dice, etc.

ii. Sample Space
During an experiment, what we get as a result is called a possible outcome, and the set of
all possible outcomes of an event is known as the sample space. For example, if we are rolling
a die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording its outcomes,
then the sample space will be:
S2 = {Head, Tail}
iii. Event
An event is defined as a subset of the sample space in an experiment. Further, it is also
called a set of outcomes.

Figure 2.6:Sample Space

Assume in our experiment of rolling a die, there are two events A and B such that:

A = Event when an even number is obtained = {2, 4, 6}

B = Event when a number is greater than 4 = {5, 6}

o Probability of event A: P(A) = Number of favourable outcomes / Total
number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
o Similarly, probability of event B: P(B) = Number of favourable outcomes /
Total number of possible outcomes
P(B) = 2/6 = 1/3 ≈ 0.333
o Union of events A and B:
A∪B = {2, 4, 5, 6}

Figure 2.7: Intersection

o Intersection of events A and B:
A∩B = {6}
o Disjoint Event: If the intersection of the event A and B is an empty set or null then
such events are known as disjoint event or mutually exclusive events also.

Figure 2.8: Disjoint Events

72
iv. Random Variable:
It is a real-valued function which maps between the sample space of an experiment and
the real line. A random variable takes on some random values, with each value
having some probability. However, it is neither random nor a variable; rather, it behaves as a
function which can be discrete, continuous, or a combination of both.
v. Exhaustive Event:
As the name suggests, a set of events where at least one event necessarily occurs at a time
is called an exhaustive set of events of an experiment.
Thus, two events A and B are said to be exhaustive if either A or B definitely occurs at a
time and both are mutually exclusive; for example, while tossing a coin, the result will be
either a Head or a Tail.
vi. Independent Event:
Two events are said to be independent when the occurrence of one event does not affect the
occurrence of the other. In simple words, we can say that the probability of the
outcome of one event does not depend on the other.
Mathematically, two events A and B are said to be independent if:
P(A ∩ B) = P(AB) = P(A) * P(B)
vii. Conditional Probability:
Conditional probability is defined as the probability of an event A, given that another
event B has already occurred (i.e., A given B). This is represented by P(A|B), and we
can define it as:
P(A|B) = P(A ∩ B) / P(B)
viii. Marginal Probability:
Marginal probability is defined as the probability of an event A occurring independent
of any other event B. Further, it is considered as the probability of evidence under any
consideration.
P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)

Figure 2.9: Solution

Here ~B represents the event that B does not occur.

2.4.3 How to apply Bayes Theorem or Bayes rule in Machine Learning?


Bayes theorem helps us to calculate the single term P(B|A) in terms of P(A|B), P(B),
and P(A). This rule is very helpful in scenarios where we have good estimates of
P(A|B), P(B), and P(A) and need to determine the fourth term.
Naïve Bayes classifier is one of the simplest applications of Bayes theorem which is used
in classification algorithms to isolate data as per accuracy, speed and classes.
Let's understand the use of Bayes theorem in machine learning with below example.
Suppose we have a vector A with i attributes, that is,
A = {A1, A2, A3, ..., Ai}
Further, we have n classes represented as C1, C2, C3, ..., Cn.
These two conditions are given to us, and our classifier, built with machine
learning, has to predict the most probable class for A. So, with the help of Bayes
theorem, we can write it as:
P(Ci|A) = [P(A|Ci) * P(Ci)] / P(A)
Here, P(A) is the class-independent entity: it remains constant across all classes, i.e.,
it does not change its value with respect to a change in class. Therefore, to maximize
P(Ci|A), we have to maximize the value of the term P(A|Ci) * P(Ci).
With n classes on the probability list, let's assume that each class is equally likely to
be the right answer. Considering this factor, we can say that:
P(C1) = P(C2) = P(C3) = P(C4) = ... = P(Cn)
This process helps us to reduce the computation cost as well as time. This is how Bayes
theorem plays a significant role in Machine Learning and Naïve Bayes theorem has
simplified the conditional probability tasks without affecting the precision. Hence, we can
conclude that:
P(A|Ci) = P(A1|Ci) * P(A2|Ci) * P(A3|Ci) * ... * P(Ai|Ci)
Hence, by using Bayes theorem in Machine Learning we can easily describe the
possibilities of smaller events.

2.5 Concept Learning in Machine Learning


The problem of inducing general functions from specific training examples is central to
learning.
Concept learning can be formulated as a problem of searching through a predefined space
of potential hypotheses for the hypothesis that best fits the training examples.
What is Concept Learning…?
“A task of acquiring potential hypothesis (solution) that best fits the given training
examples.”

Figure 0.1: Concept Learning

Consider the example task of learning the target concept “days on which my friend
Prabhas enjoys his favorite water sport.”
Below Table describes a set of example days, each represented by a set of attributes. The
attribute EnjoySport indicates whether or not Prabhas enjoys his favorite water sport
on this day. The task is to learn to predict the value of EnjoySport for an arbitrary day,
based on the values of its other attributes.

What hypothesis representation shall we provide to the learner in this case?
Let us begin by considering a simple representation in which each hypothesis consists of
a conjunction of constraints on the instance attributes.
In particular, let each hypothesis be a vector of six constraints, specifying the values of
the six attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast.
For each attribute, the hypothesis will either
• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "ø" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as
a positive example (h(x) = 1).
To illustrate, the hypothesis that Prabhas enjoys his favorite sport only on cold days with
high humidity (independent of the values of the other attributes) is represented by the
expression
(?, Cold, High, ?, ?, ?)
Most General and Specific Hypothesis
The most general hypothesis, that every day is a positive example, is represented by
(?, ?, ?, ?, ?, ?), and the most specific possible hypothesis, that no day is a positive
example, is represented by (ø, ø, ø, ø, ø, ø).

2.5.1 A CONCEPT LEARNING TASK – Search


Concept learning can be viewed as the task of searching through a large space of
hypotheses implicitly defined by the hypothesis representation.
The goal of this search is to find the hypothesis that best fits the training examples.
It is important to note that by selecting a hypothesis representation, the designer of the
learning algorithm implicitly defines the space of all hypotheses that the program can ever
represent and therefore can ever learn.
Instance Space
Consider, for example, the instances X and hypotheses H in the EnjoySport learning
task.
Given that the attribute Sky has three possible values, and that AirTemp, Humidity,
Wind, Water, and Forecast each have two possible values, the instance space X contains
exactly 3 × 2 × 2 × 2 × 2 × 2 = 96 distinct instances.
Example:
Let's assume there are two features F1 and F2, where F1 has A and B as possible values
and F2 has X and Y as possible values.
F1 → A, B; F2 → X, Y
Instance space: (A, X), (A, Y), (B, X), (B, Y) – 4 instances
Syntactically distinct hypothesis space: (A, X), (A, Y), (A, ø), (A, ?), (B, X), (B, Y),
(B, ø), (B, ?), (ø, X), (ø, Y), (ø, ø), (ø, ?), (?, X), (?, Y), (?, ø), (?, ?) – 16
Semantically distinct hypothesis space: (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?),
(?, X), (?, Y), (?, ?), plus the single always-negative (empty) hypothesis – 10
Similarly, there are 5 × 4 × 4 × 4 × 4 × 4 = 5120 syntactically distinct hypotheses within H.
Notice, however, that every hypothesis containing one or more "ø" symbols represents
the empty set of instances; that is, it classifies every instance as negative.
Therefore, the number of semantically distinct hypotheses is only 1 + (4 × 3 × 3 × 3 × 3 × 3) =
973.
Our EnjoySport example is a very simple learning task, with a relatively small,
finite hypothesis space.

2.5.2 General-to-Specific Ordering of Hypotheses


To illustrate the general-to-specific ordering, consider the two hypotheses:
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
Now consider the sets of instances that are classified positive by h1 and by h2. Because
h2 imposes fewer constraints on the instance, it classifies more instances as positive.
In fact, any instance classified positive by h1 will also be classified positive by
h2. Therefore, we say that h2 is more general than h1.
For any instance x in X and hypothesis h in H, we say that x satisfies h if and only if
h(x) = 1.
We define the more_general_than_or_equal_to relation in terms of the sets of instances
that satisfy the two hypotheses.
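A small Python sketch of these definitions follows; the hypothesis encoding with "?" and "ø" mirrors the text, while the function names are my own illustrative choices:

def satisfies(h, x):
    # h(x) = 1 iff every constraint of hypothesis h accepts the matching attribute of x.
    return all(hc == '?' or (hc != 'ø' and hc == xc) for hc, xc in zip(h, x))

def more_general_or_equal(h2, h1, instances):
    # h2 >=g h1 iff every instance that satisfies h1 also satisfies h2.
    return all(satisfies(h2, x) for x in instances if satisfies(h1, x))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')
x = ('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')
print(satisfies(h1, x), satisfies(h2, x))      # True True
print(more_general_or_equal(h2, h1, [x]))      # True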

2.6 Bayes Optimal Classifier and Naive Bayes Classifier
The Bayes Optimal Classifier is a probabilistic model that predicts the most likely
outcome for a new situation. In this section, we'll have a look at the Bayes optimal
classifier and the Naive Bayes classifier.
The Bayes theorem is a method for calculating a hypothesis's probability based on its prior
probability, the probabilities of observing specific data given the hypothesis, and the
observed data itself.

Lecture: 11

2.7 What is Naïve Bayes Classifier in Machine Learning


The Naïve Bayes classifier is also a supervised algorithm; it is based on the Bayes theorem and
used to solve classification problems. It is one of the simplest and most effective
classification algorithms in Machine Learning, enabling us to build various ML
models for quick predictions. It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object. Some popular applications of Naïve Bayes are spam
filtering, sentiment analysis, and classifying articles.
2.7.1 Advantages of Naïve Bayes Classifier in Machine Learning:
It is one of the simplest and most effective methods for calculating conditional
probability and for text classification problems.
A Naïve Bayes classifier performs better than other models when the assumption of
independent predictors holds true.
It is easier to implement than other models.
It requires only a small amount of training data to estimate the test data, which minimizes
the training time.
It can be used for binary as well as multi-class classification.
2.7.2 Disadvantages of Naïve Bayes Classifier in Machine Learning:
The main disadvantage of using Naïve Bayes classifier algorithms is that they rest on the
assumption of independent predictors: the model implicitly assumes that all attributes are
independent or unrelated, but in real life it is rarely feasible to get mutually independent
attributes.
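A minimal sketch with scikit-learn's GaussianNB, using the bundled iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)          # estimates P(Ci) and P(Ak|Ci) from the data
print("Test accuracy:", nb.score(X_te, y_te))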

2.8 Bayesian Belief Network in artificial intelligence


A Bayesian belief network is a key computer technology for dealing with probabilistic events
and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian
model.
Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and experts' opinions, and it
consists of two parts:

Directed Acyclic Graph

Table of conditional probabilities.


The generalized form of Bayesian network that represents and solve decision problems
under uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:

Figure 2.9: Directed Acyclic Graph

 Each node corresponds to the random variables, and a variable can be continuous or
discrete.
 Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables. These directed links or arrows connect the pair of nodes in
the graph.
These links represent that one node directly influences the other node; if there is no
directed link, then the nodes are independent of each other.
 In the above diagram, A, B, C, and D, are random variables represented by the nodes of
the network graph.
 If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
 Node C is independent of node A.

Note: The Bayesian network graph does not contain any cyclic graph. Hence, it is
known as a directed acyclic graph or DAG

The Bayesian network has mainly two components:

i. Causal Component
ii. Actual numbers

Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
Bayesian network is based on Joint probability distribution and conditional probability.
So let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probability of a particular
combination of x1, x2, x3, ..., xn is known as the joint probability distribution.
P[x1, x2, x3, ..., xn] can be written as follows in terms of the joint probability
distribution:
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn]
In general, for each variable Xi, we can write the equation as:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
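As a tiny sketch of this factorization, consider a two-node network A -> B with made-up conditional probability tables; everything here is an invented example, not a prescribed model:

# Hypothetical CPTs for a two-node Bayesian network A -> B.
P_A = {True: 0.3, False: 0.7}                     # prior P(A)
P_B_given_A = {True: {True: 0.9, False: 0.1},     # P(B | A)
               False: {True: 0.2, False: 0.8}}

def joint(a, b):
    # P(A, B) = P(A) * P(B | A), the chain-rule factorization above.
    return P_A[a] * P_B_given_A[a][b]

print(joint(True, True))    # 0.27
print(sum(joint(a, b) for a in (True, False) for b in (True, False)))  # 1.0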

Lecture: 12

2.9 EM Algorithm in Machine Learning


The EM algorithm is a latent variable model used to find the local maximum
likelihood parameters of a statistical model; it was proposed by Arthur Dempster, Nan Laird,
and Donald Rubin in 1977. The EM (Expectation-Maximization) algorithm is one of
the most commonly used techniques in machine learning to obtain maximum likelihood
estimates for variables that are sometimes observable and sometimes not. It is also
applicable to unobserved data, sometimes called latent data. It has various real-world
applications in statistics, including obtaining the mode of the posterior
marginal distribution of parameters in machine learning and data mining
applications.
In most real-life applications of machine learning, it is found that several relevant learning
features are available, but very few of them are observable while the rest are unobservable.
If a variable is observable, then its value can be predicted using instances. On the other
hand, for variables which are latent or not directly observable, the
Expectation-Maximization (EM) algorithm plays a vital role in predicting their values, with the
condition that the general form of the probability distribution governing those latent variables
is known to us. In this topic, we will discuss a basic introduction to the EM algorithm,
a flow chart of the EM algorithm, its applications, and the advantages and disadvantages of the
EM algorithm.
2.9.1 What is an EM algorithm?
The Expectation-Maximization (EM) algorithm is defined as the combination of various
unsupervised machine learning algorithms, which is used to determine the local
maximum likelihood estimates (MLE) or maximum a posteriori estimates (MAP)
for unobservable variables in statistical models. Further, it is a technique to find maximum
likelihood estimation when the latent variables are present. It is also referred to as the latent
variable model.
A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable.
These unobservable variables are known as latent variables.
Key Points:
i. It is known as the latent variable model to determine MLE and MAP parameters for latent
variables.
ii. It is used to predict values of parameters in instances where data is missing or unobservable
for learning, and this is done until convergence of the values occurs.

2.9.2 EM Algorithm
The EM algorithm is a combination of various unsupervised ML algorithms, such
as the k-means clustering algorithm. Being an iterative approach, it consists of two
modes. In the first mode, we estimate the missing or latent variables; hence it is referred
to as the expectation/estimation step (E-step). The other mode is used to
optimize the parameters of the model so that it can explain the data more clearly; it is
known as the maximization step (M-step).

Figure 2.10: EM Algorithm

iii. Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
iv. Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
v. Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the dataset
to estimate the missing data of the latent variables and then use that data to update the
values of the parameters in the M-step.

2.9.3 What is Convergence in the EM algorithm?
Convergence here has the usual intuitive meaning in probability: if two random variables
have a very small difference in their probability, they are said to have converged. In
other words, whenever the values of the given variables come to match each other, it is
called convergence.
2.9.4 Steps in EM Algorithm
The EM algorithm is completed mainly in 4 steps, which include Initialization Step,
Expectation Step, Maximization Step, and convergence Step. These steps are
explained as follows:

Figure 2.10: Flow Chart

vi. 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from
a specific model.
vii. 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or guess
the values of the missing or incomplete data using the observed data. Further, E-step
primarily updates the variables.

viii. 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily updates
the hypothesis.
ix. 4th step: The last step is to check if the values of latent variables are converging or not. If
it gets "yes", then stop the process; else, repeat the process from step 2 until the
convergence occurs.
2.9.5 Gaussian Mixture Model (GMM)
The Gaussian Mixture Model or GMM is defined as a mixture model combining several
probability distribution functions whose parameters are unspecified. Further, GMM
requires estimated statistics such as the mean and standard deviation, or other parameters.
It is used to estimate the parameters of the probability distributions so as to best fit the
density of a given training dataset. Although there are plenty of techniques available
to estimate the parameters of a Gaussian Mixture Model (GMM), Maximum
Likelihood Estimation is one of the most popular among them.
Let's understand a case where we have a dataset with multiple data points generated
by two different processes. Both processes follow a similar Gaussian
probability distribution, and the data are combined; hence it is very difficult to discriminate
which distribution a given point may belong to.
The process used to generate each data point represents a latent variable, i.e., unobservable
data. In such cases, the Expectation-Maximization algorithm is one of the best techniques
to estimate the parameters of the Gaussian distributions. In the EM
algorithm, the E-step estimates the expected value for each latent variable, whereas the M-step
optimizes the parameters significantly using Maximum Likelihood Estimation (MLE).
Further, this process is repeated until a good set of latent values and a maximum
likelihood that fits the data are achieved.
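A short sketch fitting a two-component GMM with scikit-learn, whose GaussianMixture estimator runs EM internally; the synthetic one-dimensional data below is invented for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D Gaussian processes, mixed together.
data = np.concatenate([rng.normal(0, 1, 300),
                       rng.normal(5, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)  # EM runs here
print("Means:", gmm.means_.ravel())
print("Converged:", gmm.converged_, "after", gmm.n_iter_, "iterations")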

2.9.6 Applications of EM algorithm


The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or latent variable model has a broad
range of real-life applications in machine learning. These are as follows:

x. The EM algorithm is applicable in data clustering in machine learning. It is often used in
computer vision and NLP (Natural language processing).
xi. It is used to estimate the value of the parameter in mixed models such as the
Gaussian Mixture Model and quantitative genetics.
xii. It is also used in psychometrics for estimating item parameters and latent abilities of item
response theory models.
xiii. It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
xiv. It is used to determine the Gaussian density of a function.

 Advantages of EM algorithm
i. It is very easy to implement the first two basic steps of the EM algorithm in various
machine learning problems, which are E-step and M- step.
ii. It is guaranteed that the likelihood will not decrease after each iteration.
iii. It often generates a solution for the M-step in the closed form.

 Disadvantages of EM algorithm
i. The convergence of the EM algorithm is very slow.
ii. It can make convergence for the local optima only.
iii. It takes both forward and backward probability into consideration. It is opposite to that
of numerical optimization, which takes only forward probabilities.

Lecture: 13
2.10 Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the below diagram, in which two different
categories are classified using a decision boundary or hyperplane:

Figure 2.11: SVM

Example: SVM can be understood with the example that we used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs; if we want
a model that can accurately identify whether it is a cat or a dog, such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn about the different features of cats and dogs, and then
we test it with this strange creature. Since the support vectors create a decision boundary
between these two classes (cat and dog) and choose the extreme cases (support vectors),
the model will look at the extreme cases of cat and dog. On the basis of the support
vectors, it will classify it as a cat. Consider the below diagram:

Figure 2.12: Data Flow

SVM algorithm can be used for Face detection, image classification, text categorization,
etc.
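A minimal classification sketch with scikit-learn's SVC; the dataset and parameter choices are illustrative only:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel='linear', C=1.0).fit(X_tr, y_tr)
print("Support vectors per class:", svm.n_support_)  # the extreme points
print("Test accuracy:", svm.score(X_te, y_te))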

Lecture: 14
2.11 Types of SVM
SVM can be of two types:

Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
 Major Kernel Functions in Support Vector Machine
What is Kernel Method?
A set of techniques known as kernel methods are used in machine learning to address
classification, regression, and other prediction issues. They are built around the idea of
kernels, which are functions that gauge how similar two data points are to one another
in a high-dimensional feature space.
The fundamental premise of kernel methods is to convert the input data into a high-
dimensional feature space, which makes it simpler to distinguish between classes or
generate predictions. Kernel methods employ a kernel function to implicitly map the data
into the feature space, as opposed to manually computing the feature space.
The most popular kind of kernel approach is the Support Vector Machine (SVM), a binary
classifier that determines the best hyperplane that most effectively divides the two
groups. In order to efficiently locate the ideal hyperplane, SVMs map the input into a
higher-dimensional space using a kernel function.
Other examples of kernel methods include kernel ridge regression, kernel PCA, and
Gaussian processes. Since they are strong, adaptable, and computationally efficient, kernel
approaches are frequently employed in machine learning. They are resilient to noise
and outliers and can handle sophisticated data structures like strings and graphs.

2.11.1 Kernel Method in SVMs

Support Vector Machines (SVMs) use kernel methods to transform the input data into
a higher-dimensional feature space, which makes it simpler to distinguish between classes
or generate predictions. Kernel approaches in SVMs work on the fundamental principle
of implicitly mapping input data into a higher-dimensional feature space without directly
computing the coordinates of the data points in that space.
The kernel function in SVMs is essential in determining the decision boundary that divides
the various classes. In order to calculate the degree of similarity between any two points
in the feature space, the kernel function computes their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial basis function
(RBF) kernel. The RBF kernel maps the input data into an infinite-dimensional feature
space using a Gaussian function. This kernel function is popular because it can capture
complex nonlinear relationships in the data.
Other types of kernel functions that can be used in SVMs include the polynomial kernel,
the sigmoid kernel, and the Laplacian kernel. The choice of kernel function depends on
the specific problem and the characteristics of the data.
Basically, kernel methods in SVMs are a powerful technique for solving classification and
regression problems, and they are widely used in machine learning because they can
handle complex data structures and are robust to noise and outliers.

 Characteristics of Kernel Function


Kernel functions used in machine learning, including in SVMs (Support Vector
Machines), have several important characteristics, including:
i. Mercer's condition: A kernel function must satisfy Mercer's condition to be valid.
This condition ensures that the kernel function is positive semi-definite, which
means that it is always greater than or equal to zero.
ii. Positive definiteness: A kernel function is positive definite if it is always
greater than zero except for when the inputs are equal to each other.
iii. Non-negativity: A kernel function is non-negative, meaning that it produces non-
negative values for all inputs.
iv. Symmetry: A kernel function is symmetric, meaning that it produces the same
value regardless of the order in which the inputs are given.

v. Reproducing property: A kernel function satisfies the reproducing property if it
can be used to reconstruct the input data in the feature space.
vi. Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.
vii. Complexity: The complexity of a kernel function is an important consideration,
as more complex kernel functions may lead to over fitting and reduced
generalization performance.
Basically, the choice of kernel function depends on the specific problem and the
characteristics of the data, and selecting an appropriate kernel function can
significantly impact the performance of machine learning algorithms.
2.11.2 Major Kernel Function in Support Vector Machine
In Support Vector Machines (SVMs), there are several types of kernel functions that can
be used to map the input data into a higher-dimensional feature space. The choice of kernel
function depends on the specific problem and the characteristics of the data.
Here are some most commonly used kernel functions in SVMs:
 Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function,
and it defines the dot product between the input vectors in the original feature space.
The linear kernel can be defined as:
K(x, y) = x · y
Where x and y are the input feature vectors. The dot product of the input vectors is a
measure of their similarity or distance in the original feature space.
When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that
separates the different classes in the feature space. This linear boundary can be useful
when the data is already separable by a linear decision boundary or when dealing with
high-dimensional data, where the use of more complex kernel functions may lead to
overfitting.
 Polynomial Kernel
A particular kind of kernel function utilised in machine learning, such as in SVMs, is
a polynomial kernel (Support Vector Machines). It is a nonlinear kernel function that
employs polynomial functions to transfer the input data into a higher-dimensional feature
space.
One definition of the polynomial kernel is:
K(x, y) = (x · y + c)^d
where x and y are the input feature vectors, c is a constant term, and d is the degree
of the polynomial. The constant term is added to the dot product of the input vectors,
and the result is raised to the degree of the polynomial.
The decision boundary of an SVM with a polynomial kernel is a nonlinear hyperplane, so it
can capture more intricate correlations between the input features. The degree of the
polynomial determines the degree of nonlinearity in the decision boundary.
The polynomial kernel has the benefit of being able to detect both linear and nonlinear
correlations in the data. It can be difficult to select the proper degree of the polynomial,
though, as a larger degree can result in overfitting while a lower degree may not adequately
represent the underlying relationships in the data.
In general, the polynomial kernel is an effective tool for converting the input data
into a higher-dimensional feature space in order to capture nonlinear correlations between
the input characteristics.
 Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a
popular kernel function used in machine learning, particularly in SVMs (Support Vector
Machines). It is a nonlinear kernel function that maps the input data into a higher-
dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||^2)

Where x and y are the input feature vectors, gamma is a parameter that controls the width
of the Gaussian function, and ||x - y||^2 is the squared Euclidean distance between the input
vectors.
When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear hyper
plane that can capture complex nonlinear relationships between the input features.
The width of the Gaussian function, controlled by the gamma parameter, determines the
degree of nonlinearity in the decision boundary.
One advantage of the Gaussian kernel is its ability to capture complex relationships in
the data without the need for explicit feature engineering. However, the choice of the gamma parameter can be challenging, as a smaller value may result in underfitting, while a larger value may result in overfitting.
 Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type
of kernel function used in machine learning, including in SVMs (Support Vector
Machines). It is a non-parametric kernel that can be used to measure the similarity or
distance between two input feature vectors.
The Laplacian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||)

Where x and y are the input feature vectors, gamma is a parameter that controls the width

94
of the Laplacian function, and ||x - y|| is the L1 norm or Manhattan distance between the
input vectors.
When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear
hyperplane that can capture complex relationships between the input features. The width
of the Laplacian function, controlled by the gamma parameter, determines the degree
of nonlinearity in the decision boundary.
One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight
on large distances between the input vectors than the Gaussian kernel. However, like the
Gaussian kernel, choosing the correct value of the gamma parameter can be challenging.

95
Lecture: 15

2.12 Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the dataset, which means that if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has the maximum margin, i.e., the maximum distance from the nearest data points of each class.
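A standard way to make this precise (the formulation below is the usual SVM setup, stated here as an assumption since the text does not spell it out): write the hyperplane as w . x + b = 0 and scale w and b so that the closest points of each class satisfy y(w . x + b) = 1 for labels y in {−1, +1}. The margin then equals 2/||w||, so finding the maximum-margin hyperplane amounts to minimizing ||w||²/2 subject to yᵢ(w . xᵢ + b) ≥ 1 for every training point.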

2.12.1 Support Vectors:


The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.

 Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose
we have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green
or blue. Consider the below image:

Figure 2.12: SVM

Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:

96
Figure 2.13: Hyperplane

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin. And the goal of SVM is
to maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.

Figure 2.14: Constructing Hyperplane
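As a concrete illustration, a linear SVM like the one sketched in the figures can be fitted with scikit-learn; the six toy points below are invented for the example, and the fitted model exposes the support vectors and the hyperplane parameters:

import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (invented toy data)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)

print(clf.support_vectors_)        # the points closest to the hyperplane
print(clf.coef_, clf.intercept_)   # w and b of the hyperplane w . x + b = 0
print(clf.predict([[3, 2]]))       # classify a new (x1, x2) pair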

 Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

97
Figure 2.15: Non- Linear

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²

By adding the third dimension, the sample space will become as below image:

Figure 2.16: Hyperplane

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
98
Figure 2.17: Best Hyperplane

Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:

Figure 2.18: Dimensional Hyperplane

Hence, we get a circle of radius 1 in the case of non-linear data.
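This transformation is easy to verify numerically. In the sketch below (with invented circular toy data), the added coordinate z = x² + y² pushes the inner and outer rings to clearly different heights, so a plane such as z = 1 separates them:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 20)

# Inner class on a ring of radius 0.5, outer class on a ring of radius 2.0:
# not separable by any straight line in the (x, y) plane
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]

# The third dimension z = x^2 + y^2 maps the rings to z = 0.25 and z = 4.0,
# so the plane z = 1 separates the two classes perfectly
z_inner = inner[:, 0] ** 2 + inner[:, 1] ** 2
z_outer = outer[:, 0] ** 2 + outer[:, 1] ** 2
print(z_inner.max(), z_outer.min())   # 0.25 and 4.0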

99
Lecture: 16
2.13 Properties of SVM
 Flexibility in choosing a similarity function
 Sparseness of solution when dealing with large data sets
 only support vectors are used to specify the separating hyperplane
 Ability to handle large feature spaces
 Complexity does not depend on the dimensionality of the feature space
 Overfitting can be controlled by soft margin approach
 Nice math property: a simple convex optimization problem which
is guaranteed to converge to a single global solution
 Feature Selection

2.13.1 The Disadvantages of Support Vector Machine (SVM) are:


 Unsuitable to Large Datasets
 Large training time
 More features, more complexities
 Bad performance on high noise
 Does not determine Local optima

i. Unsuitable to Large Datasets


Support Vector Machines create a margin of separation between the data points to be classified. Using large datasets has its drawbacks even when the kernel trick is applied. No matter how computationally efficient the calculation is, the method is suitable only for small to medium-sized datasets, as the feature space can be very high dimensional, or even infinite dimensional, and training becomes infeasible for large datasets. The kernel trick can still give rich feature-space representations with many fewer dimensions than data points, but SVM will not handle large datasets and many dimensions at the same time.

ii. Large training time


Due to the high computational complexity and the reasons stated above, even if the kernel trick is used, SVM classification will be tedious, as the calculations consume a lot of processing time. This results in a long training time on the datasets.

100
iii. More features, more complexities
The more features are taken into consideration, the more dimensions come into play. If the number of features is much greater than the number of samples, avoiding overfitting through a careful choice of kernel function and regularization term is crucial.

iv. Bad performance on high noise


SVM does not perform very well when the dataset has more noise. Noisy data contains many overlapping points, so there is a problem in drawing a clear hyperplane without misclassifying.
Soft margin classification, however, allows misclassification to a small extent. But as the noise increases, the overlapping data points and disturbances result in more misclassifications, which is not ideal.

v. Does not determine Local optima


The SVM objective is a convex optimization problem, so there are no local optima to determine: if you use gradient descent to solve the SVM optimization problem, you will always converge to the global minimum.

101
2.14 Important Questions (Previous Year Question)
Q1: Explain Linear, Polynomial and Gaussian Kernel (Radial Basis Function) in detail.
Q2: Differentiate between Linear Regression and Logistic Regression.
Q3: What are the types of Logistics Regression?
Q4: Describe briefly Linear Regression and Logistic Regression.
Q5: What is the assumption in Naïve Bayesian Algorithm that makes it different from
Bayesian Theorem?
Q6: Discuss the various properties and issues of SVM.
Q7: Why SVM is an example of a large margin classifier? Discuss the different kernel
functions used in SVM.
Q8: Explain the EM algorithm with the necessary steps.
Q9: Write short note on “Bayesian Belief Networks”.
Q10: What is Bayesian Learning? Explain how the decision error for Bayesian
Classification is minimized.
Q11: Define Bayes Classifier. Explain how Classification is done using Bayes Classifier.
Q12: Discuss Bayes Classifier using some examples in detail.
Q13: Explain Naïve Bayes Classifier.
Q14: Describe the Usage, Advantages and Disadvantages of EM Algorithm.
Q15: How is the Bayesian Network powerful representation for uncertainty knowledge?
Explain with example.
Q16: Explain the role Prior Probability and Posterior Probability in Bayesian
Classification.
Q17: Explain the types and properties of Support Vector Machine.
Q18: What are the parameters used in Support Vector Classifier?
Q19: What are the Advantages and Disadvantages of Support Vector Machines?
Q20: Write a short Note on Hyper plane (Decision Surface).

102
UNIT 3 – Decision Tree Learning

3. DECISION TREE LEARNING AND INSTANCE BASED LEARNING

DECISION TREE LEARNING


WHY
a. To understand latest trends in Machine Learning.
b. To understand the Decision Tree Learning Algorithm, Inductive bias, Inductive inference with decision trees, Entropy, Information theory and Information gain.
c. To understand the ID3 Algorithm and issues in decision tree learning.
WHAT
a. Implement and analyse various problems using Decision Tree Machine Learning Algorithms.
WHERE
a. Classification of Homogeneous data.
b. Classification of Categorical and Numerical data.
Lecture: 17
3.1 Decision Tree Classification Algorithm
 Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
 The decisions or the test are performed on the basis of features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
 A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
103
Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric
data.

Figure 3.1: Decision Tree

3.1.1 Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm
for the given dataset and problem is the main point to remember while creating a
machine learning model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so it
is easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
104
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
3.1.2 How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
Step-3: Divide the S into subsets that contains possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits further
into the next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision node
(Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offers and Declined offer). Consider the below diagram:

105
Figure 3.2: Decision Tree Example
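A tree like the one in Figure 3.2 can be learned directly with scikit-learn. In the minimal sketch below, the offer data (salary, distance from office, cab facility as 0/1) and all values are invented purely for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: [salary, distance_from_office, cab_facility]
X = [[60, 5, 1], [55, 30, 1], [20, 5, 0], [65, 25, 0], [58, 4, 0], [18, 10, 1]]
y = ['Accept', 'Accept', 'Decline', 'Decline', 'Accept', 'Decline']

tree = DecisionTreeClassifier(criterion='entropy')   # ASM: information gain
tree.fit(X, y)
print(export_text(tree, feature_names=['salary', 'distance', 'cab']))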

3.1.3 Attribute Selection Measures


While implementing a Decision Tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
I. Information Gain
II. Gini Index
i. Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of
a dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision
tree.
A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where,
106
S= Total number of samples
P(yes)= probability of yes
P(no)= probability of no
ii. Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
An attribute with the low Gini index should be preferred as compared to the high
Gini index.
It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
Gini index can be calculated using the below formula:
Gini Index = 1 − ∑ⱼ Pⱼ²

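Both measures are straightforward to compute from the class counts; here is a minimal Python sketch of the two formulas above:

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini Index = 1 - sum(p_j^2) over the classes in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = ['yes'] * 9 + ['no'] * 5
print(entropy(labels))   # ≈ 0.94
print(gini(labels))      # ≈ 0.46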
Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. Therefore, a technique that decreases the size
of the learning tree without reducing accuracy is known as Pruning. There are mainly
two types of tree pruning techniques used:
i. Cost Complexity Pruning
ii. Reduced Error Pruning.
Advantages of the Decision Tree
 It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
 The decision tree contains lots of layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
 For more class labels, the computational complexity of the decision tree may
increase.
107
Lecture:18

3.2 Inductive Bias


A technique of machine learning called inductive learning trains a model to generate predictions based on examples or observations. During inductive learning, the model picks up knowledge from particular examples or instances and generalizes it so that it can predict outcomes for brand-new data.
3.3 Inductive inference with decision tree
Inductive inference with decision trees is a machine learning technique that uses
specific examples to reach a general conclusion that can be applied to unseen
examples.
What is the inductive bias in decision tree learning?
Inductive bias is anything which makes the algorithm learn one pattern instead of
another pattern (e.g. step-functions in decision trees instead of continuous function
in a linear regression model). Learning is the process of apprehending useful
knowledge by observing and interacting with the world.
3.4 What is Inductive Learning Algorithm?
Inductive Learning Algorithm (ILA) is an iterative and inductive machine
learning algorithm that is used for generating a set of classification rules, which
produces rules of the form “IF-THEN”, for a set of examples, producing rules at
each iteration and appending to the set of rules.
There are basically two methods for knowledge extraction: firstly from domain experts, and then with machine learning. For a very large amount of data, the domain experts are not very useful and reliable, so we move towards the machine learning approach for this work. One machine learning method is to replicate the expert's logic in the form of algorithms, but this work is very tedious, time-consuming, and expensive. So we move towards inductive algorithms, which generate the strategy for performing a task without needing separate instruction at each step.
Why you should use Inductive Learning?
The ILA is a newer algorithm that was needed even when other inductive learning algorithms like ID3 and AQ were available.
 The need was due to the pitfalls which were present in the previous algorithms, one
of the major pitfalls was the lack of generalization of rules.
 The ID3 and AQ used the decision tree production method which was too specific
which were difficult to analyze and very slow to perform for basic short
classification problems.
 The decision tree-based algorithm was unable to work for a new problem if some
attributes are missing.

108
 The ILA uses the method of production of a general set of rules instead of decision
trees, which overcomes the above problems
Basic Requirements to Apply Inductive Learning Algorithm
i. List the examples in the form of a table ‘T’ where each row corresponds to an example
and each column contains an attribute value.
ii. Create a set of m training examples, each example composed of k attributes and a
class attribute with n possible decisions.
iii. Create a rule set, R, having the initial value false.
iv. Initially, all rows in the table are unmarked.
Necessary Steps for Implementation
 Step 1: divide the table ‘T’ containing m examples into n sub-tables (t1, t2,…..tn).
One table for each possible value of the class attribute. (repeat steps 2-8 for each
sub-table)
 Step 2: Initialize the attribute combination count ‘ j ‘ = 1.
 Step 3: For the sub-table on which work is going on, divide the attribute list into
distinct combinations, each combination with ‘j ‘ distinct attributes.
 Step 4: For each combination of attributes, count the number of occurrences of
attribute values that appear under the same combination of attributes in unmarked
rows of the sub-table under consideration, and at the same time, not appears under
the same combination of attributes of other sub-tables. Call the first combination
with the maximum number of occurrences the max-combination ‘ MAX’.
 Step 5: If ‘MAX’ == null, increase ‘ j ‘ by 1 and go to Step 3.
 Step 6: Mark all rows of the sub-table where working, in which the values of ‘MAX’
appear, as classified.
 Step 7: Add a rule (IF attribute = “XYZ” –> THEN decision is YES/ NO) to R whose
left-hand side will have attribute names of the ‘MAX’ with their values separated by
AND, and its right-hand side contains the decision attribute value associated with
the sub-table.
 Step 8: If all rows are marked as classified, then move on to process another sub-table
and go to Step 2. Else, go to Step 4. If no sub-tables are available, exit with the set
of rules obtained till then.

109
Lecture: 19

3.5 Entropy and Information Gain


Entropy and information gain are key concepts in domains such as information
theory, data science, and machine learning. Information gain is the amount of
knowledge acquired during a certain decision or action, whereas entropy is a
measure of uncertainty or unpredictability. People can handle difficult situations and
make wise judgments across a variety of disciplines when they have a solid
understanding of these principles. Entropy can be used in data science, for instance,
to assess the variety or unpredictable nature of a dataset, whereas Information Gain
can assist in identifying the qualities that would be most useful to include in a model.
In this article, we'll examine the main distinctions between entropy and information
gain and how they affect machine learning.
What is Entropy?
The term "entropy" comes from the study of thermodynamics, and it describes how
chaotic or unpredictable a system is. Entropy is a measurement of a data set's
impurity in the context of machine learning. In essence, it is a method of calculating
the degree of uncertainty in a given dataset.
The following formula is used to compute entropy −
Entropy(S) = −p₁ log₂(p₁) − p₂ log₂(p₂) − … − pₙ log₂(pₙ)
S is the data set, and p1 through pn are the percentages of various classes inside the
data. The resultant entropy value is expressed in bits since the base 2 logarithm used
in this method is typical.
Consider a dataset with two classes, A and B, in order to comprehend this formula.
The entropy can be determined as follows if 80% of the data is in class A and 20%
is in class B −
Entropy(S) = −0.8 log₂(0.8) − 0.2 log₂(0.2) = 0.72 bits
This indicates that the dataset is fairly impure, with an entropy of 0.72 bits.
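The arithmetic is easy to verify in Python:

import math
print(-0.8 * math.log2(0.8) - 0.2 * math.log2(0.2))   # 0.7219... ≈ 0.72 bits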
3.6 What is Information Gain?
Information Gain is a statistical metric used to assess a feature's applicability in a
dataset. It is an important idea in machine learning and is frequently utilized in
decision tree algorithms. By contrasting the dataset's entropy before and after a
feature is separated, information gain is estimated. A feature's relevance to the
categorization of the data increases with information gain.
When the dataset has been divided based on a feature, information gain calculates
the entropy decrease. The amount of knowledge a feature imparts about the class is
measured by this metric. Selecting the characteristic that provides the most
information about the class will help you achieve your aim of maximizing
information gain.
110
The following formula is used to compute information gain:
Information Gain(S, A) = Entropy(S) − ∑ᵥ (|Sᵥ| / |S|) × Entropy(Sᵥ)
Where S is the set of data, A is a feature, Sᵥ is the subset of S for which feature A takes the value v, |Sᵥ| is the number of elements in Sᵥ, and |S| is the total number of elements in S.
Think of a dataset with two characteristics, X and Y, to better comprehend this formula. If the data is to be divided based on characteristic X, the information gain can be calculated as:
Information Gain(S, X) = Entropy(S) − [(3/5) × Entropy(S₁) + (2/5) × Entropy(S₂)]
where S1 is the subset of data where feature X takes a value of 0, and S2 is the subset
of data where feature X takes a value of 1. These two subsets' entropies, Entropy(S1)
and Entropy(S2), can be determined using the formula we previously covered.
The resulting information gain shows how much the entropy is reduced by dividing the dataset based on characteristic X.
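A minimal Python sketch of this formula (the binary feature and labels below are invented for illustration):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # IG(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over values v of A
    feature, labels = np.asarray(feature), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(feature):
        subset = labels[feature == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

X = [0, 0, 0, 1, 1]              # a binary feature
y = ['A', 'A', 'B', 'B', 'B']    # the class labels
print(information_gain(X, y))    # ≈ 0.42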

3.7 Key Differences between Entropy and Information Gain


Table 3.1: Entropy v/s Information Gain

i. Entropy is a measurement of the disorder or impurity of a set of occurrences; it determines the typical amount of information needed to classify a sample taken from the collection. Information gain is a metric for the entropy reduction brought about by segmenting a set of instances according to a feature; it gauges the amount of knowledge a characteristic imparts about the class of an example.
ii. Entropy is calculated for a set of examples by calculating the probability of each class in the set and using that information in the entropy calculation. Information gain is determined for each feature by dividing the collection of instances depending on the feature and calculating the entropies of the resulting subsets; the information gain is the difference between the entropy of the original set and the weighted sum of the entropies of the subsets.
iii. Entropy quantifies the disorder or impurity present in a collection of instances and aims to be minimized by identifying the ideal division. Information gain aims to be maximized: by choosing the feature with the maximum information gain, it maximizes the utility of a feature for categorization.
iv. Entropy is typically taken into account by decision trees for determining the best split. Decision trees frequently employ information gain as the criterion for choosing the optimal feature to split on.
v. Entropy usually favors splits that result in balanced subgroups. Information gain frequently prefers splits that produce imbalanced subsets with pure classes.
vi. Entropy can handle continuous characteristics by discretizing them into bins. Information gain can also handle continuous features by choosing the split point that maximizes the information gained.
vii. Determining entropy requires calculating probabilities and logarithms, which can be computationally costly. Gathering information gain requires calculating entropies and weighted averages, which can likewise be computationally costly.
viii. Entropy is a versatile indicator of impurity that may be applied to a variety of classification issues. Information gain is a particular measure of feature usefulness that works well for binary classification issues.
ix. Entropy, which is given in bits, calculates the typical amount of data required to categorize an example. Information gain, which is also stated in bits, indicates the reduction in uncertainty attained by splitting based on a feature.
x. If there are too many characteristics or the tree is too deep, entropy might result in overfitting. If the tree is too deep or there are too many irrelevant characteristics, information gain may likewise result in overfitting.

112
Lecture: 20

3.8 ID3 Algorithm

This article aims to clearly explain the ID3 Algorithm (one of the many algorithms used to build Decision Trees) in detail. We explain the algorithm using a synthetic sample COVID-19 dataset.

What are Decision Trees?


In simple words, a decision tree is a structure that contains nodes (rectangular boxes)
and edges(arrows) and is built from a dataset (table of columns representing
features/attributes and rows corresponds to records). Each node is either used
to make a decision (known as decision node) or represent an outcome (known as
leaf node).
Decision tree Example

Figure 3.3: Decision Tree

The picture above depicts a decision tree that is used to classify whether a person
is Fit or Unfit.
The decision nodes here are questions like 'Is the person less than 30 years of age?', 'Does the person eat junk?', etc. and the leaves are one of the two possible outcomes viz. Fit and Unfit.
Looking at the Decision Tree, we can make the following decisions:
if a person is less than 30 years of age and doesn’t eat junk food then he is Fit, if a
person is less than 30 years of age and eats junk food then he is Unfit and so on.
The initial node is called the root node (colored in blue), the final nodes are called
the leaf nodes (colored in green) and the rest of the nodes are
called intermediate or internal nodes.
The root and intermediate nodes represent the decisions while the leaf nodes
represent the outcomes.
113
ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes(divides) features into two or more groups at
each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision
tree. In simple words, the top-down approach means that we start building the tree
from the top and the greedy approach means that at each iteration we select the best
feature at the present moment to create a node.
Generally, ID3 is used only for classification problems with nominal features.
Dataset description
In this article, we’ll be using a sample dataset of COVID-19 infection. A preview of
the entire dataset is shown below.
+----+-------+-------+------------------+----------+
| ID | Fever | Cough | Breathing issues | Infected |
+----+-------+-------+------------------+----------+
| 1  | NO    | NO    | NO               | NO       |
| 2  | YES   | YES   | YES              | YES      |
| 3  | YES   | YES   | NO               | NO       |
| 4  | YES   | NO    | YES              | YES      |
| 5  | YES   | YES   | YES              | YES      |
| 6  | NO    | YES   | NO               | NO       |
| 7  | YES   | NO    | YES              | YES      |
| 8  | YES   | NO    | YES              | YES      |
| 9  | NO    | YES   | YES              | YES      |
| 10 | YES   | YES   | NO               | YES      |
| 11 | NO    | YES   | NO               | NO       |
| 12 | NO    | YES   | YES              | YES      |
| 13 | NO    | YES   | YES              | NO       |
| 14 | YES   | YES   | NO               | NO       |
+----+-------+-------+------------------+----------+
The columns are self-explanatory. The values YES and NO in the Infected column indicate whether the person is Infected or Not Infected, respectively.
The columns used to make decision nodes viz. ‘Breathing Issues’, ‘Cough’ and
‘Fever’ are called feature columns or just features and the column used for leaf nodes
i.e. ‘Infected’ is called the target column.
Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature at each step
while building a Decision tree.
Before you ask, the answer to the question: ‘How does ID3 select the best feature?’
is that ID3 uses Information Gain or just Gain to find the best feature.
Information Gain calculates the reduction in the entropy and measures how well a
given feature separates or classifies the target classes. The feature with the highest
Information Gain is selected as the best one.
In simple words, Entropy is the measure of disorder and the Entropy of a dataset is
the measure of disorder in the target feature of the dataset.
In the case of binary classification (where the target column has only two types of classes), entropy is 0 if all values in the target column are homogeneous (similar) and 1 if the target column has an equal number of values for both classes.
If we denote our dataset as S, entropy is calculated as:

Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
where,
n is the total number of classes in the target column
(in our case n = 2 i.e YES and NO)
pᵢ is the probability of class ‘i’ or the ratio of “number of rows with class i in the
target column” to the “total number of rows” in the dataset.
Information Gain for a feature column A is calculated as:

IG(S, A) = Entropy(S) - ∑((|Sᵥ| / |S|) * Entropy(Sᵥ))


where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is
the number of rows in Sᵥ and likewise |S| is the number of rows in S.
ID3 Steps
i. Calculate the Information Gain of each feature.
ii. Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.
iii. Make a decision tree node using the feature with the maximum Information gain.
116
iv. If all rows belong to the same class, make the current node as a leaf node with the
class as its label.
v. Repeat for the remaining features until we run out of all features, or the decision tree
has all leaf nodes.
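Before tracing these steps by hand, here is a minimal Python sketch (encoding YES as 1 and NO as 0) that reproduces the entropy and Information Gain values derived in the walkthrough below:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, target):
    gain = entropy(target)
    for v in np.unique(feature):
        subset = target[feature == v]
        gain -= (len(subset) / len(target)) * entropy(subset)
    return gain

# The 14-row COVID-19 dataset from the table above (YES = 1, NO = 0)
fever     = np.array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1])
cough     = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1])
breathing = np.array([0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0])
infected  = np.array([0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0])

print(entropy(infected))                # ≈ 0.99
print(info_gain(fever, infected))       # ≈ 0.13
print(info_gain(cough, infected))       # ≈ 0.04
print(info_gain(breathing, infected))   # ≈ 0.40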
Implementation on our Dataset
As stated in the previous section the first step is to find the best feature i.e. the one
that has the maximum Information Gain(IG). We’ll calculate the IG for each of the
features now, but for that, we first need to calculate the entropy of S
From the total of 14 rows in our dataset S, there are 8 rows with the target
value YES and 6 rows with the target value NO. The entropy of S is calculated as:

Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99


Note: If all the values in our target column are the same, the entropy will be zero (meaning that it has no or zero randomness).
We now calculate the Information Gain for each feature:
IG calculation for Fever:
In this(Fever) feature there are 8 rows having value YES and 6 rows having
value NO.
As shown below, in the 8 rows with YES for Fever, there are 6 rows having target
value YES and 2 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES   | YES   | YES              | YES      |
| YES   | YES   | NO               | NO       |
| YES   | NO    | YES              | YES      |
| YES   | YES   | YES              | YES      |
| YES   | NO    | YES              | YES      |
| YES   | NO    | YES              | YES      |
| YES   | YES   | NO               | YES      |
| YES   | YES   | NO               | NO       |
+-------+-------+------------------+----------+
As shown below, in the 6 rows with NO, there are 2 rows having target
value YES and 4 rows having target value NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO    | NO    | NO               | NO       |
| NO    | YES   | NO               | NO       |
| NO    | YES   | YES              | YES      |
| NO    | YES   | NO               | NO       |
| NO    | YES   | YES              | YES      |
| NO    | YES   | YES              | NO       |
+-------+-------+------------------+----------+
The block, below, demonstrates the calculation of Information Gain for Fever.
# total rows
|S| = 14

For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81

For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91

# Expanding the summation in the IG formula:
IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)
∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 = 0.13
Next, we calculate the IG for the features “Cough” and “Breathing issues”.

119
IG(S, Cough) = 0.04

IG(S, BreathingIssues) = 0.40


Since the feature Breathing issues has the highest Information Gain, it is used to create the root node.
Hence, after this initial step our tree looks like this:

Next, from the remaining two unused features, namely, Fever and Cough, we decide
which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will work with the subset
of the original data i.e the set of rows having YES as the value in the Breathing Issues
column. These 8 rows are shown below:
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES   | YES   | YES              | YES      |
| YES   | NO    | YES              | YES      |
| YES   | YES   | YES              | YES      |
| YES   | NO    | YES              | YES      |
| YES   | NO    | YES              | YES      |
| NO    | YES   | YES              | YES      |
| NO    | YES   | YES              | YES      |
| NO    | YES   | YES              | NO       |
+-------+-------+------------------+----------+
Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ
(Set Breathing Issues Yes) which is shown above :
Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not
the original dataset S.

IG(Sʙʏ, Fever) = 0.20

IG(Sʙʏ, Cough) = 0.09


IG of Fever is greater than that of Cough, so we select Fever as the left branch of
Breathing Issues:
Our tree now looks like this:

Figure 3.4: Example

Next, we find the feature with the maximum IG for the right branch of Breathing
Issues. But, since there is only one unused feature left we have no other choice but
to make it the right branch of the root node.
121
So our tree now looks like this:

Figure 3.5: Example

There are no more unused features, so we stop here and jump to the final step of
creating the leaf nodes.
For the left leaf node of Fever, we see the subset of rows from the original data set
that has Breathing Issues and Fever both values as YES.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| YES   | YES   | YES              | YES      |
| YES   | NO    | YES              | YES      |
| YES   | YES   | YES              | YES      |
| YES   | NO    | YES              | YES      |
| YES   | NO    | YES              | YES      |
+-------+-------+------------------+----------+
Since all the values in the target column are YES, we label the left leaf node as YES,
but to make it more logical we label it Infected.
Similarly, for the right node of Fever we see the subset of rows from the original
data set that have Breathing Issues value as YES and Fever as NO.
+-------+-------+------------------+----------+
| Fever | Cough | Breathing issues | Infected |
+-------+-------+------------------+----------+
| NO    | YES   | YES              | YES      |
| NO    | YES   | YES              | NO       |
| NO    | YES   | YES              | NO       |
+-------+-------+------------------+----------+
Here not all but most of the values are NO, hence NO or Not Infected becomes
our right leaf node.
Our tree, now, looks like this:

Figure 3.6: Example

We repeat the same process for the node Cough, however here both left and right
leaves turn out to be the same i.e. NO or Not Infected as shown below:

123
Figure 3.7: Example

Looks Strange, doesn’t it?


I know! The right node of Breathing issues is as good as just a leaf node with class 'Not infected'. This is one of the drawbacks of ID3: it does not do pruning.
Pruning is a mechanism that reduces the size and complexity of a Decision tree by
removing unnecessary nodes.
Another drawback of ID3 is overfitting or high variance i.e. it learns the dataset it
used so well that it fails to generalize on new data.

Figure 3.8: Example

124
Lecture: 21
3.9 k-NN Learning
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what’s nearby. Imagine a streaming service wants to predict if a new user is likely to cancel their subscription (churn) based on their age. It checks the ages of its existing users and whether they churned or stayed. If most of the “K” users closest in age to the new user canceled their subscription, KNN will predict the new user might churn too. The key idea is that users with similar ages tend to have similar behaviors, and KNN uses this closeness to make decisions.

K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset, and at the time of classification it performs an action on the dataset.

In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the
algorithm how many nearby points (neighbours) to look at when it makes a decision.

Example:

Imagine you’re deciding which fruit it is based on its shape and size. You compare
it to fruits you already know.

 If k = 3, the algorithm looks at the 3 closest fruits to the new one.

 If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is
an apple because most of its neighbours are apples.

How to choose the value of k for KNN Algorithm?

The value of k is critical in KNN as it determines the number of neighbors to consider when making predictions. Selecting the optimal value of k depends on the characteristics of the input data. If the dataset has significant outliers or noise, a higher k can help smooth out the predictions and reduce the influence of noisy data. However, choosing a very high value can lead to underfitting, where the model becomes too simplistic.

Statistical Methods for Selecting k:

 Cross-Validation: A robust method for selecting the best k is to perform k-fold cross-
validation. This involves splitting the data into k subsets training the model on some
125
subsets and testing it on the remaining ones and repeating this for each subset. The
value of k that results in the highest average validation accuracy is usually the best
choice.

 Elbow Method: In the elbow method we plot the model’s error rate or accuracy for
different values of k. As we increase k the error usually decreases initially. However
after a certain point the error rate starts to decrease more slowly. This point where the
curve forms an “elbow” that point is considered as best k.

 Odd Values for k: It’s also recommended to choose an odd value for k especially in
classification tasks to avoid ties when deciding the majority class.
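As a sketch of the cross-validation approach with scikit-learn (the bundled Iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and keep the one with the best mean 5-fold CV accuracy
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])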

Distance Metrics Used in KNN Algorithm

KNN uses distance metrics to identify nearest neighbour, these neighbours are used
for classification and regression task. To identify nearest neighbour we use below
distance metrics:

1. Euclidean Distance

Euclidean distance is defined as the straight-line distance between two points in a


plane or space. You can think of it like the shortest path you would walk if you were
to go directly from one point to another.

distance(x, Xᵢ) = √( ∑ⱼ₌₁ᵈ (xⱼ − Xᵢⱼ)² )

2. Manhattan Distance

This is the total distance you would travel if you could only move along horizontal
and vertical lines (like a grid or city streets). It’s also called “taxicab distance”
because a taxi can only drive along the grid-like streets of a city.

d(x, y) = ∑ᵢ₌₁ⁿ |xᵢ − yᵢ|

3. Minkowski Distance

Minkowski distance is like a family of distances, which includes


both Euclidean and Manhattan distances as special cases.

d(x, y) = ( ∑ᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)

126
From the formula above we can say that when p = 2 then it is the same as the formula
for the Euclidean distance and when p = 1 then we obtain the formula for the
Manhattan distance.

So, you can think of Minkowski as a flexible distance formula that can look like
either Manhattan or Euclidean distance depending on the value of p.
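All three metrics are one-liners in NumPy, as the sketch below shows; in scikit-learn's KNeighborsClassifier, the same family is available through the metric='minkowski' and p parameters:

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y))       # 5.0
print(manhattan(x, y))       # 7.0
print(minkowski(x, y, 2))    # 5.0, same as Euclidean (p = 2)
print(minkowski(x, y, 1))    # 7.0, same as Manhattan (p = 1)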

127
Lecture: 22
3.10 Locally Weighted Regression
Locally Weighted Linear Regression (LWLR) is a non-parametric, memory-
based algorithm designed to capture non-linear relationships in data. Unlike
traditional regression models that fit a single global line across the dataset, LWLR
creates localized models for subsets of data points near the query point. Each query
point has its own regression line based on weighted contributions from nearby data
points.

LWLR assigns weights to data points based on their proximity to the query point:

 Points closer to the query point have higher weights.

 Points farther away have lower weights.

This approach allows LWLR to adapt to local data structures, making it effective for
modeling non-linear relationships.

Comparison with Global Linear Regression

 Global Linear Regression: Fits a single line to the entire dataset, assuming a uniform
relationship across all data points.

 Locally Weighted Linear Regression: Fits multiple localized lines, adapting to


variations in different parts of the dataset.

For instance, in predicting housing prices, LWLR can handle neighborhoods with
distinct pricing trends better than global linear regression, which might oversimplify
the relationship.

Example of Locally Weighted Linear Regression

Let’s consider a scenario where we want to predict housing prices based on the size
of the house. In a dataset, the relationship between size and price might vary across
different neighborhoods due to local factors like amenities or location.

 Global Linear Regression: Fits a single line to the entire dataset, assuming a uniform
relationship between size and price across all neighborhoods. This may lead to
inaccurate predictions in areas where the relationship deviates.

128
 Locally Weighted Linear Regression: Focuses on the specific neighborhood by
giving more weight to houses closer in size to the query house. This results in a better
prediction tailored to the local trends.

Visualization

Imagine plotting housing prices against house sizes:

 A global regression line might cut straight through the dataset.

 LWLR would fit smaller localized lines that closely follow the variations in data for
each neighborhood.

This adaptability allows LWLR to model more complex relationships, such as sharp
changes in housing prices in specific regions.

Steps Involved in Locally Weighted Linear Regression

The process of Locally Weighted Linear Regression involves several key steps,
ensuring the model captures local patterns effectively.

1. Data Collection and Preparation

 Gather a dataset with relevant features and a target variable.

 Preprocess the data by handling missing values and normalizing features to ensure a
consistent scale, which improves the weighting process.

2. Choose the Kernel and Bandwidth (Tau)

 Kernel Function: Determines how weights are assigned to data points based on their
distance from the query point. Common choices include:

 Gaussian Kernel: Assigns weights using the formula:

w⁽ⁱ⁾ = exp( −(x⁽ⁱ⁾ − x)² / (2τ²) )

Here, w⁽ⁱ⁾ is the weight for data point x⁽ⁱ⁾, x is the query point, and τ (the bandwidth) controls the rate of weight decay.

 Bandwidth (Tau): A critical parameter that governs how localized the regression is:
 Small τ: Focuses on nearby points, capturing finer details but risking overfitting.
 Large τ: Includes more distant points, reducing variance but increasing bias.

3. Weight Calculation

For a given query point x, compute weights for all data points using the chosen kernel function. Points closer to x will have higher weights.

4. Model Fitting

Using the computed weights, fit a weighted least squares regression to the data. The goal is to minimize the weighted sum of squared errors:

J(θ) = ∑ᵢ w⁽ⁱ⁾ ( y⁽ⁱ⁾ − θᵀx⁽ⁱ⁾ )²

5. Prediction

Once the localized model is fitted, use it to predict the target value for the query
point.
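The five steps condense into a short NumPy routine. Below is a minimal sketch for a single feature with a Gaussian kernel; the function name and the sine-curve toy data are invented for illustration:

import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    # Design matrix with an intercept column
    Xb = np.c_[np.ones(len(X)), X]
    xq = np.array([1.0, x_query])

    # Step 3: Gaussian weights, so points near the query dominate the fit
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(w)

    # Step 4: weighted least squares, theta = (Xb'WXb)^-1 Xb'Wy
    theta = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y

    # Step 5: predict at the query point
    return xq @ theta

# Toy non-linear data: y = sin(x) plus noise
rng = np.random.default_rng(1)
X = np.linspace(0, 6, 60)
y = np.sin(X) + 0.1 * rng.standard_normal(60)

print(lwlr_predict(3.0, X, y))   # close to sin(3.0) ≈ 0.14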

131
Lecture: 23

3.11 Radial Basis Function Networks


Radial Basis Function (RBF) Neural Networks are a specialized type of Artificial
Neural Network (ANN) used primarily for function approximation tasks. Known for
their distinct three-layer architecture and universal approximation capabilities, RBF
Networks offer faster learning speeds and efficient performance in classification and
regression problems. This article delves into the workings, architecture, and
applications of RBF Neural Networks.

3.11.1 What are Radial Basis Functions?


Radial Basis Function (RBF) networks are a special category of feed-forward neural networks comprising three layers:
i. Input Layer: Receives input data and passes it to the hidden layer.
ii. Hidden Layer: The core computational layer where RBF neurons process the
data.
iii. Output Layer: Produces the network’s predictions, suitable for classification
or regression tasks.

3.11.2 How Do RBF Networks Work?


RBF Networks are conceptually similar to K-Nearest Neighbor (k-NN) models,
though their implementation is distinct. The fundamental idea is that an item's
predicted target value is influenced by nearby items with similar predictor variable
values. Here’s how RBF Networks operate:
i. Input Vector: The network receives an n-dimensional input vector that needs
classification or regression.
ii. RBF Neurons: Each neuron in the hidden layer represents a prototype vector
from the training set. The network computes the Euclidean distance between
the input vector and each neuron's center.
iii. Activation Function: The Euclidean distance is transformed using a Radial
Basis Function (typically a Gaussian function) to compute the neuron’s
activation value. This value decreases exponentially as the distance increases.
iv. Output Nodes: Each output node calculates a score based on a weighted sum
of the activation values from all RBF neurons. For classification, the category
with the highest score is chosen.
3.11.3 Key Characteristics of RBFs
i. Radial Basis Functions: These are real-valued functions dependent solely on
the distance from a central point. The Gaussian function is the most commonly
132
used type.
ii. Dimensionality: The network's dimensions correspond to the number of
predictor variables.
iii. Center and Radius: Each RBF neuron has a center and a radius (spread). The
radius affects how broadly each neuron influences the input space.
3.11.4 Architecture of RBF Networks
The architecture of an RBF Network typically consists of three layers:
i. Input Layer
Function: After receiving the input features, the input layer sends them
straight to the hidden layer.
Components: It is made up of the same number of neurons as the
characteristics in the input data. One feature of the input vector corresponds to
each neuron in the input layer.
ii. Hidden Layer
Function: This layer uses radial basis functions (RBFs) to conduct the non-
linear transformation of the input data.
Components: Neurons in the hidden layer apply the RBF to the incoming data. The Gaussian function is the RBF that is most frequently utilized.
iii. RBF Neurons: Every neuron in the hidden layer has a spread parameter (σ) and a center, which is also referred to as a prototype vector. The distance between the center of an RBF neuron and the input vector, modulated by the spread parameter, determines the neuron's output.
iv. Output Layer
Function: The output layer uses weighted sums to integrate the hidden layer
neurons' outputs to create the network's final output.
Components: It is made up of neurons that combine the outputs of the hidden
layer in a linear fashion. To reduce the error between the network's predictions
and the actual target values, the weights of these combinations are changed
during training.
3.11.5 Training Process of radial basis function neural network
An RBF neural network must be trained in three stages: choosing the centers, determining the spread parameters, and training the output weights.
Step 1: Selecting the Centers
 Techniques for Center Selection: Centers can be picked at random from the training data or by applying techniques such as k-means clustering.
 K-Means Clustering: In this widely used center-selection technique, the input data is grouped into k clusters, and the centers of these clusters are employed as the centers of the RBF neurons.
133
Step 2: Determining the Spread Parameters
 The spread parameter (σ) governs each RBF neuron's area of effect and establishes the width of the RBF.
 Calculation: The spread parameter can be manually adjusted for each neuron or set as a constant for all neurons. Setting σ based on the separation between the centers is a popular method, frequently accomplished with a heuristic such as dividing the greatest distance between centers by the square root of twice the number of centers.
Step 3: Training the Output Weights
 Linear Regression: The objective of linear regression techniques, which are
commonly used to estimate the output layer weights, is to minimize the error
between the anticipated output and the actual target values.
 Pseudo-Inverse Method: One popular technique for determining the weights is to utilize the pseudo-inverse of the hidden-layer outputs matrix.
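A minimal sketch of these three stages on invented toy data, using scikit-learn's KMeans for Step 1, the distance heuristic above for Step 2, and the pseudo-inverse for Step 3:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0])              # the target function to approximate

# Step 1: choose the centers with k-means clustering
k = 10
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

# Step 2: set the spread with the heuristic sigma = d_max / sqrt(2k)
d_max = np.max(np.abs(centers - centers.T))   # pairwise distances (1-D case)
sigma = d_max / np.sqrt(2 * k)

def rbf_features(X):
    # Hidden-layer activations: Gaussian RBF of the distance to each center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(dists ** 2) / (2 * sigma ** 2))

# Step 3: train the output weights with the pseudo-inverse
Phi = rbf_features(X)
w = np.linalg.pinv(Phi) @ y

print(rbf_features(np.array([[1.5]])) @ w)   # should be close to sin(1.5) ≈ 1.0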

134
Lecture: 24
3.12 Case-Based Learning

As we know, Nearest Neighbour classifiers store training tuples as points in Euclidean space. Case-Based Reasoning (CBR) classifiers, by contrast, use a database of problem solutions to solve new problems. They store the tuples or cases for problem-solving as complex symbolic descriptions.
How does CBR work? When a new case arises to classify, a Case-Based Reasoner will first check if an identical training case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the CBR will search for training cases having components that are similar to those of the new case. Conceptually, these training cases may be considered as neighbours of the new case. If cases are represented as graphs, this involves searching for subgraphs that are similar to subgraphs within the new case. The CBR tries to combine the solutions of the neighbouring training cases to propose a solution for the new case. If incompatibilities arise among the individual solutions, then backtracking to search for other solutions may be necessary. The CBR may employ background knowledge and problem-solving strategies to propose a feasible solution.
Applications of CBR include:
 Problem resolution for customer service help desks, where cases describe product-related diagnostic problems.
 Engineering and law, where cases are either technical designs or legal rulings, respectively.
 Medical education, where patient case histories and treatments are used to help diagnose and treat new patients.

3.12.1 Challenges with CBR

 Finding a good similarity metric (e.g., for matching subgraphs) and suitable
methods for combining solutions.
 Selecting salient features for indexing training cases and the development of
efficient indexing techniques.
CBR becomes more intelligent as the number of stored cases increases, and a trade-off
between accuracy and efficiency evolves as this number becomes very large: after a
certain point, the system's efficiency will suffer, as the time required to search for
and process relevant cases increases.

3.13 Important Questions (PYQs)
Q1: Explain ID3 Algorithm.
Q2: What is the limitation of Decision Tree?
Q3: Discuss why we use SVM Kernels and in which scenario which SVM kernel is used?
Q4: Discuss the various issues of Decision tree.
Q5: Explain instance based learning with representation?
Q6: How Locally Weighted Regression is different from Radial Basis function networks?
Q7: Explain KNN Algorithm with suitable example.
Q8: Differentiate between Lazy and Eager Learning
Q9: Illustrate the operation of the ID3 training example. Consider the information gain as
attribute measure.
Q10: What are the steps used for making Decision Tree?
Q11: Explain Attribute Selection Measures used in Decision Tree.
Q12: Explain Inductive Bias with Inductive System.
Q13: Explain Inductive Learning Algorithm. Which learning algorithms used in inductive bias?
Q14: What are the Performance Dimensions used for Instance based learning system?
Q15: Explain the Functions, Advantages and Disadvantages of Instance Based Learning.
Q16: Explain Locally Weighted Regression.
Q17: Explain the Architecture of Radial Basis Function Network.
Q18: What are the Functions, Advantages and Disadvantages of Case Based Learning System?
Q19: Describe Case Based Learning Cycle with Limitations, Benefits and Applications.
Q20: What are the Advantages and Disadvantages of KNN Algorithm?

UNIT 4 - Artificial Neural Networks

WHY    To introduce students to the concept of Artificial Neural Networks
(ANNs), explain their importance in machine learning, and
demonstrate their applications in real-world problems.

WHAT   Introduction to Artificial Neural Networks:
 What is an artificial neural network?
 A simple analogy to biological neural networks (brain neurons).
 History and evolution of neural networks.

WHERE  Utilise online tools or environments such as Jupyter Notebooks or
Google Colab for live coding demonstrations and hands-on practice
with neural networks.

Lecture: 25

4.1 Perceptron in Machine Learning


In Machine Learning and Artificial Intelligence, Perceptron is one of the most commonly
used terms. It is the primary step in learning Machine Learning and Deep Learning
technologies, and it consists of a set of weights, input values or scores, and a
threshold. The Perceptron is a building block of an Artificial Neural Network. The
Perceptron was invented by Mr. Frank Rosenblatt in 1957, in the mid-20th century, for
performing certain calculations to detect capabilities in input data. Perceptron is a linear
Machine Learning algorithm used for the supervised learning of various binary classifiers.
The algorithm enables neurons to learn elements and processes them one by one during
preparation. In this tutorial, "Perceptron in Machine Learning," we will discuss in-depth
knowledge of Perceptron and its basic functions in brief. Let's start with the basic
introduction of Perceptron.

4.1.1 What is the Perceptron model in Machine Learning?


Perceptron is a Machine Learning algorithm for the supervised learning of various binary
classification tasks. A Perceptron can also be understood as an Artificial Neuron, or neural
network unit, that helps to detect certain input data computations in business intelligence.

The Perceptron model is also treated as one of the best and simplest types of Artificial
Neural Networks. It is a supervised learning algorithm for binary classifiers. Hence, we
can consider it a single-layer neural network with four main parameters, i.e., input values,
weights and bias, net sum, and an activation function.

4.1.2 What is Binary classifier in Machine Learning?
In Machine Learning, binary classifiers are defined as functions that decide whether input
data, represented as a vector of numbers, belongs to some specific class.

Binary classifiers can be considered linear classifiers. In simple words, we can understand
a binary classifier as a classification algorithm whose prediction is based on a linear
predictor function combining a weight vector with the feature vector.

4.1.3 Basic Components of Perceptron


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains
three main components. These are as follows:

Figure 4.1: Structure of an Artificial Neuron

 Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.

 Weight and Bias:

The weight parameter represents the strength of the connection between units. This is
another important parameter of the Perceptron's components. Weight is directly
proportional to the strength of the associated input neuron in deciding the output. Further,
bias can be considered the intercept term in a linear equation.

 Activation Function:
These are the final and important components that help to determine whether the neuron will
fire or not. Activation Function can be considered primarily as a step function.

 Types of Activation functions:


i. Sign function
ii. Step function, and
iii. Sigmoid function

Figure 4.2: Activation Functions

The data scientist uses the activation function to take a subjective decision based on the
problem statement and to form the desired outputs. The activation function chosen (e.g.,
sign, step, or sigmoid) may differ between perceptron models depending on whether the
learning process is slow or suffers from vanishing or exploding gradients.

4.2 How does Perceptron work?

In Machine Learning, Perceptron is considered as a single-layer neural network that consists


of four main parameters named input values (Input nodes), weights and Bias, net sum, and
an activation function. The perceptron model begins with the multiplication of all input
values and their weights, then adds these values together to create the weighted sum. Then
this weighted sum is applied to the activation function 'f' to obtain the desired output. This
activation function is also known as the step function and is represented by 'f'.

Figure 4.3: Structure of ANN

This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is
indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift
the activation function curve up or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values with their corresponding weight values and then
add the products to determine the weighted sum. Mathematically, we can calculate the
weighted sum as follows:

∑wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:

Y = f(∑wi*xi + b)
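These two steps translate directly into code; a minimal sketch (the function name and
values below are illustrative):

import numpy as np

def perceptron_output(x, w, b):
    # Step 1: weighted sum  ∑ wi*xi + b
    weighted_sum = np.dot(w, x) + b
    # Step 2: step activation function f
    return 1 if weighted_sum > 0 else 0

x = np.array([1.0, 0.5])
w = np.array([0.4, -0.2])
b = 0.1
print(perceptron_output(x, w, b))   # -> 1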

4.3 Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as follows:

I. Single-layer Perceptron Model


II. Multi-layer Perceptron model

i. Single Layer Perceptron Model:


This is one of the easiest types of Artificial Neural Networks (ANNs). A single-layered
perceptron model consists of a feed-forward network and also includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is
to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm does not rely on previously recorded
data, so it begins with randomly allocated values for the weight parameters. It then sums
up all the weighted inputs. If the total sum is more than a pre-determined threshold value,
the model gets activated and shows the output value as +1.
If the outcome matches the pre-determined (threshold) value, the performance of the
model is stated as satisfactory, and the weights remain unchanged. However, the model
shows discrepancies when multiple weighted input values are fed into it; hence, to find
the desired output and minimize errors, some changes to the weights are necessary.
"Single-layer perceptron can learn only linearly separable patterns."
ii. Multi-Layered Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model has the same basic
model structure but with a greater number of hidden layers.
The multi-layer perceptron model is trained with the Backpropagation algorithm,
which executes in two stages as follows:
Forward Stage: Activations propagate from the input layer in the forward stage and
terminate on the output layer.
Backward Stage: In the backward stage, weight and bias values are modified as per
the model's requirement. The error between the actual and desired output is propagated
backward, starting at the output layer and ending at the input layer.
Hence, a multi-layered perceptron model can be considered as multiple artificial neural
network layers stacked together, in which the activation function need not remain linear,
unlike in a single-layer perceptron model. Instead of a linear function, activation functions
such as sigmoid, TanH, ReLU, etc., can be used for deployment.
A multi-layer perceptron model has greater processing power and can process linear
and non-linear patterns. Further, it can also implement logic gates such as AND, OR,
XOR, NAND, NOT, XNOR, NOR.

4.3.1 Advantages of Multi-Layer Perceptron:
A multi-layered perceptron model can be used to solve complex non-linear problems.
It works well with both small and large input data.
It helps us to obtain quick predictions after the training.
It helps to obtain the same accuracy ratio with large as well as small data.
4.3.2 Disadvantages of Multi-Layer Perceptron:
In multi-layer perceptron, computations are difficult and time-consuming.
In multi-layer Perceptron, it is difficult to predict how much the dependent variable affects
each independent variable.
The model functioning depends on the quality of the training.
 Perceptron Function
Perceptron function ''f(x)'' can be achieved as output by multiplying the input 'x' with the
learned weight coefficient 'w'.
Mathematically, we can express it as follows:
f(x)=1; if w.x+b>0
otherwise, f(x)=0
'w' represents real-valued weights vector
'b' represents the bias
'x' represents a vector of input x values.
 Characteristics of Perceptron
The perceptron model has the following characteristics.
Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
In Perceptron, the weight coefficient is automatically learned.
Initially, weights are multiplied with input features, and the decision is made whether the
neuron is fired or not.
The activation function applies a step rule to check whether the weighted sum is greater
than zero.
The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
If the added sum of all input values is more than the threshold value, it must have an output
signal; otherwise, no output will be shown.
 Limitations of Perceptron Model

A perceptron model has limitations as follows:
The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer
function.
Perceptron can only be used to classify the linearly separable sets of input vectors. If input
vectors are non-linear, it is not easy to classify them properly.
 Future of Perceptron

The future of the Perceptron model is bright and significant, as it helps to interpret data
by building intuitive patterns and applying them in the future. Machine learning is a rapidly
growing technology of Artificial Intelligence that is continuously evolving and in the
developing phase; hence, perceptron technology will continue to support and
facilitate analytical behavior in machines, which will, in turn, add to the efficiency of
computers.

The perceptron model is continuously becoming more advanced and working efficiently on
complex problems with the help of artificial neurons.

Lecture: 26
4.4 Gradient Descent in Machine Learning
Gradient Descent is known as one of the most commonly used optimization algorithms to
train machine learning models by means of minimizing errors between actual and expected
results. Further, gradient descent is also used to train Neural Networks.

In mathematical terminology, Optimization algorithm refers to the task of


minimizing/maximizing an objective function f(x) parameterized by x. Similarly, in machine
learning, optimization is the task of minimizing the cost function parameterized by the
model's parameters. The main objective of gradient descent is to minimize the convex
function using iteration of parameter updates. Once these machine learning models are
optimized, these models can be used as powerful tools for Artificial Intelligence and various
computer science applications.

In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about
gradient descent, the role of cost functions specifically as a barometer within Machine
Learning, types of gradient descents, learning rates, etc.

4.4.1 What is Gradient Descent or Steepest Descent?


Gradient descent was initially proposed by Augustin-Louis Cauchy in 1847, in the mid-19th
century. Gradient Descent is defined as one of the most commonly used iterative
optimization algorithms of machine learning, used to train machine learning and deep
learning models. It helps in finding the local minimum of a function.

The best way to define the local minimum or local maximum of a function using gradient
descent is as follows:

If we move towards a negative gradient or away from the gradient of the function at the
current point, it will give the local minimum of that function.

Whenever we move towards a positive gradient or towards the gradient of the function at
the current point, we will get the local maximum of that function.

Figure 4.4: Gradient Descent

This entire procedure is known as Gradient Descent, which is also known as steepest
descent (the opposite procedure, moving towards a positive gradient, is gradient ascent).
The main objective of using a gradient descent algorithm is to minimize the cost
function using iteration. To achieve this goal, it performs two steps iteratively:

Calculates the first-order derivative of the function to compute the gradient or slope of that
function.

Moves away from the direction of the gradient, i.e., steps from the current point by alpha
times the gradient, where alpha is defined as the Learning Rate. It is a tuning parameter in
the optimization process which helps to decide the length of the steps.

4.4.2 What is Cost-function?


The cost function is defined as the measurement of difference or error between actual values
and expected values at the current position and present in the form of a single real number. It
helps to increase and improve machine learning efficiency by providing feedback to this
model so that it can minimize error and find the local or global minimum. Further, it
continuously iterates along the direction of the negative gradient until the cost function
approaches zero. At this steepest descent point, the model will stop learning further.
Although the terms cost function and loss function are often considered synonymous, there
is a minor difference between them. The slight difference concerns the scope of the error
within the training of machine learning models: the loss function refers to the error of one
training example, while the cost function calculates the average error across an entire
training set.

The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to reduce
the cost function. For simple linear regression, the standard formulation is:

Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost function: J(θ0, θ1) = (1/2m) ∑i=1..m (hθ(x(i)) − y(i))²
Goal: minimize J(θ0, θ1)

4.4.3 How does Gradient Descent work?


Before starting the working principle of gradient descent, we should know some basic
concepts to find out the slope of a line from linear regression. The equation for simple linear
regression is given as:

Y=mX+c

Where 'm' represents the slope of the line, and 'c' represents the intercepts on the y-axis.

Figure 4.5: Point of Convergence

The starting point (shown in the above figure) is just an arbitrary point used to evaluate the
performance. At this starting point, we derive the first derivative, or slope, and use a tangent
line to calculate its steepness. This slope informs the updates to the parameters (weights
and bias).

The slope is steepest at the starting (arbitrary) point; as new parameters are generated, the
steepness gradually reduces until it approaches the lowest point, which is called the point
of convergence.

The main objective of gradient descent is to minimize the cost function, or the error between
the expected and actual output. To minimize the cost function, two factors are required:

4.4.4 Direction & Learning Rate


These two factors are used to determine the partial derivative calculation of future iteration
and allow it to the point of convergence or local minimum or global minimum. Let's discuss
learning rate factors in brief;

4.4.5 Learning Rate:


It is defined as the step size taken to reach the minimum or lowest point. This is typically a
small value that is evaluated and updated based on the behavior of the cost function. If the
learning rate is high, it results in larger steps but also leads to risks of overshooting the
minimum. At the same time, a low learning rate shows the small step sizes, which
compromises overall efficiency but gives the advantage of more precision.

Figure 4.6: Learning Rate
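The whole procedure above can be sketched for the simple linear regression Y = mX + c in
NumPy as follows (a minimal sketch; the variable names and the choice of mean squared
error cost are illustrative):

import numpy as np

def gradient_descent(X, y, alpha=0.01, epochs=1000):
    m, c = 0.0, 0.0                     # arbitrary starting point
    n = len(X)
    for _ in range(epochs):
        y_pred = m * X + c
        # first-order derivatives of the MSE cost with respect to m and c
        dm = (-2.0 / n) * np.sum(X * (y - y_pred))
        dc = (-2.0 / n) * np.sum(y - y_pred)
        # step in the negative gradient direction, scaled by the learning rate
        m -= alpha * dm
        c -= alpha * dc
    return m, c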

4.4.6 Types of Gradient Descent


Based on the error in various training models, the Gradient Descent learning algorithm can
be divided into Batch gradient descent, stochastic gradient descent, and mini-batch gradient
descent. Let's understand these different types of gradient descent:

i. Batch Gradient Descent:

Batch gradient descent (BGD) is used to find the error for each point in the training set and
update the model after evaluating all training examples. This procedure is known as the
training epoch. In simple words, it is a greedy approach where we have to sum over all
examples for each update.

 Advantages of Batch gradient descent:

It produces less noise in comparison to other types of gradient descent.

It produces stable gradient descent convergence.

It is computationally efficient, as all resources are used to process all training samples
together.

ii. Stochastic gradient descent

Stochastic gradient descent (SGD) is a type of gradient descent that runs one training
example per iteration. In other words, it processes a training epoch for each example
within a dataset and updates each training example's parameters one at a time. As it requires
only one training example at a time, it is easier to store in allocated memory. However,
it shows some loss of computational efficiency in comparison to batch gradient descent, as
its frequent updates are more costly in aggregate. Further, due to these frequent updates, it
is also treated as a noisy gradient. However, the noise can sometimes be helpful in finding
the global minimum and escaping local minima.

 Advantages of Stochastic gradient descent:

In Stochastic gradient descent (SGD), learning happens on every example, and it consists of
a few advantages over other gradient descent.

i. It is easier to allocate in desired memory.


ii. It is relatively fast to compute than batch gradient descent.
iii. It is more efficient for large datasets.

iii. MiniBatch Gradient Descent:

Mini-batch gradient descent is the combination of both batch gradient descent and stochastic
gradient descent. It divides the training dataset into small batches and then performs the
updates on those batches separately. Splitting the training dataset into smaller batches strikes
a balance between the computational efficiency of batch gradient descent and the speed of
stochastic gradient descent. Hence, we can achieve a special type of gradient descent with
higher computational efficiency and a less noisy gradient.

 Advantages of Mini Batch gradient descent:


i. It is easier to fit in allocated memory.
ii. It is computationally efficient.
iii. It produces stable gradient descent convergence.
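For comparison with the batch version shown earlier, a mini-batch update loop for the same
linear model might look like this (a sketch; batch_size and the shuffling scheme are
illustrative choices):

import numpy as np

def minibatch_gd(X, y, alpha=0.01, epochs=100, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    m, c = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))           # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            y_pred = m * Xb + c
            # gradient computed on the current batch only
            m -= alpha * (-2.0 / len(Xb)) * np.sum(Xb * (yb - y_pred))
            c -= alpha * (-2.0 / len(Xb)) * np.sum(yb - y_pred)
    return m, c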

 Challenges with the Gradient Descent

Although we know Gradient Descent is one of the most popular methods for optimization
problems, it still also has some challenges. There are a few challenges as follows:

i. Local Minima and Saddle Point:

For convex problems, gradient descent can find the global minimum easily, while for non-
convex problems, it is sometimes difficult to find the global minimum, where the machine
learning models achieve the best results.

Figure 4.7: Saddle Point and Local Minima

Whenever the slope of the cost function is at zero or just close to zero, the model stops
learning. Apart from the global minimum, there are other scenarios that can show this
slope: saddle points and local minima. A local minimum has a shape similar to the global
minimum, with the slope of the cost function increasing on both sides of the current
point.

In contrast, with saddle points, the negative gradient occurs on only one side of the point,
which reaches a local maximum on one side and a local minimum on the other side. The
name of a saddle point is taken from that of a horse's saddle.

The name local minimum is used because the value of the loss function is minimum at that
point within a local region. In contrast, the name global minimum is given because the
value of the loss function is minimum there globally, across the entire domain of the loss
function.

ii. Vanishing and Exploding Gradient

In a deep neural network, if the model is trained with gradient descent and backpropagation,
there can occur two more issues other than local minima and saddle point.

iii. Vanishing Gradients:

A vanishing gradient occurs when the gradient is smaller than expected. During
backpropagation, the gradient becomes progressively smaller, causing the earlier layers of
the network to learn more slowly than the later layers. When this happens, the weight
updates become insignificant and the earlier layers effectively stop learning.

iv. Exploding Gradient:

The exploding gradient is just the opposite of the vanishing gradient, as it occurs when the
gradient is too large and creates an unstable model. In this scenario, the model weights grow
too large and may eventually be represented as NaN. This problem can be mitigated with
techniques such as gradient clipping or weight regularization, which limit the size of the
updates within the model.

Figure 4.7: Gradient Descent


Lecture: 27

4.5 Multilayer Networks


To be accurate, a fully connected multi-layered neural network is known as a Multi-
Layer Perceptron (MLP). A multi-layered neural network consists of multiple layers of
artificial neurons or nodes. Unlike single-layer neural networks, most networks in
recent times are multi-layered. The following diagram is a visualization of a multi-layer
neural network.

Figure 4.8: Multi-Layer Network

Here the nodes marked as "1" are known as bias units. The leftmost layer, or Layer 1, is
the input layer; the middle layer, or Layer 2, is the hidden layer; and the rightmost layer,
or Layer 3, is the output layer. We can say that the above diagram has 3 input units
(leaving out the bias unit), 1 output unit, and 4 hidden units (the 1 bias unit is not
included).

A multi-layered neural network is a typical example of a Feed-Forward Neural Network.
The number of neurons and the number of layers constitute the hyperparameters of the
neural network, and these need tuning. In order to find ideal values for the
hyperparameters, one must use some cross-validation techniques. Using the
back-propagation technique, weight-adjustment training is carried out.

4.5.1 Formula for Multi-Layered Neural Network

Suppose we have n inputs (x1, x2, …, xn) and a bias unit. Let the weights applied be
w1, w2, …, wn. Then find the summation by performing the dot product among inputs and
weights and adding the bias unit:

r = ∑i=1..n wi*xi + bias

On feeding r into the activation function F(r), we find the output for the hidden layer.
For the first neuron of hidden layer h1, the output can be calculated as:

h11 = F(r)

Repeat the same procedure for all the other hidden layers. Keep repeating the process
until the last weight set is reached.
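This repeated summation-and-activation can be sketched as a forward pass (a minimal
sketch using a sigmoid as F; the names are illustrative):

import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def forward_pass(x, weights, biases):
    # weights[l] is the weight matrix and biases[l] the bias vector of layer l
    a = x
    for W, b in zip(weights, biases):
        r = W @ a + b      # r = ∑ wi*xi + bias
        a = sigmoid(r)     # h = F(r)
    return a               # activation of the output layer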

Lecture: 28

4.6 Derivation of Backpropagation

Figure 4.9: Backpropagation Flow

Conceptually, a network forward propagates activation to produce an output, and it
backward propagates error to determine weight changes (as shown in Figure 4.9). The
weights on the connections between neurons mediate the passed values in both
directions.

The Backpropagation algorithm is used to learn the weights of a multilayer neural


network with a fixed architecture. It performs gradient descent to try to minimize the
sum squared error between the network’s output values and the given target values.

The figure below depicts the network components which affect a particular weight change.
Notice that all the necessary components are locally related to the weight being
updated. This is one feature of backpropagation that seems biologically plausible.
However, brain connections appear to be unidirectional and not bidirectional as
would be required to implement backpropagation.

4.6.1 Notation

For the purpose of this derivation, we will use the following notation:

The subscript k denotes the output layer.

The subscript j denotes the hidden layer.

The subscript i denotes the input layer.

Figure 4.9: Neurons. The change to a hidden-to-output weight depends on error (depicted
as a lined pattern) at the output node and activation (depicted as a solid pattern) at the
hidden node, while the change to an input-to-hidden weight depends on error at the
hidden node (which in turn depends on error at all the output nodes) and activation at the
input node.

• wkj denotes a weight from the hidden to the output layer.


• wji denotes a weight from the input to the hidden layer.
• a denotes an activation value.
• t denotes a target value.
• net denotes the net input.

4.7 Review of Calculus Rules

4.7.1 Gradient Descent on Error


We can motivate the backpropagation learning algorithm as gradient descent on sum-
squared error (we square the error because we are interested in its magnitude, not its sign).
The total error in a network is given by the following equation (the 1/2 will simplify things
later).
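With tk denoting the target value and ak the activation of output unit k, the standard sum-
squared error is:

E = (1/2) ∑k (tk − ak)²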

We want to adjust the network’s weights to reduce this overall error.

We will begin at the output layer with a particular weight.

However, error is not directly a function of a weight. We expand this as follows.
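By the chain rule, the dependence of E on a hidden-to-output weight wkj passes through
the output activation ak and the net input netk:

∂E/∂wkj = (∂E/∂ak) · (∂ak/∂netk) · (∂netk/∂wkj)

and the weight is changed in proportion to the negative of this derivative,
Δwkj = −η ∂E/∂wkj, where η is the learning rate.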

Let’s consider each of these partial derivatives in turn. Note that only one term of the E
summation will have a non-zero derivative: the one associated with the particular weight we
are considering.

4.7.2 Derivative of the error with respect to the activation
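Differentiating the sum-squared error with respect to ak gives:

∂E/∂ak = −(tk − ak)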

Now we see why the 1/2 in the E term was useful: it cancels the factor of 2 that appears
when differentiating the square.

4.7.3 Derivative of the activation with respect to the net input
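For the sigmoid activation function, ak = 1 / (1 + e^(−netk)), and direct differentiation
gives:

∂ak/∂netk = e^(−netk) / (1 + e^(−netk))²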

We’d like to be able to rewrite this result in terms of the activation function. Notice
that:
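1 − ak = e^(−netk) / (1 + e^(−netk))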

Using this fact, we can rewrite the result of the partial derivative as:
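∂ak/∂netk = ak (1 − ak) = f′(netk)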

4.7.4 Derivative of the net input with respect to a weight


Note that only one term of the net summation will have a non-zero derivative: again the one
associated with the particular weight we are considering.
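Since netk = ∑j wkj aj, the only surviving term gives:

∂netk/∂wkj = aj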

156
4.7.5 Weight change rule for a hidden to output weight
Now substituting these results back into our original equation, we have:
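Δwkj = −η · [−(tk − ak)] · f′(netk) · aj = η (tk − ak) f′(netk) aj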

Notice that this looks very similar to the Perceptron Training Rule. The only difference is
the inclusion of the derivative of the activation function. This equation is typically
simplified as shown below, where the δ term represents the product of the error with the
derivative of the activation function.
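δk = (tk − ak) f′(netk),  so that  Δwkj = η δk aj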

4.7.6 Weight change rule for an input to hidden weight

Now we have to determine the appropriate weight change for an input to hidden weight.
This is more complicated because it depends on the error at all of the nodes this weighted
connection can lead to.

Lecture: 29
4.8 Generalization
Generalization is a fundamental concept in machine learning (ML) and artificial
intelligence (AI). It refers to a model's capacity to function well with fresh, previously
unknown data that was not part of the training dataset. Generalization rules in AI enable
models to make correct predictions and judgments based on the information gathered
from training data. These criteria ensure that models learn the underlying patterns and
relationships in the data rather than memorizing individual samples. By focusing on
generalization, AI models can apply what they've learnt to a variety of settings, increasing
their efficacy and reliability.
4.8.1 Difference Between Memorization and Generalization

When a model learns training data so well that it performs very well on it but is unable to
apply this knowledge to fresh data, this is known as memorization. On the other hand, a
well-generalizing model can deduce and forecast results for data points it hasn't seen in
training.
4.8.2 Generalization vs. Overfitting

When a model learns so much from the noise and details in the training set that it becomes
unreliable on new data, this is known as overfitting. Since the objective of generalization
is to develop models that continue to perform well on both seen and unseen data, this is a
crucial problem.
4.8.3 Theoretical Foundations of Generalization

 Statistical Learning Theory: The theory of statistical learning offers a framework to
comprehend how and why algorithms generalize. It involves ideas such as empirical risk
minimization (minimizing training error) and structural risk minimization (balancing
model complexity and training error).

 Bias-Variance Trade-off: Understanding the bias-variance trade-off is crucial for
understanding generalization. High bias leads to an underfitting model, in which the
details of the data are not well captured. High variance can result in overfitting, in which
the model is very intricate and records noise. Effective generalization seeks an ideal
equilibrium between bias and variance.

 Occam's Razor in Model Selection: According to Occam's Razor, simpler models are
better as long as they function adequately. It suggests that AI models should avoid
unnecessary complexity in order to improve generalization.
4.9 Self Organizing Maps
A Self-Organizing Map (or Kohonen Map, or SOM) is a type of Artificial Neural Network
which is inspired by biological models of neural systems from the 1970s. It follows an
unsupervised learning approach and trains its network through a competitive learning
algorithm. SOM is used for clustering and mapping (or dimensionality reduction)
techniques to map multidimensional data onto a lower-dimensional space, which allows
people to reduce complex problems to an easily interpretable form. SOM has two layers:
one is the Input layer and the other is the Output layer.

The architecture of the Self Organizing Map with two clusters and n input features of any
sample is given below:

Figure 4.10: SOM

4.9.1 How does a SOM work?

Let's say we have input data of size (m, n), where m is the number of training examples
and n is the number of features in each example. First, the network initializes weights of
size (n, C), where C is the number of clusters. Then, iterating over the input data, for each
training example, it updates the winning vector (the weight vector with the shortest
distance, e.g. Euclidean distance, from the training example). The weight-update rule is
given by:

wij = wij(old) + alpha(t) * (xik - wij(old))

where alpha is a learning rate at time t, j denotes the winning vector, i denotes the
ith feature of training example and k denotes the kth training example from the input data.
After training the SOM network, trained weights are used for clustering new examples.

A new example falls in the cluster of winning vectors.
4.9.2 Algorithm

Training:

i. Step 1: Initialize the weights wij; small random values may be assumed. Initialize the
learning rate α.

ii. Step 2: Calculate squared Euclidean distance.

i. D(j) = ∑ (wij − xi)², where i = 1 to n and j = 1 to m

iii. Step 3: Find index J, when D(j) is minimum that will be considered as winning
index.

iv. Step 4: For each unit j within a specific neighborhood of J, and for all i, calculate the
new weight:

i. wij(new)=wij(old) + α[xi – wij(old)]

v. Step 5: Update the learning rate by using:

i. α(t+1) = 0.5 * α(t)

vi. Step 6: Test the Stopping Condition.
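The six steps above can be sketched directly in NumPy (a minimal sketch; a fixed epoch
budget stands in for the stopping condition, and the names are illustrative):

import numpy as np

def train_som(X, n_clusters=2, alpha=0.5, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[1], n_clusters))        # Step 1: random weights
    for _ in range(epochs):                          # Step 6: fixed epoch budget
        for x in X:
            D = ((W - x[:, None]) ** 2).sum(axis=0)  # Step 2: squared distances
            J = int(np.argmin(D))                    # Step 3: winning index
            W[:, J] += alpha * (x - W[:, J])         # Step 4: update the winner
        alpha *= 0.5                                 # Step 5: decay learning rate
    return W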

Lecture: 30
4.10 Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN), also known as ConvNet, is a specialized type of deep
learning algorithm mainly designed for tasks that necessitate object recognition, including image
classification, detection, and segmentation. CNNs are employed in a variety of practical
scenarios, such as autonomous vehicles, security camera systems, and others.

4.10.1 The importance of CNNs


There are several reasons why CNNs are important in the modern world, as highlighted below:

CNNs are distinguished from classic machine learning algorithms such as SVMs and decision
trees by their ability to autonomously extract features at a large scale, bypassing the need for
manual feature engineering and thereby enhancing efficiency.

The convolutional layers grant CNNs their translation-invariant characteristics, empowering


them to identify and extract patterns and features from data irrespective of variations in position,
orientation, scale, or translation.

A variety of pre-trained CNN architectures, including VGG-16, ResNet50, Inceptionv3, and


EfficientNet, have demonstrated top-tier performance. These models can be adapted to new tasks
with relatively little data through a process known as fine-tuning.

Beyond image classification tasks, CNNs are versatile and can be applied to a range of other
domains, such as natural language processing, time series analysis, and speech recognition.

4.10.2 Inspiration Behind CNN and Parallels With The Human Visual System
Convolutional neural networks were inspired by the layered architecture of the human visual
cortex, and below are some key similarities and differences:

Figure 4.11: Biologically inspired CNN

i. Hierarchical architecture: Both CNNs and the visual cortex have a hierarchical
structure, with simple features extracted in early layers and more complex features built
up in deeper layers. This allows increasingly sophisticated representations of visual
inputs.
ii. Local connectivity: Neurons in the visual cortex only connect to a local region of the
input, not the entire visual field. Similarly, the neurons in a CNN layer are only connected
to a local region of the input volume through the convolution operation. This local
connectivity enables efficiency.
iii. Translation invariance: Visual cortex neurons can detect features regardless of their
location in the visual field. Pooling layers in a CNN provide a degree of translation
invariance by summarizing local features.
iv. Multiple feature maps: At each stage of visual processing, there are many different
feature maps extracted. CNNs mimic this through multiple filter maps in each convolution
layer.
v. Non-linearity: Neurons in the visual cortex exhibit non-linear response properties. CNNs
achieve non-linearity through activation functions like ReLU applied after each
convolution.
vi. CNNs mimic the human visual system but are simpler, lacking its complex feedback
mechanisms and relying on supervised learning rather than unsupervised, driving
advances in computer vision despite these differences.

4.10.3 Key Components of a CNN
The convolutional neural network is made of four main parts.

But how do CNNs learn with those parts?

They help the CNNs mimic how the human brain operates to recognize patterns and features
in images:

 Convolutional layers
 Rectified Linear Unit (ReLU for short)
 Pooling layers
 Fully connected layers
This section dives into the definition of each one of these components through the example
of the classification of a handwritten digit.

Figure 4.12: Architecture of CNN applied to digit recognition

Lecture: 31

4.11 Convolution layers

This is the first building block of a CNN. As the name suggests, the main mathematical task
performed is called convolution, which is the application of a sliding window function to a matrix
of pixels representing an image. The sliding function applied to the matrix is called kernel or
filter, and both can be used interchangeably.

In the convolution layer, several filters of equal size are applied, and each filter is used to
recognize a specific pattern from the image, such as the curving of the digits, the edges, the whole
shape of the digits, and more.

Put simply, in the convolution layer, we use small grids (called filters or kernels) that move over
the image. Each small grid is like a mini magnifying glass that looks for specific patterns in the
photo, like lines, curves, or shapes. As it moves across the photo, it creates a new grid that
highlights where it found these patterns.

For example, one filter might be good at finding straight lines, another might find curves, and so
on. By using several different filters, the CNN can get a good idea of all the different patterns
that make up the image.

Let’s consider this 32x32 grayscale image of a handwritten digit. The values in the matrix are
given for illustration purposes.

Figure 4.13: Illustration of the input image and its pixel representation

Also, let's consider the kernel used for the convolution. It is a matrix with a dimension of
3x3. The weights of each element of the kernel are represented in the grid: zero weights are
shown in the black grids and ones in the white grids.
4.11.1 Do we have to manually find these weights?
In real life, the weights of the kernels are determined during the training process of the neural
network.

Using these two matrices, we can perform the convolution operation by applying the dot product,
and work as follows:

i. Apply the kernel matrix from the top-left corner to the right.
ii. Perform element-wise multiplication.
iii. Sum the values of the products.
iv. The resulting value corresponds to the first value (top-left corner) in the convoluted matrix.
v. Move the kernel down with respect to the size of the sliding window.
vi. Repeat steps 1 to 5 until the image matrix is fully covered.
vii. The dimension of the convoluted matrix depends on the size of the sliding window: the
larger the sliding window, the smaller the dimension.

Figure 4.14: Application of the convolution task using a stride of 1 with 3x3 kernel
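The seven steps above can be sketched directly in NumPy (a minimal sketch; the function
name is illustrative):

import numpy as np

def convolve2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiplication of the patch and kernel, then sum
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out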

Another name associated with the kernel in the literature is feature detector because the weights
can be fine-tuned to detect specific features in the input image.

For instance:

i. An averaging-neighboring-pixels kernel can be used to blur the input image.
ii. A subtracting-neighboring-pixels kernel is used to perform edge detection.
The more convolution layers the network has, the better it is at detecting more abstract
features.

4.11.2 Activation function


A ReLU activation function is applied after each convolution operation. This function helps the
network learn non-linear relationships between the features in the image, hence making the
network more robust for identifying different patterns. It also helps to mitigate the vanishing
gradient problems.

4.11.3 Pooling layer


The goal of the pooling layer is to pull the most significant features from the convoluted matrix.
This is done by applying some aggregation operations, which reduce the dimension of the feature
map (convoluted matrix), hence reducing the memory used while training the network. Pooling
is also relevant for mitigating overfitting.

The most common aggregation functions that can be applied are:

i. Max pooling, which takes the maximum value of each region of the feature map

ii. Sum pooling, which corresponds to the sum of all the values of each region
iii. Average pooling, which is the average of all the values of each region.

Below is an illustration of max pooling, one of the previous examples:

Figure 4.15: Application of max pooling with a stride of 2 using a 2x2 filter

Also, the dimension of the feature map becomes smaller as the pooling function is applied.

The last pooling layer flattens its feature map so that it can be processed by the fully connected
layer.
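Max pooling can be sketched the same way as the convolution above (a minimal sketch; the
2x2 window with stride 2 matches the figure):

import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # keep only the largest value in each window
            out[i, j] = feature_map[i*stride:i*stride+size,
                                    j*stride:j*stride+size].max()
    return out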

4.11.4 Fully connected layers
These layers are in the last layer of the convolutional neural network, and their inputs correspond
to the flattened one-dimensional matrix generated by the last pooling layer. ReLU activations
functions are applied to them for non-linearity.

Finally, a softmax prediction layer is used to generate probability values for each of the possible
output labels, and the final label predicted is the one with the highest probability score.
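Putting the four components together, a small digit-classification CNN of this general
shape could be written with Keras roughly as follows (a sketch, not the exact architecture
of Figure 4.12; the layer sizes are illustrative choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           input_shape=(32, 32, 1)),   # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),              # pooling
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                         # flatten for dense layers
    tf.keras.layers.Dense(64, activation='relu'),      # fully connected layer
    tf.keras.layers.Dense(10, activation='softmax'),   # one probability per digit
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])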

4.11.5 Overfitting and Regularization in CNNs


Overfitting is a common challenge in machine learning models and CNN deep learning
projects. It happens when the model learns the training data too well ("learning by heart"),
including its noise and outliers. Such learning leads to a model that performs well on the
training data but badly on new, unseen data.

This can be observed when the performance on validation or testing data is much lower
than the performance on the training data; a graphical illustration is given below:

Figure 4.16: Underfitting v/s Overfitting

Deep learning models, especially Convolutional Neural Networks (CNNs), are particularly
susceptible to overfitting due to their capacity for high complexity and their ability to learn
detailed patterns in large-scale data.

Lecture: 32

4.11.6 Seven strategies to mitigate overfitting in CNNs


Several regularization techniques can be applied to mitigate overfitting in CNNs, and some are
illustrated below:

Figure 4.17: Seven strategies to mitigate overfitting in CNNs

I. Dropout: This consists of randomly dropping some neurons during the training process,
which forces the remaining neurons to learn new features from the input data.
II. Batch normalization: Overfitting is reduced to some extent by normalizing the inputs of
each layer, adjusting and scaling the activations. This approach is also used to speed up and
stabilize the training process.
III. Pooling Layers: This can be used to reduce the spatial dimensions of the input image to
provide the model with an abstracted form of representation, hence reducing the chance of
overfitting.
IV. Early stopping: This consists of consistently monitoring the model’s performance on
validation data during the training process and stopping the training whenever the validation
error does not improve anymore.
V. Noise injection: This process consists of adding noise to the inputs or the outputs of
hidden layers during training to make the model more robust and prevent weak
generalization.
VI. L1 and L2 regularization: Both L1 and L2 are used to add a penalty to the loss function
based on the size of the weights. More specifically, L1 encourages the weights to be sparse,
leading to better feature selection. On the other hand, L2 (also called weight decay)
encourages the weights to be small, preventing them from having too much influence on the
predictions.
VII. Data augmentation: This is the process of artificially increasing the size and diversity of the
training dataset by applying random transformations like rotation, scaling, flipping, or
cropping to the input images.
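Several of these strategies map directly onto standard Keras building blocks; an illustrative
sketch (the specific rates and patience values here are arbitrary choices):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(
        32, (3, 3), activation='relu', input_shape=(32, 32, 1),
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # VI: L2 weight decay
    tf.keras.layers.BatchNormalization(),                    # II: batch normalization
    tf.keras.layers.MaxPooling2D((2, 2)),                    # III: pooling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                            # I: dropout
    tf.keras.layers.Dense(10, activation='softmax'),
])
early_stop = tf.keras.callbacks.EarlyStopping(               # IV: early stopping
    monitor='val_loss', patience=5, restore_best_weights=True)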

4.11.7 Practical Applications of CNNs


Convolutional Neural Networks have revolutionized the field of computer vision, leading to
significant advancements in many real-world applications. Below are a few examples of how
they are applied.

Figure 4.18: Some practical applications of CNNs

 Image classification: Convolutional neural networks are used for image categorization,
where images are assigned to predefined categories. One use of such a scenario is automatic
photo organization on social media platforms.
 Object detection: CNNs are able to identify and locate multiple objects within an image.
This capability is crucial in scenarios such as shelf scanning in retail to identify out-of-stock
items.
 Facial recognition: This is also one of the main application areas of CNNs. For instance,
the technology can be embedded into security systems for efficient access control based on
facial features.

4.12 Important Questions (PYQs)
Q1: Explain different layers of CNN (Convolutional network) with suitable examples.

Q2: What is Self-Organizing Map (SOM)? Explain the stages and steps in SOM Algorithm.

Q3: Explain Gradient Descent and delta rule.

Q4: What are Neural Networks? What are the types of Neural Networks?

Q5: Discuss the benefits of Artificial Neural Networks.

Q6: Explain Back propagation Algorithm.

Q7: Write a short note on Unsupervised Learning.

Q8: Discuss the role of Activation function in neural networks. Also discuss various types
of activation functions with formulas and diagrams.

Q9: Describe Artificial Neural Networks (ANN) with different Layers and its
characteristics.

Q10: What are the Advantages and Disadvantages of ANN? Explain the application areas
of ANN?

Q11: Explain the Architecture and different types of Neuron.

Q12: Explain different types of Gradient Descent with advantages and disadvantages.

Q13: Explain generalized Delta Learning Rule.

Q14: Explain Perceptron with single Flow Graph.

Q15: State and Prove Perceptron Convergence Theorem.

Q16: Explain Multilayer Perceptron with its Architecture and Characteristics.

Q17: Discuss selection of various parameters in Back propagation Neural Network (BPN)
and its effects.

Q18: Describe the Architecture, Limitations, Advantages and Disadvantages of Deep


Learning with various Applications.

Q19: Explain 1D and 2D Convolutional Neural Network.

Q20: Describe Diabetic Retinopathy on the basis of Deep Learning.

UNIT 5 - Reinforcement Learning

WHY    To understand the basics of Reinforcement Learning and the learning task.
 To understand the basics of genetic algorithms.

WHAT   Implement various algorithms and models for reinforcement learning
(Markov Decision Process, Q-Learning: the Q-Learning function and the
Q-Learning algorithm).

WHERE  In the selection of datasets for various reinforcement learning problems.
 Applications of Genetic Algorithms.

Lecture: 33

5.1 What is Reinforcement Learning?


Reinforcement Learning is a part of machine learning. Here, agents are self-trained through
reward and punishment mechanisms. It is about taking the best possible action or path to
gain maximum reward and minimum punishment through observations in a specific
situation. Rewards and punishments act as signals for positive and negative behaviours.
Essentially, an agent (or several) is built that can perceive and interpret the environment in
which it is placed; furthermore, it can take actions and interact with that environment.

Figure 5.1: Reinforcement Learning

Reinforcement learning, a type of machine learning, in which agents take actions in an


environment aimed at maximizing their cumulative rewards – NVIDIA

5.2 Terms used in Reinforcement Learning


i. Agent: An entity that can perceive/explore the environment and act upon it.

ii. Environment: A situation in which an agent is present or by which it is surrounded. In
RL, we assume a stochastic environment, which means it is random in nature.

iii. Action: Actions are the moves taken by an agent within the environment.

iv. State: A state is the situation returned by the environment after each action taken by the
agent.

v. Reward: Feedback returned to the agent from the environment to evaluate the action of
the agent.

vi. Policy: A policy is the strategy applied by the agent to choose the next action based on
the current state.

vii. Value: The expected long-term return with the discount factor, as opposed to the
short-term reward.

viii. Q-value: Mostly similar to the value, but it takes one additional parameter, the current
action (a).

5.3 Key Features of Reinforcement Learning


 In RL, the agent is not instructed about the environment and what actions need to be taken.

 It is based on the trial-and-error process.

 The agent takes the next action and changes states according to the feedback of the previous
action.

 The agent may get a delayed reward.

 The environment is stochastic, and the agent needs to explore it in order to obtain the
maximum positive reward.
5.4 Approaches to implement Reinforcement Learning
There are mainly three ways to implement reinforcement-learning in ML, which are:

5.4.1 Value-based:

The value-based approach is about finding the optimal value function, i.e., the maximum
value at a state under any policy. With it, the agent expects the long-term return at any
state s under policy π.

5.4.2 Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply a policy such
that the action performed in each step helps to maximize the future reward. The policy-
based approach has mainly two types of policy:

i. Deterministic: The same action is produced by the policy (π) at any state.

ii. Stochastic: In this policy, probability determines the produced action.

5.4.3 Model-based:
In the model-based approach, a virtual model is created for the environment, and the agent
explores that environment to learn it. There is no particular solution or algorithm for this
approach because the model representation is different for each environment.

5.5 Elements of Reinforcement Learning


There are four main elements of Reinforcement Learning, which are given below:

 Policy
 Reward Signal
 Value Function
 Model of the environment

I. Policy: A policy can be defined as the way an agent behaves at a given time. It maps
the perceived states of the environment to the actions taken on those states. A policy is
the core element of the RL as it alone can define the behavior of the agent. In some cases,
it may be a simple function or a lookup table, whereas, for other cases, it may involve
general computation as a search process. It could be deterministic or a stochastic policy:

For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]

II. Reward Signal: The goal of reinforcement learning is defined by the reward signal. At
each state, the environment sends an immediate signal to the learning agent, and this
signal is known as a reward signal. These rewards are given according to the good and
bad actions taken by the agent. The agent's main objective is to maximize the total number
of rewards for good actions. The reward signal can change the policy, such as if an action
selected by the agent leads to low reward, then the policy may change to select other
actions in the future.

III. Value Function: The value function gives information about how good the situation and
action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good state and
action for the future. The value function depends on the reward as, without reward, there
could be no value. The goal of estimating values is to achieve more rewards.
IV. Model: The last element of reinforcement learning is the model, which mimics the
behavior of the environment. With the help of the model, one can make inferences about
how the environment will behave. For example, if a state and an action are given, then a
model can predict the next state and reward. The model is used for planning, which means
it provides a way to take a course of action by considering all future situations before
actually experiencing those situations. Approaches for solving RL problems with the help
of a model are termed model-based approaches. Comparatively, an approach without a
model is called a model-free approach.

5.6 Reinforcement Learning


i. Reinforcement learning is the study of how animals and artificial systems can learn to
optimize their behaviour in the face of rewards and punishments.
ii. Reinforcement learning algorithms are related to methods of dynamic programming,
which is a general approach to optimal control.
iii. Reinforcement learning phenomena have been observed in psychological studies of animal
behaviour, and in neurobiological investigations of neuromodulation and addiction.
iv. The task of reinforcement learning is to use observed rewards to learn an optimal policy
for the environment. An optimal policy is a policy that maximizes the expected total reward.
v. Without some feedback about what is good and what is bad, the agent will have no grounds
for deciding which move to make.
vi. The agent needs to know that something good has happened when it wins and that
something bad has happened when it loses.
vii. This kind of feedback is called a reward or reinforcement.
viii. Reinforcement learning is valuable in the field of robotics, where the tasks to be performed
are frequently complex enough to defy encoding as programs and no training data is available.
ix. In many complex domains, reinforcement learning is the only feasible way to train a program
to perform at high levels.

5.7 Differentiate between reinforcement and supervised learning.


Reinforcement Learning:
i. Reinforcement learning is all about making decisions sequentially. In simple words, we
can say that the output depends on the state of the current input, and the next input depends
on the output of the previous input.
ii. In reinforcement learning, decisions are dependent, so we give labels to sequences of
dependent decisions.
iii. Example: Chess game.

Supervised Learning:
i. In supervised learning, the decision is made on the initial input, or the input given at the
start.
ii. Supervised learning decisions are independent of each other, so labels are given to each
decision.
iii. Example: Object recognition.

 Explain passive reinforcement learning and active reinforcement learning.

I. Passive reinforcement learning

i. In passive learning, the agent's policy π is fixed: in state s, it always executes the action
π(s).
ii. Its goal is simply to learn how good the policy is, that is, to learn the utility function
Uπ(s).
iii. Fig. 5.7.1 shows a policy for the world and the corresponding utilities.
iv. In Fig. 5.7.1(a) the policy happens to be optimal with rewards of R(s) = −0.04 in the
non-terminal states and no discounting.
v. The passive learning agent does not know the transition model T(s, a, s'), which specifies
the probability of reaching state s' from state s after doing action a; nor does it know the
reward function R(s), which specifies the reward for each state.
vi. The agent executes a set of trials in the environment using its policy π.
vii. In each trial, the agent starts in state (1, 1) and experiences a sequence of state transitions
until it reaches one of the terminal states, (4, 2) or (4, 3).
viii. Its percepts supply both the current state and the reward received in that state. Typical
trials might look like this:
ix. (1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (4,3)+1
x. (1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (3,2)-0.04 → (3,3)-0.04 → (4,3)+1
xi. (1,1)-0.04 → (2,1)-0.04 → (3,1)-0.04 → (3,2)-0.04 → (4,2)-1
Fig. 5.7.1(a): A policy π for the 4x3 world (the arrows in the figure show the action π
prescribes in each non-terminal state; (4,3) is the +1 terminal and (4,2) the −1 terminal)

3 |  0.812   0.868   0.918    +1
2 |  0.762           0.660    −1
1 |  0.705   0.655   0.611   0.388
       1       2       3       4
Fig. 5.7.1(b): The utilities of the states in the 4x3 world, given policy π (the (2,2) square
is a wall)

xii. Each state percept is subscripted with the reward received. The objective is to use the
information about rewards to learn the expected utility Uπ(s) associated with each non-terminal
state s.
xiii. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π
is followed:

Uπ(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s_0 = s ]

where γ is a discount factor; for the 4×3 world we set γ = 1.

II. Active reinforcement learning:

a. An active agent must decide what actions to take.


b. First, the agent will need to learn a complete model with outcome probabilities for
all actions, rather than just model for the fixed policy.
c. We need to take into account the fact that the agent has a choice of actions.
d. The utilities it needs to learn are those defined by the optimal policy; they obey the
Bellman equations:
e. U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
f. These equations can be solved to obtain the utility function U using the value
iteration or policy iteration algorithms.
g. A utility function U is optimal for the learned model; the agent can extract an
optimal action by one-step look-ahead to maximize the expected utility.
h. Alternatively, if it uses policy iteration, the optimal policy is already available, so
it should simply execute the action the optimal policy recommends.
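As an illustration of point f, the following is a minimal value-iteration sketch in Python. It is not from the notes: the dictionaries T and R, the terminals set, and the parameter values are assumed placeholders, with T[(s, a)] a list of (probability, next_state) pairs and R[s] the reward for state s.

    def value_iteration(states, actions, T, R, terminals, gamma=1.0, theta=1e-6):
        # Start with zero utilities and sweep until the largest change is tiny.
        U = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                if s in terminals:
                    new_u = R[s]                      # terminal utility is its reward
                else:
                    # Bellman update: U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
                    new_u = R[s] + gamma * max(
                        sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions)
                delta = max(delta, abs(new_u - U[s]))
                U[s] = new_u
            if delta < theta:
                return U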

5.8 Types of reinforcement learning:


1. Positive reinforcement learning:
a. Positive reinforcement learning is defined as follows: an event that occurs due to a particular behaviour
increases the strength and the frequency of that behaviour.
b. In other words, the event has a positive effect on the behaviour.
c. Advantages of positive reinforcement learning are:
i. Maximizes performance.
ii. Sustains change for a long period of time.
d. Disadvantages of positive reinforcement learning:
i. Too much reinforcement can lead to overload of states which can diminish the results.
2. Negative reinforcement learning:
a. Negative reinforcement is defined as strengthening of behaviour because a negative condition
is stopped or avoided.
b. Advantages of negative reinforcement learning:
i. Increases behaviour.
ii. It helps enforce a minimum standard of performance.
c. Disadvantages of negative reinforcement learning:
i. It provides only enough motivation to meet the minimum standard of behaviour.

What are the elements of reinforcement learning?


Elements of reinforcement learning:
1. Policy (π):
a. It defines the behaviour of the agent which action to take in a given state to maximize the received
reward in the long term.
b. It corresponds to stimulus-response rules or associations.
c. It could be a simple lookup table or function, or need more extensive computation (for example,
search).
d. It can be probabilistic.
2. Reward function (r):
a. It defines the goal in a reinforcement learning problem, maps a state or action to a scalar number,
the reward (or reinforcement).

180
b. The RL agent's objective is to maximize the total reward it receives in the long run.
c. It defines good and bad events.
d. It cannot be altered by the agent but may inform change of policy.
e. It can be probabilistic (expected reward).
3. Value function (V):
a. It defines the total amount of reward an agent can expect to accumulate over the future, starting
from that state.
b. A state may yield a low reward but have a high value (or the opposite). For example, immediate
pain/pleasure vs. long term happiness.
4. Transition model (M):
a. It defines the transitions in the environment: action a taken in state s will lead to state s'.
b. It can be probabilistic.

Describe briefly the learning tasks used in machine learning.


1. A machine learning task is the type of prediction or inference being made, based on the problem
or question that is being asked, and the available data.
2. For example, the classification task assigns data to categories, and the clustering task groups
data according to similarity.
3. Machine learning tasks rely on patterns in the data rather than being explicitly programmed.
4. Classification is a supervised machine learning task that is used to predict which of two classes (categories) an
instance of data belongs to.
5. The input of a classification algorithm is a set of labeled examples, where each label is an
integer of either 0 or 1.
6. The output of a binary classification algorithm is a classifier, which we can use to predict the class
of new unlabeled instances.
7. Clustering is an unsupervised machine learning task that is used to group instances of data into clusters that
contain similar characteristics.
8. Clustering can also be used to identify relationships in a dataset that we might not logically derive
by browsing or simple observation.
9. The inputs and outputs of a clustering algorithm depend on the methodology chosen.

5.9 Different machine learning tasks.


Following are the most common machine learning tasks:
i. Data pre-processing: Before starting training the models, it is important to prepare data
appropriately. As part of data pre-processing following is done:
a. Data cleaning
b. Handling missing data
ii. Exploratory data analysis: Once data is pre-processed, the next step is to perform exploratory
data analysis to understand data distribution and relationship between /within the data.
iii. Feature engineering / selection: Feature selection is one of the critical tasks when building
machine learning models. It is important because selecting the right features not only helps
build models of higher accuracy but also helps achieve related objectives such as building
simpler models and reducing overfitting.
iv. Regression: Regression tasks deal with estimation of numerical values (continuous variables).
Some of the examples include estimation of housing price, product price, stock price etc.

v. Classification: The classification task is related to predicting the category of data (discrete
variables). Common examples are predicting whether an email is spam, whether a person is
suffering from a particular disease, or whether a transaction is fraudulent.
vi. Clustering: Clustering tasks are all about finding natural groupings of data and a label associated
with each of these groupings (clusters).
vii. Some of the common examples include customer segmentation and product feature identification for
a product roadmap.
viii. Multivariate querying: Multivariate querying is about querying or finding similar objects.
ix. Density estimation: Density estimation problems are related with finding likelihood or frequency
of objects.
x. Dimension reduction: Dimension reduction is the process of reducing the number of random
variables under consideration, and can be divided into feature selection and feature extraction.
xi. Model/algorithm selection: Often, multiple models are trained using different algorithms.
One of the important tasks is to select the most optimal models for deploying them in
production.
xii. Testing and matching: Testing and matching tasks relate to comparing data sets.

Lecture: 34

5.10 Reinforcement learning with the help of an example.


1. Reinforcement learning (RL) is learning concerned with how software agents ought to take
actions in an environment in order to maximize the notion of cumulative reward.
2. The software agent is not told which actions to take, but instead must discover which actions
yield the most reward by trying them.
For example,
Consider the scenario of teaching new tricks to a cat:
i. As a cat does not understand English or any other human language, we cannot tell her directly what
to do. Instead, we follow a different strategy.
ii. We emulate a situation, and the cat tries to respond in many different ways. If the cat's response
is the desired way, we will give her fish.
iii. Now whenever the cat is exposed to the same situation, the cat executes similar action even more
enthusiastically in expectation of getting more reward (food).
iv. That is how the cat learns "what to do" from positive experiences.
v. At the same time, the cat also learns what not to do when faced with negative experiences.

5.10.1 Working of Reinforcement Learning:


i. In this case, the cat is an agent that is exposed to the environment (here, your house).
An example of a state could be the cat sitting, and we use a specific word for the cat to walk.
ii. Our agent reacts by performing an action transition from one "state" to another "state."
iii. For example, the cat goes from sitting to walking.
iv. The reaction of an agent is an action, and the policy is a method of selecting an action given a
state in expectation of better outcomes.
v. After the transition, they may get a reward or penalty in return.

5.10.2 Terms used in reinforcement learning method.


Following are the terms used in reinforcement learning:
Agent: It is an assumed entity which performs actions in an environment to gain some reward.
i. Environment (e): A scenario that an agent has to face.
ii. Reward (R): An immediate return given to an agent when it performs a specific action or
task.
iii. State (s) : State refers to the current situation returned by the environment.
iv. Policy (π): It is a strategy applied by the agent to decide the next action based on the current
state.
v. Value (V): It is the expected long-term return with discount, as compared to the short-term reward.
vi. Value function: It specifies the value of a state, that is, the total amount of reward an agent
can expect to accumulate, beginning from that state.
vii. Model of the environment: This mimics the behaviour of the environment. It helps you to make
inferences to be made and also determine how the environment will behave.
viii. Model-based methods: These are methods for solving reinforcement learning problems that use
a model of the environment.
ix. Q-value or action value (Q): The Q-value is quite similar to value. The only difference between the
two is that it takes an additional parameter, the current action.

5.10.3 Approaches used to implement reinforcement learning algorithm.
There are three approaches used to implement a reinforcement learning algorithm:
1. Value-Based:
a. In a value-based reinforcement learning method, we should try to maximize a value function V(s).
In this method, the agent is expecting a long-term return of the current states under policy π.
2. Policy-based:
In a policy-based RL method, we try to come up with such a policy that the action performed in
every state helps you to gain maximum reward in the future.
Two types of policy-based methods are:
i. Deterministic: For any state, the same action is produced by the policy
ii. Stochastic: Every action has a certain probability, which is determined by the following
stochastic policy equation:
π(a|s) = P[A_t = a | S_t = s]
3. Model-Based:
a. In this Reinforcement Learning method, we need to create a virtual model for each environment.
b. The agent learns to perform in that specific environment.

Lecture: 35

5.11 Learning models of reinforcement learning

a) Reinforcement learning is defined by a specific type of problem, and all its solutions are classed as
reinforcement learning algorithms.
b) In the problem, an agent is supposed to decide the best action to select based on his current state.
c) When this step is repeated, the problem is known as a Markov Decision Process.
d) A Markov Decision Process (MDP) model contains:
i) A State is a set of tokens that represent every state that the agent can be in.
ii) A Model (sometimes called Transition Model) gives an action's effect in a state. In particular,
T(S, a, S') defines a transition T where being in state S and taking an action 'a' takes us to state
S' (S and S' may be same).
iii) An Action A is set of all possible actions. A(s) defines the set of actions that can be taken being
in state S.
iv) A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S, a) indicates the reward for being in a state S and taking an action 'a'. R(S, a, S')
indicates the reward for being in a state S, taking an action 'a' and ending up in a state S'.
v) A Policy is a solution to the Markov Decision Process. A policy is a mapping from states to
actions: it indicates the action 'a' to be taken while in state S.
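As a concrete illustration of these components, the snippet below encodes a toy MDP in Python. The state names, probabilities and rewards are hypothetical, chosen only to show the shape of the data, not taken from the notes.

    # A toy MDP with the components above (all names and values are illustrative).
    states = ["s0", "s1", "goal"]
    actions = {"s0": ["stay", "go"], "s1": ["stay", "go"], "goal": []}
    # Transition model: T[(s, a)] -> list of (probability, next_state) pairs
    T = {
        ("s0", "stay"): [(1.0, "s0")],
        ("s0", "go"):   [(0.8, "s1"), (0.2, "s0")],
        ("s1", "stay"): [(1.0, "s1")],
        ("s1", "go"):   [(0.9, "goal"), (0.1, "s1")],
    }
    R = {"s0": -0.04, "s1": -0.04, "goal": +1.0}   # reward for being in each state
    policy = {"s0": "go", "s1": "go"}              # a policy: mapping state -> action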

5.11. 1 Challenges of reinforcement learning

(1) We cannot apply a reinforcement learning model in every situation. Following are the conditions
under which we should not use a reinforcement learning model:
(2) When we have enough data to solve the problem with a supervised learning method.
(3) When the action space is large, since reinforcement learning then becomes computation-heavy
and time-consuming.
(4) Challenges we will face while doing reinforcement learning are:
1. Feature/reward design which should be very involved.
2. Parameters may affect the speed of learning.
3. Realistic environments can have partial observability.
4. Too much reinforcement may lead to an overload of states which can diminish
the results.
5. Realistic environments can be non-stationary.

5.12 Q-learning

i. Q-learning is a model-free reinforcement learning algorithm.


ii. Q-learning is a value-based learning algorithm. Value-based algorithms update the value
function based on an equation (in particular, the Bellman equation).
iii. The other type, policy-based, estimates the value function with a greedy policy
obtained from the last policy improvement.
iv. Q-learning is an off-policy learner i.e., it learns the value of the optimal policy
independently of the agent's actions.

v. On the other hand, an on-policy learner learns the value of the policy being carried out by
the agent, including the exploration steps and it will find a policy that is optimal, taking into
account the exploration inherent in the policy.

5.12.1 Q-learning algorithm

Step 1: Initialize the Q-table: First the Q-table has to be built. There are n columns, where n =
number of actions. There are m rows, where m = number of states.
In our example, the actions are Go left, Go right, Go up and Go down (n = 4), and the states are Start, Idle, Correct path, Wrong path and End (m = 5). First, let us initialize all values to 0.

Step 2: Choose an action.

Step 3: Perform an action: The combination of steps 2 and 3 is repeated for an indefinite
amount of time. These steps run until training is stopped, or until the training loop
terminates as defined in the code.
a. First, an action (a) in the state (s) is chosen based on the Q-table. Note that, when the episode
initially starts, every Q-value should be 0.
b. Then, update the Q-values for being at the start and moving right using the Bellman equation.

Step 4: Measure reward: Now we have taken an action and observed an outcome and reward.

Step 5: Evaluate: We need to update the function Q(s, a)


This process is repeated again and again until learning is stopped. In this way the Q-table is
updated and the value function Q is maximized. Here Q returns the expected future
reward of taking that action in that state.
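The steps above can be sketched as a minimal tabular Q-learning loop in Python. This is an illustration, not the notes' implementation: the environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the hyper-parameter values are assumptions.

    import random

    def q_learning(env, states, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
        Q = {(s, a): 0.0 for s in states for a in actions}   # Step 1: initialize Q-table
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # Step 2: choose an action (epsilon-greedy over the Q-table)
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                # Steps 3-4: perform the action, observe next state and reward
                s2, r, done = env.step(a)
                # Step 5: Bellman update of Q(s, a)
                best_next = 0.0 if done else max(Q[(s2, x)] for x in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s2
        return Q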

Lecture: 36

5.13 Application of reinforcement learning


Following are the applications of reinforcement learning:
1. Robotics for industrial automation.
2. Business strategy planning.
3. Machine learning and data processing.
4. It helps us to create training systems that provide custom instruction and materials according
to the requirement of students.
5. Aircraft control and robot motion control.

Following are the reasons for using reinforcement learning:


1. It helps us to find which situation needs an action.
2. Helps us to discover which action yields the highest reward over the longer period.
3. Reinforcement Learning also provides the learning agent with a reward function.
4. It also allows us to figure out the best method for obtaining large rewards.

5.14 Describe deep Q-learning.


1. In deep Q-learning, we use a neural network to approximate the Q value function.
2. The state is given as the input and the Q-value of all possible actions is generated as the output.
3. The comparison between Q-learning and deep Q-learning: Q-learning stores Q-values in a table, while deep Q-learning approximates them with a neural network.
4. On a higher level, deep Q-learning works as follows:
i. Gather and store samples in a replay buffer with the current policy.
ii. Randomly sample batches of experiences from the replay buffer.
iii. Use the sampled experiences to update the Q-network.
iv. Repeat steps i-iii.

5.14.1 Steps involved in reinforcement learning using deep Q-learning networks:


1. All the past experience is stored by the agent in replay memory.
2. The next action is determined by the maximum output of the Q-network.
3. The loss function here is the mean squared error between the predicted Q-value and the target
Q-value Q*. This is basically a regression problem.
4. However, we do not know the target or actual value here as we are dealing with a reinforcement
learning problem. Going back to the Q-value update equation derived from the Bellman
equation, we have:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]

5.14.2 Pseudo code for deep Q-learning.
Start with Q_0(s, a) for all s, a.
Get initial state s.
For k = 1, 2, ... till convergence:
    Sample action a, get next state s'.
    If s' is terminal:
        target = R(s, a, s')
        Sample new initial state s'.
    Else:
        target = R(s, a, s') + γ max_{a'} Q_k(s', a')
    θ_{k+1} ← θ_k − α ∇_θ E_{s'~P(s'|s,a)} [ (Q_θ(s, a) − target(s'))^2 ]
    s ← s'
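To make the gradient step in the last pseudocode line concrete, here is a small NumPy sketch of one TD update for a linear Q-approximator Q(s, a) = w · φ(s, a). The feature function phi and the parameter values are illustrative assumptions, not part of the notes.

    import numpy as np

    # One TD update for a linear Q-approximator Q(s, a) = w . phi(s, a).
    # phi is an assumed feature function mapping (state, action) to a NumPy vector.
    def td_update(w, phi, s, a, r, s2, actions, done, alpha=0.01, gamma=0.9):
        q_sa = w @ phi(s, a)
        target = r if done else r + gamma * max(w @ phi(s2, a2) for a2 in actions)
        # Gradient of 0.5 * (Q(s,a) - target)^2 w.r.t. w is (Q(s,a) - target) * phi(s,a)
        w -= alpha * (q_sa - target) * phi(s, a)
        return w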

Lecture: 37

5.15 Genetic algorithm

i. Genetic algorithms are computerized search and optimization algorithms based on the mechanics
of natural genetics and natural selection.
ii. These algorithms mimic the principle of natural genetics and natural selection to construct
search and optimization procedure.
iii. Genetic algorithms convert the design space into genetic space. Design space is a set of feasible
solutions.
iv. Genetic algorithms work with a coding of variables.
v. The advantage of working with a coding of the variable space is that the coding discretizes the search
space, even though the function may be continuous.
vi. Search space is the space for all possible feasible solutions of particular problem.
vii. Following are the benefits of Genetic algorithm:
a. They are robust.
b. They provide optimization over large space state.
c. They do not break on slight change in input or presence of noise.
viii. Following are the application of Genetic algorithm:
a. Recurrent neural network
b. Mutation testing
c. Code breaking
d. Filtering and signal processing
e. Learning fuzzy rule base

5.15.1 Procedure of Genetic algorithm:


i. Generate a set of individuals as the initial population.
ii. Use genetic operators such as selection and crossover.
iii. Apply mutation or inversion (digital reverse) if necessary.
iv. Evaluate the fitness function of the new population.
v. Use the fitness function to determine the best individuals, and replace predefined
members of the original population.
vi. Iterate steps ii-v, and terminate when some predefined population threshold is met. A minimal skeleton of this loop is sketched below.
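The following is a minimal genetic-algorithm skeleton in Python, assuming binary-string individuals and a user-supplied fitness function; the operator choices (tournament selection, one-point crossover, bit-flip mutation) and all parameter values are illustrative assumptions.

    import random

    def genetic_algorithm(fitness, length=20, pop_size=30, generations=100,
                          cx_rate=0.9, mut_rate=0.01):
        # Initial population of random bit strings
        pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        for _ in range(generations):
            def select():
                # 2-way tournament: the fitter of two random individuals wins
                a, b = random.sample(pop, 2)
                return a if fitness(a) >= fitness(b) else b
            new_pop = []
            while len(new_pop) < pop_size:
                p1, p2 = select(), select()
                # One-point crossover at a random cut
                if random.random() < cx_rate:
                    cut = random.randint(1, length - 1)
                    child = p1[:cut] + p2[cut:]
                else:
                    child = p1[:]
                # Bit-flip mutation with small probability per gene
                child = [1 - g if random.random() < mut_rate else g for g in child]
                new_pop.append(child)
            pop = new_pop
        return max(pop, key=fitness)

For example, genetic_algorithm(sum) evolves bit strings towards all 1s, since sum counts the 1-bits.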

5.15.2 Advantages of genetic algorithm:


i. Genetic algorithms can be executed in parallel. Hence, genetic algorithms are faster.
ii. It is useful for solving optimization problems.

5.15.3 Disadvantages of Genetic algorithm:


i. Identification of the fitness function is difficult as it depends on the problem.
ii. The selection of suitable genetic operators is difficult.

Lecture: 38

5.16 Cycle of genetic algorithm

Different phases of genetic algorithm are:


1. Initial population:
a. The process begins with a set of individuals which is called a population.
b. Each individual is a solution to the problem we want to solve.
c. An individual is characterized by a set of parameters (variables) known as genes
d. Genes are joined into a string to form a chromosome (solution).
e. In a genetic algorithm, the set of genes of an individual is represented using a string.
f. Usually, binary values are used (a string of 1s and 0s).

2. Fitness function:


a. The fitness function determines how fit an individual is (the ability of an individual to compete
with other individuals).
b. It gives a fitness score to each individual.
c. The probability that an individual will be selected for reproduction is based on its fitness score.

3. Selection:
a. The idea of selection phase is to select the fittest individuals and let them pass their genes to the
next generation.
b. Two pairs of individuals (parents) are selected based on their fitness scores.
c. Individuals with high fitness have more chance to be selected for reproduction.

4. Crossover:
a. Crossover is the most significant phase in a genetic algorithm.
b. For each pair of parents to be mated, a crossover point is chosen at random from within the genes.
c. For example, consider the crossover point to be 3.
d. Offspring are created by exchanging the genes of parents among themselves until the crossover
point is reached.
e. The new offspring are added to the population.

5. Mutation:
a. When new offspring are formed, some of their genes can be subjected to a mutation with a low
random probability.
b. This implies that some of the bits in the bit string can be flipped.
c. Mutation occurs to maintain diversity within the population and prevent premature convergence.

6. Termination:
a. The algorithm terminates if the population has converged (does not produce offspring which are
significantly different from the previous generation).
b. Then it is said that the genetic algorithm has provided a set of solutions to our problem.

Lecture: 39

5.17 Mutation
The mutation operator is a unary operator: it needs only one parent to work on. It does so
by selecting a few genes from the selected chromosome and applying the desired algorithm.
Five mutation algorithms for string manipulation:
I. Bit Flip Mutation
II. Random Resetting Mutation
III. Swap Mutation
IV. Scramble Mutation
V. Inversion Mutation
Bit flip mutation is mainly used for bit-string manipulation, while the others can be used for
any kind of string. Here the chromosome is represented as an array and each index
represents one gene. Strings can be represented as an array of characters, which in turn is
an array of ASCII or numeric values.
1) Bit Flip Mutation —
In bit flip mutation, we select one or more genes (array indices) and flip their values, i.e.,
we change 1s to 0s and vice versa.

2) Random Resetting Mutation —


In random resetting mutation, we select one or more genes (array indices) and replace their
values with another random value from their given ranges. Let’s say a[i] (an array index /
gene) ranges from [1, 6] then random resetting mutation will select one value from [1, 6]
and replace a[i]’s value with it.

3) Swap Mutation —
In Swap Mutation we select two genes from our chromosome and interchange their values.

4) Scramble Mutation —
In scramble mutation we select a subset of our genes and scramble their values. The
selected genes need not be contiguous.

5) Inversion Mutation —
In inversion mutation we select a subset of our genes and reverse their order. The genes
have to be contiguous in this case. Python sketches of these operators follow.
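The following Python sketches illustrate four of the operators above on list-encoded chromosomes; the function names and index arguments are illustrative, not from the notes.

    import random

    def bit_flip(chrom, i):                 # flip the gene at index i (0 <-> 1)
        chrom[i] = 1 - chrom[i]
        return chrom

    def swap(chrom, i, j):                  # interchange the values of two genes
        chrom[i], chrom[j] = chrom[j], chrom[i]
        return chrom

    def scramble(chrom, idxs):              # shuffle the values at the chosen
        vals = [chrom[i] for i in idxs]     # (possibly non-contiguous) indices
        random.shuffle(vals)
        for i, v in zip(idxs, vals):
            chrom[i] = v
        return chrom

    def inversion(chrom, i, j):             # reverse a contiguous slice of genes
        chrom[i:j] = chrom[i:j][::-1]
        return chrom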

5.18 Genetic Programming


Genetic Programming (GP) extends the concept of genetic algorithms to evolve programs
or expressions. Instead of evolving a set of parameters or solutions, GP evolves entire
programs or expressions that can perform a task or solve a problem.
5.18.1 Key Components of Genetic Programming
i. Population: A set of programs or expressions.
ii. Fitness Function: A measure of how well a program or expression performs a given
task.
iii. Selection, Crossover, and Mutation: Similar to GAs, but applied to program
structures or expressions.
5.18.2 Key Principles of Evolutionary Algorithms
Evolutionary Algorithms (EAs), including GAs and GP, are based on several fundamental
principles:
i. Natural selection: The idea that better solutions are more likely to be selected for
reproduction.
ii. Genetic variation: Diversity in the population is introduced through crossover and
mutation to explore a wider solution space.
iii. Survival of the fittest: Solutions are evaluated based on a fitness function, and the
fittest solutions are more likely to be selected for the next generation.
These principles ensure that the algorithm explores a variety of solutions and
converges towards optimal or near-optimal solutions.

5.19 Types of encoding in Genetic Algorithm

Genetic representations:
I. Encoding:
a. Encoding is a process of representing individual genes.
b. The process can be performed using bits, numbers, trees, arrays, lists or any other
objects.
c. The choice of encoding depends mainly on the problem being solved.
1. Binary encoding:
a. Binary encoding is the most commonly used method of genetic representation,
because classical GA work uses this type of encoding.
b. In binary encoding, every chromosome is a string of bits, 0 or 1.
c. Chromosome A 101100101100101011100101
d. Chromosome B 111111100000110000011111
e. Binary encoding gives many possible chromosomes.

2. Octal or Hexadecimal encoding:


a. The encoding is done using octal or hexadecimal numbers
Chromosome      Octal        Hexadecimal
Chromosome A    54545345     B2CAE5
Chromosome B    77406037     FE0C1F

3. Permutation encoding (real number encoding):


a. Permutation encoding can be used in ordering problems, such as the Travelling
Salesman Problem (TSP).
b. In permutation encoding, every chromosome is a string of numbers which
represents a position in a sequence.
Chromosome A 153264798
Chromosome B 856723149

4. Value encoding:
a. Direct value encoding can be used in problems, where some complicated
values, such as real numbers, are used.
b. In value encoding, every chromosome is a string of some values.
c. Values can be anything connected to the problem, from real numbers or characters
to complicated objects.
Chromosome A 1.2324 5.3243 0.4556 2.3293 2.4545
Chromosome B ABDJEIFJDHDIERJFDLDFLFEGT
Chromosome C (back), (back), (right), (forward), (left)
5. Tree encoding:
a. Tree encoding is used for evolving programs or expressions, for genetic
programming.
b. In tree encoding, every chromosome is a tree of some objects, such as functions
or commands in programming language.
c. The programming language LISP is often used for this, because programs in it are
represented in this form and can be easily parsed as a tree, so crossover
and mutation can be done relatively easily.
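As a quick illustration, the encodings above map naturally onto Python values; these literals are arbitrary examples, not taken from the notes:

    binary_chrom      = [1, 0, 1, 1, 0, 0, 1, 0]            # binary encoding: bits
    permutation_chrom = [1, 5, 3, 2, 6, 4, 7, 9, 8]         # permutation encoding: visit order
    value_chrom       = [1.2324, 5.3243, 0.4556]            # value encoding: real numbers
    tree_chrom        = ("+", ("x",), ("*", ("y",), (2,)))  # tree encoding: nested expression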

5.20 The various methods of selection


a. Roulette-wheel selection:
i. Roulette-wheel selection is the proportionate reproductive method where a string is
selected from the mating pool with a probability proportional to the fitness.
ii. Thus, the ith string in the population is selected with a probability proportional to Fi, where Fi is the
fitness value of that string.
iii. Since the population size is usually kept fixed in Genetic Algorithm, the sum of the
probabilities of each string being selected for the mating pool must be one.
iv. The probability of selecting the ith string is
Pi = Fi / Σ_{j=1}^{n} Fj
where n is the population size. (A Python sketch of roulette-wheel selection is given after this list.)
v. The average fitness is
F̄ = ( Σ_{j=1}^{n} Fj ) / n
b. Boltzmann selection:
i. Boltzmann selection uses the concept of simulated annealing.
ii. Simulated annealing is a method of functional minimization or maximization.
iii. This method simulates the process of slow cooling of molten metal to achieve the
minimum function value in a minimization problem.
iv. The cooling phenomenon is simulated by controlling a temperature so that a system in
thermal equilibrium at a temperature T has its energy distributed probabilistically
according to
P(E) = exp( −E / kT )
where k is the Boltzmann constant.
v. This expression suggests that a system at a high temperature has almost uniform
probability of being at any energy state, but at a low temperature it has a small probability
of being at a high energy state.
vi. Therefore, by controlling the temperature T and assuming the search process follows the
Boltzmann probability distribution, the convergence of the algorithm is controlled.
c. Tournament selection:
i. GA uses a strategy to select individuals from the population and insert them into a mating
pool.
ii. A selection strategy in GA is a process that favours the selection of better individuals in
the population for the mating pool.
iii. There are two important issues in the evolution process of genetic search.
1. Population diversity: Population diversity means that the genes from the already
discovered good individuals are exploited while new areas of the search space continue to be explored.
2. Selective pressure: Selective pressure is the degree to which the better individuals are
favoured. The higher the selective pressure the better individuals are favoured.
d. Rank selection:
i. Rank selection first ranks the population; every chromosome then receives its fitness
from this ranking.

ii. The worst will have fitness 1, the next 2, .., and the best will have fitness N (N is the
number of chromosomes in the population).
iii. The method can lead to slow convergence because the best chromosomes do not differ
much from the others.
e. Steady-state selection:
i. The main idea of this selection is that a bigger part of the chromosomes should survive to the next
generation.
ii. GA works in the following way:
1. In every generation a few chromosomes are selected for creating new offspring.
2. Then, some chromosomes are removed and the new offspring are placed in their place.
3. The rest of the population survives into the new generation.
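As promised above, here is a short Python sketch of roulette-wheel (fitness-proportionate) selection; the population and fitness lists are assumed inputs, and the guard on the last line is a defensive choice against floating-point round-off.

    import random

    def roulette_select(population, fitnesses):
        # Each individual is selected with probability F_i / sum(F_j).
        total = sum(fitnesses)
        r = random.uniform(0, total)       # spin the wheel
        cumulative = 0.0
        for individual, f in zip(population, fitnesses):
            cumulative += f
            if r <= cumulative:
                return individual
        return population[-1]              # guard against round-off at the wheel's edge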

Lecture: 40

5.21 Roulette-wheel based on fitness v/s Roulette-wheel based on rank

5.21.1 Roulette-wheel based on fitness


i. Individuals are selected with a probability that is directly proportional to their fitness
values.
ii. It computes selection probabilities according to fitness values but does not sort the
individuals in the population.
iii. It gives a chance to all the individuals in the population to be selected.
iv. Diversity in the population is preserved.
5.21.2 Roulette-wheel based on Rank
i. The probability of an individual being selected is based on its fitness rank.
ii. It first sorts the individuals in the population according to their fitness and then computes
selection probabilities according to their ranks rather than their fitness values.
iii. It selects the individuals with the highest rank in the population.
iv. Diversity in the population is not preserved.
Example:
i. Imagine a roulette wheel on which all chromosomes in the population are placed; each
chromosome occupies a slice sized according to its fitness function.
ii. When the wheel is spun, it will finally stop and the pointer attached to it will point
to one of the chromosomes, with higher-fitness chromosomes being more likely.
iii. The difference between roulette-wheel selection based on fitness and on rank is that the
slice sizes come from raw fitness values in the first case and from ranks in the second.

5.22 Applications of genetic algorithms

1. Optimization: Genetic Algorithms are most commonly used in optimization problems wherein
we have to maximize or minimize a given objective function value under a given set of
constraints.
2. Economics: GAs are also used to characterize various economic models like the cobweb
model, game theory equilibrium resolution, asset pricing, etc.
3. Neural networks: GAs are also used to train neural networks, particularly recurrent neural
networks.
4. Parallelization: GAs also have very good parallel capabilities, and prove to be very effective
means in solving certain problems, and also provide a good area for research.
5. Image processing: GAs are used for various digital image processing (DIP) tasks, such as
dense pixel matching.
6. Machine learning: Genetics-based machine learning (GBML) is a niche area in machine
learning.
7. Robot trajectory generation: GAs have been used to plan the path which a robot arm takes
when moving from one point to another.

Lecture: 41

5.23 Industrial Application


5.23.1 Optimization of travelling salesman problem using genetic algorithm
i. The TSP consists of a number of cities, where each pair of cities has a corresponding distance.
ii. The aim is to visit all the cities such that the total distance travelled will be minimized.
iii. A solution, and therefore a chromosome which represents that solution to the TSP, can be given
as an order, that is, a path, of the cities.
iv. The GA process starts by supplying important information such as the locations of the cities,
the maximum number of generations, the population size, the probability of crossover and the
probability of mutation.
v. An initial random population of chromosomes is generated and the fitness of each chromosome
is evaluated.
vi. The population is then transformed into a new population (the next generation) using three
genetic operators: selection, crossover and mutation.
vii. The selection operator is used to choose two parents from the current generation in order to
create a new child by crossover and/or mutation.
viii. The new generation contains a higher proportion of the characteristics possessed by the good
members of the previous generation and in this way good characteristics are spread over the
population and mixed with other good characteristics.
ix. After each generation, a new set of chromosomes where the size is equal to the initial
population size is evolved.
x. This transformation process from one generation to the next continues until the population
converges to the optimal solution, which usually occurs when a certain percentage of the
population (for example, 90 percent) has the same optimal chromosome, in which case the best
individual is taken as the optimal solution.
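A compact Python sketch of two TSP ingredients described above: the tour-length fitness and an order-preserving crossover that keeps each child a valid permutation of the cities. The distance matrix dist and the function names are illustrative assumptions, and this is only one of several crossover operators used for the TSP.

    import random

    def tour_length(tour, dist):
        # Total distance of the closed tour; dist[i][j] is the distance between cities i and j.
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

    def order_crossover(p1, p2):
        # Copy a random slice from parent 1, then fill the remaining positions
        # in parent 2's order, so the child is still a valid permutation.
        a, b = sorted(random.sample(range(len(p1)), 2))
        child = [None] * len(p1)
        child[a:b] = p1[a:b]
        fill = [c for c in p2 if c not in child]
        for i in range(len(child)):
            if child[i] is None:
                child[i] = fill.pop(0)
        return child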

5.23.2 Convergence of genetic algorithm


i. A genetic algorithm is usually said to converge when there is no significant improvement in
the values of fitness of the population from one generation to the next.
ii. One criterion for convergence may be that when a fixed percentage of the columns and rows
in the population matrix become the same, convergence can be assumed to have been attained.
The fixed percentage may be 80% or 85%.
iii. In genetic algorithm as we proceed with more generations, there may not be much
improvement in the population fitness and the best individual may not change for subsequent
populations.
iv. As the generations progress, the population gets filled with more fit individuals with only
slight deviation from the fitness of the best individuals found so far, and the average fitness comes
very close to the fitness of the best individuals.
v. The convergence criteria can be explained from schema point of view.
vi. A schema is a similarity template describing a subset of strings with similarities at certain
positions. A schema represents a subset of all possible strings that have the same bits at certain
string positions.
vii. Since a schema represents a subset of strings, we can associate a fitness value with a schema, i.e.,
the average fitness of the schema.
viii. One can visualize the GA's search for the optimal strings as a simultaneous competition among
schemata to increase the number of their instances in the population.
5.24 Important Questions

Q1: Explain the Genetic Algorithm with a flow chart.

Q2: What is Reinforcement learning? Describe briefly Reinforcement learning.

Q3: Explain Markov Decision Process.

Q4: Explain GA (Genetic algorithm) cycle of reproduction?

Q5: What are advantages and disadvantages of Genetic algorithm?

Q6: Differentiate between Q Learning and Machine Learning.

Q7: Explain various types of reinforcement learning techniques with suitable example.

Q8: Differentiate between Reinforcement and Supervised Learning.

Q9: What are the different types and elements of Reinforcement Learning?

Q10: Describe briefly the different learning tasks used in Machine Learning.

Q11: Explain approaches used to implement Reinforcement Learning Algorithm.

Q12: Describe Learning Models, challenges and applications of Reinforcement Learning.

Q13: Describe Q-Learning Algorithm Process and steps involved in Deep Q-Learning Network.

Q14: Explain different phases of Genetic Algorithm with advantages and disadvantages.

Q15: Write Short notes on Procedures and Representations of Genetic Algorithm.

Q16: Explain different types of Encoding and benefits of Genetic Algorithm.

Q17: Explain different methods of selection in Genetic Algorithm in order to select a population
for next generation.

Q18: Write Short notes on “Genetic Programming”

Meerut Institute of Engineering & Technology, Meerut
NH-58, Bypass Road, Baghpat Crossing, Meerut 250 005, U.P., INDIA
