1) The document is a textbook on machine learning techniques for third year B.Tech students. It covers topics like regression, Bayesian learning, decision trees, artificial neural networks, reinforcement learning, and genetic algorithms. 2) The textbook is divided into 5 units covering introduction to machine learning approaches, regression and Bayesian learning, decision tree learning, artificial neural networks, and reinforcement learning. 3) Each unit covers key concepts in the topic area along with examples and applications. Short questions and learning outcomes are also provided at the end.

QUANTUM SERIES

For
B.Tech Students of Third Year
of All Engineering Colleges Affiliated to
Dr. A.P.J. Abdul Kalam Technical University,
Uttar Pradesh, Lucknow
(Formerly Uttar Pradesh Technical University)

Machine Learning Techniques


By

Kanika Dhama

TM

QUANTUM PAGE PVT. LTD.


Ghaziabad New Delhi
PUBLISHED BY : Apram Singh
Quantum Publications
(A Unit of Quantum Page Pvt. Ltd.)
Plot No. 59/2/7, Site - 4, Industrial Area,
Sahibabad, Ghaziabad-201 010
Phone : 0120 - 4160479
Email : [email protected] Website: www.quantumpage.co.in
Delhi Office : 1/6590, East Rohtas Nagar, Shahdara, Delhi-110032

© ALL RIGHTS RESERVED
No part of this publication may be reproduced or transmitted, in any form or by any means, without permission.

Information contained in this work is derived from sources believed to be reliable. Every effort has been made to ensure accuracy, however neither the publisher nor the authors guarantee the accuracy or completeness of any information published herein, and neither the publisher nor the authors shall be responsible for any errors, omissions, or damages arising out of use of this information.

Machine Learning Techniques (CS/IT : Sem-5)
1st Edition : 2020-21
Price: Rs. 65/- only
Printed Version : e-Book.

CONTENTS
KCS-055 : MACHINE LEARNING TECHNIQUES

UNIT-1 : INTRODUCTION
Learning, Types of Learning, Well defined learning problems, Designing a Learning System, History of ML, Introduction of Machine Learning Approaches – (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian networks, Support Vector Machine, Genetic Algorithm), Issues in Machine Learning and Data Science Vs Machine Learning.

UNIT-2 : REGRESSION & BAYESIAN LEARNING
REGRESSION: Linear Regression and Logistic Regression.
BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes classifier, Bayesian belief networks, EM algorithm.
SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel – (Linear kernel, polynomial kernel, and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and Issues in SVM.

UNIT-3 : DECISION TREE LEARNING
DECISION TREE LEARNING - Decision tree learning algorithm, Inductive bias, Inductive inference with decision trees, Entropy and information theory, Information gain, ID-3 Algorithm, Issues in Decision tree learning.
INSTANCE-BASED LEARNING – k-Nearest Neighbour Learning, Locally Weighted Regression, Radial basis function networks, Case-based learning.

UNIT-4 : ARTIFICIAL NEURAL NETWORKS
ARTIFICIAL NEURAL NETWORKS – Perceptrons, Multilayer perceptron, Gradient descent & the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm, Generalization, Unsupervised Learning – SOM Algorithm and its variant;
DEEP LEARNING - Introduction, concept of convolutional neural network, Types of layers – (Convolutional Layers, Activation function, pooling, fully connected), Concept of Convolution (1D and 2D) layers, Training of network, Case study of CNN, e.g. on Diabetic Retinopathy, Building a smart speaker, Self-driving car etc.

UNIT-5 : REINFORCEMENT LEARNING
REINFORCEMENT LEARNING – Introduction to Reinforcement Learning, Learning Task, Example of Reinforcement Learning in Practice, Learning Models for Reinforcement – (Markov Decision process, Q Learning - Q Learning function, Q Learning Algorithm), Application of Reinforcement Learning, Introduction to Deep Q Learning.
GENETIC ALGORITHMS: Introduction, Components, GA cycle of reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Applications.

SHORT QUESTIONS


Machine Learning Techniques (KCS 055)

Course Outcome (CO) and Bloom's Knowledge Level (KL)
At the end of the course, the student will be able :

CO 1 : To understand the need for machine learning for various problem solving (K1, K2)
CO 2 : To understand a wide variety of learning algorithms and how to evaluate models generated from data (K1, K3)
CO 3 : To understand the latest trends in machine learning (K2, K3)
CO 4 : To design appropriate machine learning algorithms and apply the algorithms to real-world problems (K4, K6)
CO 5 : To optimize the models learned and report on the expected accuracy that can be achieved by applying the models (K4, K5)

DETAILED SYLLABUS (3-0-0)
Unit : Topic (Proposed Lectures)

Unit I (08 lectures) :
INTRODUCTION – Learning, Types of Learning, Well defined learning problems, Designing a Learning System, History of ML, Introduction of Machine Learning Approaches – (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian networks, Support Vector Machine, Genetic Algorithm), Issues in Machine Learning and Data Science Vs Machine Learning;

Unit II (08 lectures) :
REGRESSION: Linear Regression and Logistic Regression
BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naïve Bayes classifier, Bayesian belief networks, EM algorithm.
SUPPORT VECTOR MACHINE: Introduction, Types of support vector kernel – (Linear kernel, polynomial kernel, and Gaussian kernel), Hyperplane – (Decision surface), Properties of SVM, and Issues in SVM.

Unit III (08 lectures) :
DECISION TREE LEARNING - Decision tree learning algorithm, Inductive bias, Inductive inference with decision trees, Entropy and information theory, Information gain, ID-3 Algorithm, Issues in Decision tree learning.
INSTANCE-BASED LEARNING – k-Nearest Neighbour Learning, Locally Weighted Regression, Radial basis function networks, Case-based learning.

Unit IV (08 lectures) :
ARTIFICIAL NEURAL NETWORKS – Perceptrons, Multilayer perceptron, Gradient descent and the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm, Generalization, Unsupervised Learning – SOM Algorithm and its variant;
DEEP LEARNING - Introduction, concept of convolutional neural network, Types of layers – (Convolutional Layers, Activation function, pooling, fully connected), Concept of Convolution (1D and 2D) layers, Training of network, Case study of CNN, e.g. on Diabetic Retinopathy, Building a smart speaker, Self-driving car etc.

Unit V (08 lectures) :
REINFORCEMENT LEARNING – Introduction to Reinforcement Learning, Learning Task, Example of Reinforcement Learning in Practice, Learning Models for Reinforcement – (Markov Decision process, Q Learning - Q Learning function, Q Learning Algorithm), Application of Reinforcement Learning, Introduction to Deep Q Learning.
GENETIC ALGORITHMS: Introduction, Components, GA cycle of reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Applications.

Text books:
1. Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (India) Private Limited, 2013.
2. Ethem Alpaydin, "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", The MIT Press, 2004.
3. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
4. Bishop, C., Pattern Recognition and Machine Learning. Berlin: Springer-Verlag.

UNIT-1 : INTRODUCTION

CONTENTS
Part-1 : Learning, Types of Learning
Part-2 : Well Defined Learning Problems, Designing a Learning System
Part-3 : History of ML, Introduction of Machine Learning Approaches : (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian Network, Support Vector Machine, Genetic Algorithm)
Part-4 : Issues in Machine Learning and Data Science Vs. Machine Learning

PART-1
Learning, Types of Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.1. Define the term learning. What are the components of a learning system ?

Answer
1. Learning refers to the change in a subject's behaviour in a given situation brought about by repeated experiences in that situation, provided that the behaviour changes cannot be explained on the basis of native response tendencies, maturation or temporary states of the subject.
2. A learning agent can be thought of as containing a performance element that decides what actions to take and a learning element that modifies the performance element so that it makes better decisions.
3. The design of a learning element is affected by three major issues :
a. Components of the performance element.
b. Feedback of components.
c. Representation of the components.
The important components of learning are :
[Figure: block diagram with the labels Environment or teacher, Stimuli / examples, Learner component, Feedback, Critic (performance evaluator), Knowledge base, Performance component, Response and Tasks.]
Fig. 1.1.1. General learning model.
1. Acquisition of new knowledge :
a. One component of learning is the acquisition of new knowledge.
b. Simple data acquisition is easy for computers, even though it is difficult for people.
2. Problem solving :
The other component of learning is the problem solving required both to integrate new knowledge presented to the system and to deduce new information when the required facts have not been presented.

Que 1.2. Write down the performance measures for learning.

Answer
The performance measures for learning are :
1. Generality :
a. The most important performance measure for learning methods is the generality or scope of the method.
b. Generality is a measure of the ease with which the method can be adapted to different domains of application.
c. A completely general algorithm is one which is a fixed or self adjusting configuration that can learn or adapt in any environment or application domain.
2. Efficiency :
a. The efficiency of a method is a measure of the average time required to construct the target knowledge structures from some specified initial structures.
b. Since this measure is often difficult to determine and is meaningless without some standard comparison time, a relative efficiency index can be used instead.
3. Robustness :
a. Robustness is the ability of a learning system to function with unreliable feedback and with a variety of training examples, including noisy ones.
b. A robust system must be able to build tentative structures which are subject to modification or withdrawal if later found to be inconsistent with statistically sound structures.
4. Efficacy :
a. The efficacy of a system is a measure of the overall power of the system. It is a combination of the factors generality, efficiency, and robustness.
5. Ease of implementation :
a. Ease of implementation relates to the complexity of the programs and data structures, and the resources required to develop the given learning system.
b. Lacking good complexity metrics, this measure will often be somewhat subjective.

Que 1.3. Discuss supervised and unsupervised learning.

Answer
Supervised learning :
1. Supervised learning is also known as associative learning, in which the network is trained by providing it with input and matching output patterns.
2. Supervised training requires the pairing of each input vector with a target vector representing the desired output.
3. The input vector together with the corresponding target vector is called a training pair.
[Figure: block diagram with the labels Input feature, Target feature, Matching (+ / –), Neural network, Error vector, Supervised learning algorithm and Weight/threshold adjustment.]
Fig. 1.3.1.
4. During the training session an input vector is applied to the network, and it results in an output vector.
5. This response is compared with the target response.
6. If the actual response differs from the target response, the network will generate an error signal.
7. This error signal is then used to calculate the adjustment that should be made in the synaptic weights so that the actual output matches the target output.
8. The error minimization in this kind of training requires a supervisor or teacher.
9. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).
10. Supervised training methods are used to perform non-linear mapping in pattern classification networks, pattern association networks and multilayer neural networks.
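The error-correction loop of points 4 to 7 above can be sketched in a few lines. This is only an illustrative sketch, not the book's algorithm: it assumes a single threshold unit trained with the perceptron learning rule, and the names `train_supervised`, `predict` and `AND_PAIRS` are hypothetical. The training pairs encode the logical AND function.

```python
def predict(weights, bias, x):
    """Apply an input vector to the unit and return its response (0 or 1)."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total >= 0 else 0


def train_supervised(pairs, lr=1.0, epochs=20):
    """Error-correction training on (input vector, target) training pairs.

    Assumption: perceptron learning rule on one threshold unit.
    """
    weights = [0.0] * len(pairs[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in pairs:
            output = predict(weights, bias, x)   # actual response
            error = target - output              # error signal
            # Adjust the weights and the threshold in proportion to the error.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Training pairs: each input vector is paired with its target output.
AND_PAIRS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_supervised(AND_PAIRS)
responses = [predict(weights, bias, x) for x, _ in AND_PAIRS]
```

When the response already equals the target, the error is zero and the weights are left unchanged, so training settles once every training pair is reproduced.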
11. Supervised learning generates a global model that maps input objects to desired outputs.
12. In some cases, the map is implemented as a set of local models such as in case-based reasoning or the nearest neighbour algorithm.
13. To solve a supervised learning problem, the following steps are considered :
i. Determine the type of training examples.
ii. Gather a training set.
iii. Determine the input feature representation of the learned function.
iv. Determine the structure of the learned function and the corresponding learning algorithm.
v. Complete the design.
Unsupervised learning :
1. It is a form of learning in which an output unit is trained to respond to clusters of patterns within the input.
2. Unsupervised training is employed in self-organizing neural networks.
3. This training does not require a teacher.
4. In this method of training, the input vectors of similar types are grouped without the use of training data to specify how a typical member of each group looks or to which group a member belongs.
5. During training the neural network receives input patterns and organizes these patterns into categories.
6. When a new input pattern is applied, the neural network provides an output response indicating the class to which the input pattern belongs.
7. If a class cannot be found for the input pattern, a new class is generated.
8. Though unsupervised training does not require a teacher, it requires certain guidelines to form groups.
9. Grouping can be done based on color, shape or any other property of the object.
10. It is a method of machine learning where a model is fit to observations.
11. It is distinguished from supervised learning by the fact that there is no a priori output.
12. In this, a data set of input objects is gathered.
13. It treats input objects as a set of random variables. It can be used in conjunction with Bayesian inference to produce conditional probabilities.
14. Unsupervised learning is useful for data compression and clustering.
[Figure: block diagram with the labels Environment, Vector describing state of the environment and Learning system.]
Fig. 1.3.2. Block diagram of unsupervised learning.
15. In unsupervised learning, the system is supposed to discover statistically salient features of the input population.
16. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather the system must develop its own representation of the input stimuli.

Que 1.4. Describe reinforcement learning briefly.

Answer
1. Reinforcement learning is the study of how artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms have been developed that are closely related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour, and in neurobiological investigations of neuromodulation and addiction.
[Figure: block diagram with the labels Environment, State (input) vector, Critic, Primary reinforcement signal, Heuristic reinforcement signal, Learning system and Actions.]
Fig. 1.4.1. Block diagram of reinforcement learning.
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment.
5. An optimal policy is a policy that maximizes the expected total reward.
6. Without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make.
7. The agent needs to know that something good has happened when it wins and that something bad has happened when it loses.
8. This kind of feedback is called a reward or reinforcement.
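As a rough illustration of points 4 to 8, the sketch below lets an agent learn which of two actions to prefer purely from rewards of +1 (win) and -1 (loss). It makes simplifying assumptions that are not in the text: there is only a single state, the agent explores with an epsilon-greedy rule, and the win probabilities and the name `learn_policy` are invented for the example.

```python
import random


def learn_policy(win_prob, steps=5000, epsilon=0.1, seed=0):
    """Estimate the expected reward of each action from observed rewards.

    Assumptions: one state only, epsilon-greedy exploration, and an
    environment that pays +1 on a win and -1 on a loss.
    """
    rng = random.Random(seed)
    n_actions = len(win_prob)
    value = [0.0] * n_actions   # running estimate of expected reward
    count = [0] * n_actions     # number of times each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                      # explore
        else:
            action = max(range(n_actions), key=lambda a: value[a])  # exploit
        # The environment answers with a reward (the reinforcement signal).
        reward = 1.0 if rng.random() < win_prob[action] else -1.0
        count[action] += 1
        # Incremental average of the rewards seen for this action.
        value[action] += (reward - value[action]) / count[action]
    return value


values = learn_policy([0.1, 0.9])
best_action = max(range(len(values)), key=lambda a: values[a])
```

The learned policy is simply "choose the action with the highest estimated reward". Problems with several states need methods such as Q-learning, which the syllabus places in Unit 5.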
9. Reinforcement learning is very valuable in the field of robotics, where the tasks to be performed are frequently complex enough to defy encoding as programs and no training data is available.
10. The robot's task consists of finding out, through trial and error (or success), which actions are good in a certain situation and which are not.
11. In many cases humans learn in a very similar way.
12. For example, when a child learns to walk, this usually happens without instruction, simply through reinforcement.
13. Successful attempts at walking are rewarded by forward progress, and unsuccessful attempts are penalized by often painful falls.
14. Positive and negative reinforcement are also important factors in successful learning in school and in many sports.
15. In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels.

Que 1.5. What are the steps used to design a learning system ?

Answer
Steps used to design a learning system are :
1. Specify the learning task.
2. Choose a suitable set of training data to serve as the training experience.
3. Divide the training data into groups or classes and label accordingly.
4. Determine the type of knowledge representation to be learned from the training experience.
5. Choose a learner classifier that can generate general hypotheses from the training data.
6. Apply the learner classifier to test data.
7. Compare the performance of the system with that of an expert human.
[Figure: block diagram with the labels Learner, Environment/Experience, Knowledge and Performance element.]
Fig. 1.5.1.

PART-2
Well Defined Learning Problems, Designing a Learning System.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.6. Write short note on well defined learning problem with example.

Answer
Well defined learning problem :
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Three features in learning problems :
1. The class of tasks (T)
2. The measure of performance to be improved (P)
3. The source of experience (E)
For example :
1. A checkers learning problem :
a. Task (T) : Playing checkers.
b. Performance measure (P) : Percent of games won against opponents.
c. Training experience (E) : Playing practice games against itself.
2. A handwriting recognition learning problem :
a. Task (T) : Recognizing and classifying handwritten words within images.
b. Performance measure (P) : Percent of words correctly classified.
c. Training experience (E) : A database of handwritten words with given classifications.
3. A robot driving learning problem :
a. Task (T) : Driving on public four-lane highways using vision sensors.
b. Performance measure (P) : Average distance travelled before an error (as judged by a human overseer).
c. Training experience (E) : A sequence of images and steering commands recorded while observing a human driver.

Que 1.7. Describe the role of well defined learning problems in machine learning.

Answer
Role of well defined learning problems in machine learning :
1. Learning to recognize spoken words :
a. Successful speech recognition systems employ machine learning in some form.
b. For example, the SPHINX system learns speaker-specific strategies for recognizing the primitive sounds (phonemes) and words from the observed speech signal.
c. Neural network learning methods and methods for learning hidden Markov models are effective for automatically customizing to individual speakers, vocabularies, microphone characteristics, background noise, etc.
2. Learning to drive an autonomous vehicle :
a. Machine learning methods have been used to train computer controlled vehicles to steer correctly when driving on a variety of road types.
b. For example, the ALVINN system has used its learned strategies to drive unassisted at 70 miles per hour for 90 miles on public highways among other cars.
3. Learning to classify new astronomical structures :
a. Machine learning methods have been applied to a variety of large databases to learn general regularities implicit in the data.
b. For example, decision tree learning algorithms have been used by NASA to learn how to classify celestial objects from the second Palomar Observatory Sky Survey.
c. This system is used to automatically classify all objects in the Sky Survey, which consists of three terabytes of image data.
4. Learning to play world class backgammon :
a. The most successful computer programs for playing games such as backgammon are based on machine learning algorithms.
b. For example, the world's top computer program for backgammon, TD-GAMMON, learned its strategy by playing over one million practice games against itself.

PART-3
History of ML, Introduction of Machine Learning Approaches - (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian Network, Support Vector Machine, Genetic Algorithm).

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.8. Describe briefly the history of machine learning.

Answer
A. Early history of machine learning :
1. In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper about neurons, and how they work. They created a model of neurons using an electrical circuit, and thus the neural network was created.
2. In 1952, Arthur Samuel created the first computer program which could learn as it ran.
3. Frank Rosenblatt designed the first artificial neural network in 1958, called Perceptron. The main goal of this was pattern and shape recognition.
4. In 1959, Bernard Widrow and Marcian Hoff created two models of neural network. The first was called ADALINE, and it could detect binary patterns. For example, in a stream of bits, it could predict what the next one would be. The second was called MADALINE, and it could eliminate echo on phone lines.
B. 1980s and 1990s :
1. In 1982, John Hopfield suggested creating a network which had bidirectional lines, similar to how neurons actually work.
2. Use of back propagation in neural networks came in 1986, when researchers from the Stanford psychology department decided to extend an algorithm created by Widrow and Hoff in 1962. This allowed multiple layers to be used in a neural network, creating what are known as 'slow learners', which will learn over a long period of time.
3. In 1997, the IBM computer Deep Blue, which was a chess-playing computer, beat the world chess champion.
4. In 1998, research at AT&T Bell Laboratories on digit recognition resulted in good accuracy in detecting handwritten postcodes from the US Postal Service.
C. 21st Century :
1. Since the start of the 21st century, many businesses have realised that machine learning will increase calculation potential. This is why they are researching more heavily in it, in order to stay ahead of the competition.
2. Some large projects include :
i. GoogleBrain (2012)
ii. AlexNet (2012)
iii. DeepFace (2014)
iv. DeepMind (2014)
v. OpenAI (2015)
vi. ResNet (2015)
vii. U-net (2015)

Que 1.9. Explain briefly the term machine learning.

Answer
1. Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
2. Machine learning focuses on the development of computer programs that can access data.
3. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
4. Machine learning enables analysis of massive quantities of data.
5. It generally delivers faster and more accurate results in order to identify profitable opportunities or dangerous risks.
6. Combining machine learning with AI and cognitive technologies can make it even more effective in processing large volumes of information.

Que 1.10. What are the applications of machine learning ?

Answer
Following are the applications of machine learning :
1. Image recognition :
a. Image recognition is the process of identifying and detecting an object or a feature in a digital image or video.
b. This is used in many applications like systems for factory automation, toll booth monitoring, and security surveillance.
2. Speech recognition :
a. Speech Recognition (SR) is the translation of spoken words into text.
b. It is also known as Automatic Speech Recognition (ASR), computer speech recognition, or Speech To Text (STT).
c. In speech recognition, a software application recognizes spoken words.
3. Medical diagnosis :
a. ML provides methods, techniques, and tools that can help in solving diagnostic and prognostic problems in a variety of medical domains.
b. It is being used for the analysis of the importance of clinical parameters and their combinations for prognosis.
4. Statistical arbitrage :
a. In finance, statistical arbitrage refers to automated trading strategies that are typically short-term and involve a large number of securities.
b. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of quantities such as historical correlations and general economic variables.
5. Learning associations : Learning association is the process of discovering relations between variables in a large database.
6. Extraction :
a. Information Extraction (IE) is another application of machine learning.
b. It is the process of extracting structured information from unstructured data.

Que 1.11. What are the advantages and disadvantages of machine learning ?

Answer
Advantages of machine learning are :
1. Easily identifies trends and patterns :
a. Machine learning can review large volumes of data and discover specific trends and patterns that would not be apparent to humans.
b. For an e-commerce website like Flipkart, it serves to understand the browsing behaviours and purchase histories of its users to help cater to the right products, deals, and reminders relevant to them.
c. It uses the results to show relevant advertisements to them.
2. No human intervention needed (automation) : Machine learning does not require manual effort, i.e., no human intervention is needed.
3. Continuous improvement :
a. As ML algorithms gain experience, they keep improving in accuracy and efficiency.
b. As the amount of data keeps growing, algorithms learn to make accurate predictions faster.
4. Handling multi-dimensional and multi-variety data :
a. Machine learning algorithms are good at handling data that are multi-dimensional and multi-variety, and they can do this in dynamic or uncertain environments.
Disadvantages of machine learning are :
1. Data acquisition :
a. Machine learning requires massive data sets to train on, and these should be inclusive/unbiased, and of good quality.
2. Time and resources :
a. ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a considerable amount of accuracy and relevancy.
b. It also needs massive resources to function.
3. Interpretation of results :
a. To interpret the results generated by the algorithms accurately, we must carefully choose the algorithms for our purpose.
4. High error-susceptibility :
a. Machine learning is autonomous but highly susceptible to errors.
b. It takes time to recognize the source of the issue, and even longer to correct it.

Que 1.12. What are the advantages and disadvantages of different types of machine learning algorithm ?

Answer
Advantages of supervised machine learning algorithm :
1. Classes represent the features on the ground.
2. Training data is reusable unless features change.
Disadvantages of supervised machine learning algorithm :
1. Classes may not match spectral classes.
2. Varying consistency in classes.
3. Cost and time are involved in selecting training data.
Advantages of unsupervised machine learning algorithm :
1. No previous knowledge of the image area is required.
2. The opportunity for human error is minimised.
3. It produces unique spectral classes.
4. Relatively easy and fast to carry out.
Disadvantages of unsupervised machine learning algorithm :
1. The spectral classes do not necessarily represent the features on the ground.
2. It does not consider spatial relationships in the data.
3. It can take time to interpret the spectral classes.
Advantages of semi-supervised machine learning algorithm :
1. It is easy to understand.
2. It reduces the amount of annotated data used.
3. It is stable and converges quickly.
4. It is simple.
5. It has high efficiency.
Disadvantages of semi-supervised machine learning algorithm :
1. Iteration results are not stable.
2. It is not applicable to network level data.
3. It has low accuracy.
Advantages of reinforcement learning algorithm :
1. Reinforcement learning is used to solve complex problems that cannot be solved by conventional techniques.
2. This technique is preferred to achieve long-term results which are very difficult to achieve.
3. This learning model is very similar to the learning of human beings. Hence, it is close to achieving perfection.
Disadvantages of reinforcement learning algorithm :
1. Too much reinforcement learning can lead to an overload of states which can diminish the results.
2. Reinforcement learning is not preferable for solving simple problems.
3. Reinforcement learning needs a lot of data and a lot of computation.
4. The curse of dimensionality limits reinforcement learning for real physical systems.

Que 1.13. Write short note on Artificial Neural Network (ANN).

Answer
1. Artificial Neural Networks (ANN) or neural networks are computational algorithms that are intended to simulate the behaviour of biological systems composed of neurons.
Machine Learning Techniques 1–15 L (CS/IT-Sem-5) Introduction 1–16 L (CS/IT-Sem-5)

2. ANNs are computational models inspired by an animal's central nervous system.
3. An ANN is capable of machine learning as well as pattern recognition.
4. A neural network is an oriented graph. It consists of nodes which in the biological analogy represent neurons, connected by arcs.
5. The arcs correspond to dendrites and synapses. Each arc is associated with a weight.
6. A neural network is a machine learning algorithm based on the model of a human neuron. The human brain consists of billions of neurons.
7. Neurons send and process signals in the form of electrical and chemical signals.
8. These neurons are connected with a special structure known as synapses. Synapses allow neurons to pass signals.
9. An Artificial Neural Network is an information processing technique. It works the way the human brain processes information.
10. An ANN includes a large number of connected processing units that work together to process information. They also generate meaningful results from it.

Que 1.14. Write short note on clustering.

Answer
1. Clustering is a division of data into groups of similar objects.
2. Each group or cluster consists of objects that are similar among themselves and dissimilar to objects of other groups as shown in Fig. 1.14.1.
Fig. 1.14.1. Clusters.
3. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in the other clusters.
4. Clusters may be described as connected regions of a multidimensional space containing a relatively high density of points, separated from each other by a region containing a relatively low density of points.
5. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts.
6. Clustering analyzes data objects without the help of known class labels.
7. In clustering, the class labels are not present in the training data simply because they are not known beforehand.
8. Hence, it is a type of unsupervised learning.
9. For this reason, clustering is a form of learning by observation rather than learning by examples.
10. There are certain situations where clustering is useful. These include :
a. The collection and classification of training data can be costly and time consuming, so it is difficult to collect a fully labelled training data set. When most of a large number of training samples are unlabelled, it is useful to train a supervised classifier with a small labelled portion of the data and then use clustering procedures to tune the classifier based on the large, unclassified dataset.
b. For data mining, it can be useful to search for grouping among the data and then recognize the cluster.
c. The properties of feature vectors can change over time. Then, supervised classification is not reasonable, because the test feature vectors may have completely different properties.
d. Clustering can be useful when it is required to search for good parametric families for the class conditional densities, in case of supervised classification.

Que 1.15. What are the applications of clustering ?

Answer
Following are the applications of clustering :
1. Data reduction :
a. In many cases, the amount of available data is very large and its processing becomes complicated.
b. Cluster analysis can be used to group the data into a number of clusters and then process each cluster as a single entity.
c. In this way, data compression is achieved.
2. Hypothesis generation :
a. In this case, cluster analysis is applied to a data set to infer hypotheses about the nature of the data.
b. Clustering is used here to suggest hypotheses that must be verified using other data sets.
3. Hypothesis testing : In this context, cluster analysis is used for the verification of the validity of a specific hypothesis.
4. Prediction based on groups :
a. In this case, cluster analysis is applied to the available data set and then the resulting clusters are characterized based on the characteristics of the patterns by which they are formed.

b. In this sequence, if an unknown pattern is given, we can determine the cluster to which it is more likely to belong and characterize it based on the characterization of the respective cluster.

Que 1.16. Differentiate between clustering and classification.

Answer
S.No. | Clustering | Classification
1. | Clustering analyzes data objects without known class label. | In classification, data are grouped by analyzing the data objects whose class label is known.
2. | There is no prior knowledge of the attributes of the data to form clusters. | There is some prior knowledge of the attributes of each classification.
3. | It is done by grouping only the input data because output is not predefined. | It is done by classifying output based on the values of the input data.
4. | The number of clusters is not known before clustering. These are identified after the completion of clustering. | The number of classes is known before classification as there is predefined output based on input data.
5. | Figure : data objects with unknown class label. | Figure : data objects with known class label (A, B, C).
6. | It is considered as unsupervised learning because there is no prior knowledge of the class labels. | It is considered as supervised learning because class labels are known before.

Que 1.17. What are the various clustering techniques ?

Answer
1. Clustering techniques are used for combining observed examples into clusters or groups which satisfy the following two main criteria :
a. Each group or cluster is homogeneous i.e., examples that belong to the same group are similar to each other.
b. Each group or cluster should be different from other clusters i.e., examples that belong to one cluster should be different from the examples of the other clusters.
2. Depending on the clustering techniques, clusters can be expressed in different ways :
a. Identified clusters may be exclusive, so that any example belongs to only one cluster.
b. They may be overlapping i.e., an example may belong to several clusters.
c. They may be probabilistic i.e., an example belongs to each cluster with a certain probability.
d. Clusters might have hierarchical structure.
Major classifications of clustering techniques are :
Clustering
Hierarchical : Divisive, Agglomerative
Partitional : Centroid, Model based, Graph theoretic, Spectral
Fig. 1.17.1. Types of clustering.
a. Once a criterion function has been selected, clustering becomes a well-defined problem in discrete optimization. We find those partitions of the set of samples that extremize the criterion function.
c. Since the sample set is finite, there are only a finite number of possible partitions.
d. The clustering problem can always be solved by exhaustive enumeration.
1. Hierarchical clustering :
a. This method works by grouping data objects into a tree of clusters.
b. This method can be further classified depending on whether the hierarchical decomposition is formed in bottom up (merging) or top down (splitting) fashion.
Following are the two types of hierarchical clustering :

a. Agglomerative hierarchical clustering : This bottom up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster.
b. Divisive hierarchical clustering :
i. This top down strategy does the reverse of agglomerative strategy by starting with all objects in one cluster.
ii. It subdivides the cluster into smaller and smaller pieces until each object forms a cluster on its own.
2. Partitional clustering :
a. This method first creates an initial set of partitions where each partition represents a cluster.
b. The clusters are formed to optimize an objective partition criterion such as a dissimilarity function based on distance so that the objects within a cluster are similar whereas the objects of different clusters are dissimilar.
Following are the types of partitioning methods :
a. Centroid based clustering :
i. This method takes an input parameter and partitions a set of objects into a number of clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
ii. Cluster similarity is measured in terms of the mean value of the objects in the cluster, which can be viewed as the cluster's centroid or center of gravity.
b. Model-based clustering : This method hypothesizes a model for each of the clusters and finds the best fit of the data to that model.

Que 1.18. Describe reinforcement learning.

Answer
1. Reinforcement learning is the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms are related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour, and in neurobiological investigations of neuromodulation and addiction.
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment. An optimal policy is a policy that maximizes the expected total reward.

Que 1.19. Explain decision tree in detail.

Answer
1. A decision tree is a flowchart structure in which each internal node represents a test on a feature, each leaf node represents a class label and branches represent conjunctions of features that lead to those class labels.
2. The paths from root to leaf represent classification rules.
3. Fig. 1.19.1 illustrates the basic flow of a decision tree for decision making with labels (Rain(Yes), Rain(No)).
Outlook
Sunny -> Humidity : High -> No, Normal -> Yes
Overcast -> Yes
Rain -> Wind : Strong -> No, Weak -> Yes
Fig. 1.19.1.
4. Decision tree is the predictive modelling approach used in statistics, data mining and machine learning.
5. Decision trees are constructed via an algorithmic approach that identifies the ways to split a data set based on different conditions.
6. Decision trees are a non-parametric supervised learning method used for both classification and regression tasks.
7. Classification trees are the tree models where the target variable can take a discrete set of values.
8. Regression trees are the decision trees where the target variable can take a continuous set of values.

Que 1.20. What are the steps used for making decision tree ?

Answer
Steps used for making decision tree are :
1. Get list of rows (dataset) which are taken into consideration for making decision tree (recursively at each node).

2. Calculate the uncertainty of the dataset, i.e., its Gini impurity, which measures how mixed up the class labels are.
3. Generate the list of all questions which need to be asked at that node.
4. Partition rows into True rows and False rows based on each question asked.
5. Calculate information gain based on Gini impurity and partition of data from previous step.
6. Keep track of the highest information gain found over the questions asked.
7. Keep the question that gives the highest information gain.
8. Divide the node on that question. Repeat again from step 1 until we get pure nodes (leaf nodes).

Que 1.21. What are the advantages and disadvantages of decision tree method ?

Answer
Advantages of decision tree method are :
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical variables.
4. Decision trees provide a clear indication of the fields that are important for prediction or classification.
Disadvantages of decision tree method are :
1. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
2. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
3. Decision trees are computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found.
4. In decision tree algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared.

Que 1.22. Write short note on Bayesian belief networks.

Answer
1. Bayesian belief networks specify joint conditional probability distributions.
2. They are also known as belief networks, Bayesian networks, or probabilistic networks.
3. A belief network allows class conditional independencies to be defined between subsets of variables.
4. It provides a graphical model of causal relationship on which learning can be performed.
5. We can use a trained Bayesian network for classification.
6. There are two components that define a Bayesian belief network :
a. Directed acyclic graph :
i. Each node in a directed acyclic graph represents a random variable.
ii. These variables may be discrete or continuous valued.
iii. These variables may correspond to the actual attributes given in the data.
Directed acyclic graph representation : The following diagram shows a directed acyclic graph for six Boolean variables.
i. The arcs in the diagram allow representation of causal knowledge.
ii. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker.
Figure : directed acyclic graph with nodes Family History, Smoker, Lung Cancer, Emphysema, Positive X-ray, Dyspnea.
iii. It is worth noting that the variable Positive X-ray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
b. Conditional probability table :
The conditional probability table for the values of the variable LungCancer (LC) showing each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is as follows :
 | FH, S | FH, -S | -FH, S | -FH, -S
LC | 0.8 | 0.5 | 0.7 | 0.1
-LC | 0.2 | 0.5 | 0.3 | 0.9
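
The conditional probability table for LungCancer can be queried in a few lines of code. The following is a minimal sketch, not a full Bayesian network implementation: it stores the table as a dictionary and marginalises over the two parent nodes. The prior probabilities `P_FH` and `P_S` are hypothetical values chosen only for illustration (the text does not give them), and FamilyHistory and Smoker are assumed to be independent because neither has a parent in the graph.

```python
# CPT from the text: P(LungCancer = yes | FamilyHistory, Smoker).
CPT_LC = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors (assumptions for illustration, not from the text).
P_FH = 0.1  # P(FamilyHistory = yes)
P_S = 0.3   # P(Smoker = yes)


def p_lc_given(lc, fh, s):
    """P(LungCancer = lc | FamilyHistory = fh, Smoker = s)."""
    p_yes = CPT_LC[(fh, s)]
    return p_yes if lc else 1.0 - p_yes


def p_parents(fh, s):
    """Joint prior of the two root nodes, assumed independent."""
    return (P_FH if fh else 1.0 - P_FH) * (P_S if s else 1.0 - P_S)


def p_lung_cancer():
    """Marginal P(LungCancer = yes), summing over both parents."""
    return sum(
        p_lc_given(True, fh, s) * p_parents(fh, s)
        for fh in (True, False)
        for s in (True, False)
    )


def p_smoker_given_lc():
    """Posterior P(Smoker = yes | LungCancer = yes) by Bayes theorem."""
    numerator = sum(
        p_lc_given(True, fh, True) * p_parents(fh, True)
        for fh in (True, False)
    )
    return numerator / p_lung_cancer()


print(round(p_lung_cancer(), 3))
print(round(p_smoker_given_lc(), 3))
```

Each column of the table sums to 1, which is why `p_lc_given` can derive the -LC row from the LC row instead of storing both.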

Que 1.23. Write a short note on support vector machine.

Answer
1. A Support Vector Machine (SVM) is a machine learning algorithm that analyzes data for classification and regression analysis.
2. SVM is a supervised learning method that looks at data and sorts it into one of two categories.
3. An SVM outputs a map of the sorted data with the margin between the two categories as wide as possible.
4. Applications of SVM :
i. Text and hypertext classification
ii. Image classification
iii. Recognizing handwritten characters
iv. Biological sciences, including protein classification
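
The idea of sorting data into two categories with a wide margin can be illustrated with a small linear SVM. The following is a minimal sketch, assuming a linear decision boundary trained by subgradient descent on the regularised hinge loss; this is one common way to fit a maximum-margin separator and is not claimed to be the method intended by the text. The function names, the toy data, and the hyperparameters `lam`, `lr` and `epochs` are illustrative assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on
    (lam / 2) * ||w||^2 + hinge loss, one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Only the regulariser contributes to the subgradient.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data (hypothetical): labels are +1 and -1.
X = [(2, 2), (3, 3), (2, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3), (-3, -2)]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The regularisation term keeps the weights small, which corresponds to keeping the margin wide; the hinge term penalises points that fall inside the margin or on the wrong side.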
Que 1.24. Explain genetic algorithm with flow chart.

Answer
Genetic algorithm (GA) :
1. The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection.
2. The genetic algorithm repeatedly modifies a population of individual solutions.
3. At each step, the genetic algorithm selects individuals at random from the current population to be parents and uses them to produce the children for the next generation.
4. Over successive generations, the population evolves toward an optimal solution.
Flow chart : The genetic algorithm uses three main types of rules at each step to create the next generation from the current population :
a. Selection rule : Selection rules select the individuals, called parents, that contribute to the population at the next generation.
b. Crossover rule : Crossover rules combine two parents to form children for the next generation.
c. Mutation rule : Mutation rules apply random changes to individual parents to form children.
Flow chart boxes : Start, Initialization, Initial population, Selection, Quit ? (Yes / No), Crossover, Mutation, New population, Old population, End.
Fig. 1.24.1.

PART-4
Issues in Machine Learning and Data Science Vs. Machine Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.25. Briefly explain the issues related with machine learning.

Answer
Issues related with machine learning are :
1. Data quality :
a. It is essential to have good quality data to produce quality ML algorithms and models.
b. To get high-quality data, we must implement data evaluation, integration, exploration, and governance techniques prior to developing ML models.
c. Accuracy of ML is driven by the quality of the data.
2. Transparency :
a. It is difficult to make definitive statements on how well a model is going to generalize in new environments.
3. Manpower :
a. Manpower means having data and being able to use it. This does not introduce bias into the model.
b. There should be enough skill sets in the organization for software development and data collection.
4. Other :
a. The most common issue with ML is people using it where it does not belong.
b. Every time there is some new innovation in ML, we see overzealous engineers trying to use it where it is not really necessary.
c. This used to happen a lot with deep learning and neural networks.
d. Traceability and reproduction of results are two main issues.

Que 1.26. What are the classes of problem in machine learning ?

Answer
Common classes of problem in machine learning :
1. Classification :
a. In classification data is labelled i.e., it is assigned a class, for example, spam/non-spam or fraud/non-fraud.
b. The decision being modelled is to assign labels to new unlabelled pieces of data.
c. This can be thought of as a discrimination problem, modelling the differences or similarities between groups.
2. Regression :
a. In regression, data is labelled with a real value rather than a class label.
b. The decision being modelled is what value to predict for new unpredicted data.
3. Clustering :
a. In clustering data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data.
b. For example, organising pictures by faces without names, where the human user has to assign names to groups, like iPhoto on the Mac.
4. Rule extraction :
a. In rule extraction, data is used as the basis for the extraction of propositional rules.
b. These rules discover statistically supportable relationships between attributes in the data.

Que 1.27. Differentiate between data science and machine learning.

Answer
S. No. | Data science | Machine learning
1. | Data science is a concept used to tackle big data and includes data cleansing, preparation, and analysis. | Machine learning is defined as the practice of using algorithms to use data, learn from it and then forecast future trends for that topic.
2. | It includes various data operations. | It is a subset of Artificial Intelligence.
3. | Data science works by sourcing, cleaning, and processing data to extract meaning out of it for analytical purposes. | Machine learning uses efficient programs that can use data without being explicitly told to do so.
4. | SAS, Tableau, Apache Spark, MATLAB are the tools used in data science. | Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio are the tools used in ML.
5. | Data science deals with structured and unstructured data. | Machine learning uses statistical models.
6. | Fraud detection and healthcare analysis are examples of data science. | Recommendation systems such as Spotify and Facial Recognition are examples of machine learning.
Machine Learning Techniques 2–1 L (CS/IT-Sem-5) Regression & Bayesian Learning 2–2 L (CS/IT-Sem-5)

2
Regression and Bayesian Learning

CONTENTS
Part-1 : Regression, Linear Regression and Logistic Regression
Part-2 : Bayesian Learning, Bayes Theorem, Concept Learning, Bayes Optimal Classifier, Naive Bayes Classifier, Bayesian Belief Networks, EM Algorithm
Part-3 : Support Vector Machine, Introduction, Types of Support Vector Kernel - (Linear Kernel, Polynomial Kernel, and Gaussian Kernel), Hyperplane - (Decision Surface), Properties of SVM, and Issues in SVM

PART-1
Regression, Linear Regression and Logistic Regression.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.1. Define the term regression with its type.

Answer
1. Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
2. Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.
There are two types of regression :
a. Simple linear regression : It uses one independent variable to explain or predict the outcome of dependent variable Y.
Y = a + bX + u
b. Multiple linear regression : It uses two or more independent variables to predict outcomes.
Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
Where :
Y = The variable that we are trying to predict (dependent variable).
X = The variable that we are using to predict Y (independent variable).
a = The intercept.
b = The slope.
u = The regression residual.
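
The simple linear regression model Y = a + bX + u can be fitted by ordinary least squares. The following is a minimal sketch, assuming the usual closed-form estimates b = Sxy / Sxx and a = mean(Y) - b * mean(X); the function names and the data points are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) minimising the sum of squared residuals
    of the model Y = a + bX + u."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_mean - b * x_mean  # intercept
    return a, b


def residuals(xs, ys, a, b):
    """Regression residuals u = Y - (a + bX) for every observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical observations scattered around Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_simple_linear_regression(xs, ys)
u = residuals(xs, ys, a, b)
print(a, b)
```

With an intercept in the model, the least-squares residuals always sum to zero (up to rounding), which is a quick sanity check on the fit.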
Que 2.2. Describe briefly linear regression.

Answer
1. Linear regression is a supervised machine learning algorithm where
the predicted output is continuous and has a constant slope.
2. It is used to predict values within a continuous range, (for example :
sales, price) rather than trying to classify them into categories (for
example : cat, dog).

3. Following are the types of linear regression :
a. Simple regression :
i. Simple linear regression uses the traditional slope-intercept form to produce a prediction,
y = mx + b
where m and b are the parameters (slope and intercept) that the algorithm learns, x represents our input data and y represents our prediction.
b. Multivariable regression :
i. A multi-variable linear equation is given below, where w represents the coefficients, or weights :
f(x, y, z) = w1x + w2y + w3z
ii. The variables x, y, z represent the attributes, or distinct pieces of information, that we have about each observation.
iii. For sales predictions, these attributes might include a company's advertising spend on radio, TV, and newspapers.
Sales = w1 Radio + w2 TV + w3 Newspapers

Que 2.3. Explain logistic regression.

Answer
1. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable.
2. The nature of target or dependent variable is dichotomous, which means there would be only two possible classes.
3. The dependent variable is binary in nature having data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).
4. A logistic regression model predicts P(Y = 1) as a function of X. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, diabetes prediction, cancer detection etc.

Que 2.4. What are the types of logistic regression ?

Answer
Logistic regression can be divided into following types :
1. Binary (Binomial) regression :
a. In this classification, a dependent variable will have only two possible types, either 1 or 0.
b. For example, these variables may represent success or failure, yes or no, win or loss etc.
2. Multinomial regression :
a. In this classification, dependent variable can have three or more possible unordered types or the types having no quantitative significance.
b. For example, these variables may represent "Type A" or "Type B" or "Type C".
3. Ordinal regression :
a. In this classification, dependent variable can have three or more possible ordered types or the types having a quantitative significance.
b. For example, these variables may represent "poor", "good", "very good" or "excellent", and each category can have a score like 0, 1, 2, 3.

Que 2.5. Differentiate between linear regression and logistic regression.

Answer
S. No. | Linear regression | Logistic regression
1. | Linear regression is a supervised regression model. | Logistic regression is a supervised classification model.
2. | In linear regression, we predict the value as a continuous number. | In logistic regression, we predict the value as 1 or 0.
3. | No activation function is used. | Activation function is used to convert a linear regression equation to the logistic regression equation.
4. | No threshold value is needed. | A threshold value is added.
5. | It is based on the least square estimation. | The dependent variable consists of only two categories.
6. | Linear regression is used to estimate the dependent variable in case of a change in independent variables. | Logistic regression is used to calculate the probability of an event.
7. | Linear regression assumes the normal or Gaussian distribution of the dependent variable. | Logistic regression assumes the binomial distribution of the dependent variable.

PART-2
Bayesian Learning, Bayes Theorem, Concept Learning, Bayes Optimal Classifier, Naive Bayes Classifier, Bayesian Belief Networks, EM Algorithm.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 2.6. Explain Bayesian learning. Explain two category classification.

Answer
Bayesian learning :
1. Bayesian learning is a fundamental statistical approach to the problem of pattern classification.
2. This approach is based on quantifying the tradeoffs between various classification decisions using probability and costs that accompany such decisions.
3. Because the decision problem is solved in probabilistic terms, it is assumed that all the relevant probabilities are known.
4. For this we define the state of nature of the things present in the particular pattern. We denote the state of nature by $\omega$.
Two category classification :
1. Let $\omega_1$, $\omega_2$ be the two classes of the patterns. It is assumed that the a priori probabilities $p(\omega_1)$ and $p(\omega_2)$ are known.
2. Even if they are not known, they can easily be estimated from the available training feature vectors.
3. If N is the total number of available training patterns and $N_1$, $N_2$ of them belong to $\omega_1$ and $\omega_2$, respectively, then $p(\omega_1) \approx N_1/N$ and $p(\omega_2) \approx N_2/N$.
4. The conditional probability density functions $p(x|\omega_i)$, i = 1, 2 are also assumed to be known; they describe the distribution of the feature vectors in each of the classes.
5. The feature vectors can take any value in the l-dimensional feature space.
6. Density functions $p(x|\omega_i)$ become probabilities and will be denoted by $P(x|\omega_i)$ when the feature vectors can take only discrete values.
7. Consider the conditional probability,
$p(\omega_i|x) = \dfrac{p(x|\omega_i)\, p(\omega_i)}{p(x)}$ ...(2.6.1)
where p(x) is the probability density function of x and for which we have
$p(x) = \sum_{i=1}^{2} p(x|\omega_i)\, p(\omega_i)$ ...(2.6.2)
8. Now, the Bayes classification rule can be defined as :
a. If $p(\omega_1|x) > p(\omega_2|x)$, x is classified to $\omega_1$
b. If $p(\omega_1|x) < p(\omega_2|x)$, x is classified to $\omega_2$ ...(2.6.3)
9. In the case of equality the pattern can be assigned to either of the two classes. Using equation (2.6.1), the decision can equivalently be based on the inequalities :
a. $p(x|\omega_1)\, p(\omega_1) > p(x|\omega_2)\, p(\omega_2)$
b. $p(x|\omega_1)\, p(\omega_1) < p(x|\omega_2)\, p(\omega_2)$ ...(2.6.4)
10. Here p(x) is not taken because it is the same for all classes and it does not affect the decision.
11. Further, if the a priori probabilities are equal, i.e.,
a. $p(\omega_1) = p(\omega_2) = 1/2$, then Eq. (2.6.4) becomes,
b. $p(x|\omega_1) > p(x|\omega_2)$
c. $p(x|\omega_1) < p(x|\omega_2)$
12. For example, in Fig. 2.6.1, two equiprobable classes are presented, showing the variations of $p(x|\omega_i)$, i = 1, 2 as functions of x for the simple case of a single feature (l = 1).
13. The dotted line at $x_0$ is a threshold which partitions the space into two regions, $R_1$ and $R_2$. According to the Bayes decision rule, for all values of x in $R_1$ the classifier decides $\omega_1$ and for all values in $R_2$ it decides $\omega_2$.
14. From Fig. 2.6.1, it is obvious that errors are unavoidable. There is a finite probability for an x to lie in the $R_2$ region and at the same time to belong to class $\omega_1$. Then there is an error in the decision.
Figure : curves $p(x|\omega_1)$ and $p(x|\omega_2)$ plotted against x, with threshold $x_0$ separating regions $R_1$ and $R_2$ and the error region shaded.
Fig. 2.6.1. Bayesian classifier for the case of two equiprobable classes.
15. The total probability, $P_e$, of committing a decision error for two equiprobable classes is given by
$P_e = \dfrac{1}{2} \int_{-\infty}^{x_0} p(x|\omega_2)\, dx + \dfrac{1}{2} \int_{x_0}^{+\infty} p(x|\omega_1)\, dx$
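
The error probability for two equiprobable classes can be checked numerically. The following is a minimal sketch, assuming that both class-conditional densities are Gaussians with the same standard deviation (an assumption made here purely for illustration; the text does not specify the densities). Under that assumption the threshold x0 lies midway between the two means. The function names and the numerical values are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def bayes_error_equal_priors(mean1, mean2, std):
    """Error probability for two equiprobable Gaussian classes, mean1 < mean2.

    R1 is x < x0 (decide class 1) and R2 is x > x0 (decide class 2).
    """
    x0 = 0.5 * (mean1 + mean2)  # threshold where the two densities are equal
    # Mass of class 2 that falls in R1, and mass of class 1 that falls in R2.
    err_from_class2 = normal_cdf(x0, mean2, std)
    err_from_class1 = 1.0 - normal_cdf(x0, mean1, std)
    return 0.5 * err_from_class2 + 0.5 * err_from_class1


def classify(x, mean1, mean2):
    """With equal priors and equal variances, pick the class with the nearer mean."""
    return 1 if abs(x - mean1) <= abs(x - mean2) else 2


print(bayes_error_equal_priors(0.0, 2.0, 1.0))
```

Moving the two means further apart shrinks the overlap of the densities, so the computed error probability decreases.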

$P_e$ is equal to the total shaded area under the curves in Fig. 2.6.1.

Que 2.7. Explain how the decision error for Bayesian classification can be minimized.

Answer
1. Bayesian classifier can be made optimal by minimizing the classification error probability.
2. In Fig. 2.7.1, it is observed that when the threshold is moved away from $x_0$, the corresponding shaded area under the curves always increases.
3. Hence, we have to decrease this shaded area to minimize the error.
4. Let $R_1$ be the region of the feature space for $\omega_1$ and $R_2$ be the corresponding region for $\omega_2$.
5. Then an error occurs if $x \in R_1$ although it belongs to $\omega_2$, or if $x \in R_2$ although it belongs to $\omega_1$, i.e.,
$P_e = p(x \in R_2, \omega_1) + p(x \in R_1, \omega_2)$ ...(2.7.1)
6. $P_e$ can be written as,
$P_e = p(x \in R_2|\omega_1)\, p(\omega_1) + p(x \in R_1|\omega_2)\, p(\omega_2)$
$= p(\omega_1) \int_{R_2} p(x|\omega_1)\, dx + p(\omega_2) \int_{R_1} p(x|\omega_2)\, dx$ ...(2.7.2)
7. Using the Bayes rule,
$P_e = \int_{R_2} p(\omega_1|x)\, p(x)\, dx + \int_{R_1} p(\omega_2|x)\, p(x)\, dx$ ...(2.7.3)
8. The error will be minimized if the partitioning regions $R_1$ and $R_2$ of the feature space are chosen so that
$R_1 : p(\omega_1|x) > p(\omega_2|x)$
$R_2 : p(\omega_2|x) > p(\omega_1|x)$ ...(2.7.4)
9. Since the union of the regions $R_1$, $R_2$ covers all the space, we have
$\int_{R_1} p(\omega_1|x)\, p(x)\, dx + \int_{R_2} p(\omega_1|x)\, p(x)\, dx = p(\omega_1)$ ...(2.7.5)
10. Combining equations (2.7.3) and (2.7.5), we get,
$P_e = p(\omega_1) - \int_{R_1} \left( p(\omega_1|x) - p(\omega_2|x) \right) p(x)\, dx$ ...(2.7.6)
11. Thus, the probability of error is minimized if $R_1$ is the region of space in which $p(\omega_1|x) > p(\omega_2|x)$. Then $R_2$ becomes the region where the reverse is true.
12. In a classification task with M classes, $\omega_1, \omega_2, ..., \omega_M$, an unknown pattern, represented by the feature vector x, is assigned to class $\omega_i$ if $p(\omega_i|x) > p(\omega_j|x)$ $\forall\, j \neq i$.

Que 2.8. Consider the Bayesian classifier for the uniformly distributed classes, where :
$P(x|\omega_1) = \begin{cases} \dfrac{1}{a_2 - a_1}, & x \in [a_1, a_2] \\ 0, & \text{otherwise} \end{cases}$
$P(x|\omega_2) = \begin{cases} \dfrac{1}{b_2 - b_1}, & x \in [b_1, b_2] \\ 0, & \text{otherwise} \end{cases}$
Show the classification results for some values of a and b.

Answer
Typical cases are presented in Fig. 2.8.1.
Figure : panels (a) to (d), each plotting the two uniform densities of heights $1/(a_2 - a_1)$ and $1/(b_2 - b_1)$ against x for a different ordering of $a_1$, $a_2$, $b_1$, $b_2$.
Fig. 2.8.1.

Que 2.9. Define Bayes classifier. Explain how classification is done by using Bayes classifier.

Answer
1. A Bayes classifier is a simple probabilistic classifier based on applying Bayes theorem (from Bayesian statistics) with strong (Naive) independence assumptions.

2. A Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
3. Depending on the precise nature of the probability model, Naive Bayes classifiers can be trained very efficiently in a supervised learning setting.
4. In many practical applications, parameter estimation for Naive Bayes models uses the method of maximum likelihood; in other words, one can work with the Naive Bayes model without believing in Bayesian probability or using any Bayesian methods.
5. An advantage of the Naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
6. The perceptron bears a certain relationship to a classical pattern classifier known as the Bayes classifier.
7. When the environment is Gaussian, the Bayes classifier reduces to a linear classifier. In the Bayes classifier, or Bayes hypothesis testing procedure, we minimize the average risk, denoted by R. For a two-class problem, represented by classes C1 and C2, the average risk is defined as :
$R = C_{11} P_1 \int_{H_1} P_x(x / C_1)\, dx + C_{22} P_2 \int_{H_2} P_x(x / C_2)\, dx + C_{21} P_1 \int_{H_2} P_x(x / C_1)\, dx + C_{12} P_2 \int_{H_1} P_x(x / C_2)\, dx$
where the various terms are defined as follows :
Pi = Prior probability that the observation vector x is drawn from subspace Hi, with i = 1, 2, and P1 + P2 = 1
Cij = Cost of deciding in favour of class Ci represented by subspace Hi when class Cj is true, with i, j = 1, 2
Px(x/Ci) = Conditional probability density function of the random vector X, given class Ci
8. Fig. 2.9.1(a) depicts a block diagram representation of the Bayes classifier. The important points in this block diagram are :
a. The data processing in designing the Bayes classifier is confined entirely to the computation of the likelihood ratio Λ(x).
b. This computation is completely invariant to the values assigned to the prior probabilities and costs involved in the decision-making process. These quantities merely affect the value of the threshold ξ.
c. From a computational point of view, we find it more convenient to work with the logarithm of the likelihood ratio rather than the likelihood ratio itself.
[Fig. 2.9.1. Two equivalent implementations of the Bayes classifier : (a) Likelihood ratio test : the input vector x is fed to a likelihood ratio computer, and a comparator assigns x to class C1 if Λ(x) > ξ, otherwise to class C2. (b) Log-likelihood ratio test : the same arrangement, with the comparison log Λ(x) > log ξ.]
Que 2.10. Discuss Bayes classifier using some example in detail.
Answer
Bayes classifier : Refer Q. 2.9, Page 2–8L, Unit-2.
For example :
1. Let D be a training set of features and their associated class labels. Each feature is represented by an n-dimensional attribute vector X = (x1, x2, ...., xn) depicting n measurements made on the feature from n attributes, respectively A1, A2, ....., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a feature X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the classifier predicts that X belongs to class Ci if and only if
$p(C_i \mid X) > p(C_j \mid X)$ for $1 \le j \le m,\ j \ne i$
Thus, we maximize p(Ci|X). The class Ci for which p(Ci|X) is maximized is called the maximum posterior hypothesis. By Bayes theorem,
$p(C_i \mid X) = \frac{p(X \mid C_i)\, p(C_i)}{p(X)}$
3. As p(X) is constant for all classes, only p(X|Ci) p(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e., p(C1) = p(C2) = .... = p(Cm), and therefore p(X|Ci) is maximized. Otherwise p(X|Ci) p(Ci) is maximized.
4. i. Given data sets with many attributes, the computation of p(X|Ci) will be extremely expensive.
ii. To reduce computation in evaluating p(X|Ci), the assumption of class conditional independence is made.
iii. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the feature.
Thus, $p(X \mid C_i) = \prod_{k=1}^{n} p(x_k \mid C_i) = p(x_1 \mid C_i) \times p(x_2 \mid C_i) \times \cdots \times p(x_n \mid C_i)$
iv. The probabilities p(x1|Ci), p(x2|Ci),...., p(xn|Ci) are easily estimated from the training features. Here xk refers to the value of attribute Ak. For each attribute, it is checked whether the attribute is categorical or continuous valued.
v. For example, to compute p(X|Ci) we consider,
a. If Ak is categorical, then p(xk|Ci) is the number of features of class Ci in D having the value xk for Ak, divided by |Ci, D|, the number of features of class Ci in D.
b. If Ak is continuous valued, then the attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by
$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
so that p(xk|Ci) = g(xk).
vi. There is a need to compute the mean μ and the standard deviation σ of the values of attribute Ak for the training set of class Ci. These values are used to estimate p(xk|Ci).
vii. For example, let X = (35, Rs. 40,000) where A1 and A2 are the attributes age and income, respectively. Let the class label attribute be buys-computer.
viii. The associated class label for X is yes (i.e., buys-computer = yes). Let us suppose that age has not been discretized and therefore exists as a continuous valued attribute.
ix. Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 and σ = 12.
5. In order to predict the class label of X, p(X|Ci) p(Ci) is evaluated for each class Ci. The classifier predicts that the class label of X is the class Ci if and only if
$p(X \mid C_i)\, p(C_i) > p(X \mid C_j)\, p(C_j)$ for $1 \le j \le m,\ j \ne i$
The predicted class label is the class Ci for which p(X|Ci) p(Ci) is the maximum.
Que 2.11. Let blue, green, and red be three classes of objects with prior probabilities given by P(blue) = 1/4, P(green) = 1/2, P(red) = 1/4. Let there be three types of objects : pencils, pens, and paper. Let the class-conditional probabilities of these objects be given as follows. Use Bayes classifier to classify pencil, pen and paper.
P(pencil/green) = 1/3 P(pen/green) = 1/2 P(paper/green) = 1/6
P(pencil/blue) = 1/2 P(pen/blue) = 1/6 P(paper/blue) = 1/3
P(pencil/red) = 1/6 P(pen/red) = 1/3 P(paper/red) = 1/2
Answer
As per Bayes rule :
$P(\text{green}/\text{pencil}) = \frac{P(\text{pencil}/\text{green})\, P(\text{green})}{P(\text{pencil}/\text{green})\, P(\text{green}) + P(\text{pencil}/\text{blue})\, P(\text{blue}) + P(\text{pencil}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{3} \times \frac{1}{2}}{\frac{1}{3} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{4} + \frac{1}{6} \times \frac{1}{4}} = \frac{1/6}{0.333} = 0.5$
$P(\text{blue}/\text{pencil}) = \frac{P(\text{pencil}/\text{blue})\, P(\text{blue})}{P(\text{pencil}/\text{green})\, P(\text{green}) + P(\text{pencil}/\text{blue})\, P(\text{blue}) + P(\text{pencil}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{2} \times \frac{1}{4}}{0.333} = \frac{1/8}{0.333} = 0.375$
$P(\text{red}/\text{pencil}) = \frac{P(\text{pencil}/\text{red})\, P(\text{red})}{P(\text{pencil}/\text{red})\, P(\text{red}) + P(\text{pencil}/\text{blue})\, P(\text{blue}) + P(\text{pencil}/\text{green})\, P(\text{green})}$
$= \frac{\frac{1}{6} \times \frac{1}{4}}{0.333} = \frac{1/24}{0.333} = 0.125$
Since P(green/pencil) has the highest value, pencil belongs to class green.
$P(\text{green}/\text{pen}) = \frac{P(\text{pen}/\text{green})\, P(\text{green})}{P(\text{pen}/\text{green})\, P(\text{green}) + P(\text{pen}/\text{blue})\, P(\text{blue}) + P(\text{pen}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{2} \times \frac{1}{2}}{\frac{1}{2} \times \frac{1}{2} + \frac{1}{6} \times \frac{1}{4} + \frac{1}{3} \times \frac{1}{4}} = \frac{1/4}{0.375} = 0.667$
$P(\text{blue}/\text{pen}) = \frac{P(\text{pen}/\text{blue})\, P(\text{blue})}{P(\text{pen}/\text{green})\, P(\text{green}) + P(\text{pen}/\text{blue})\, P(\text{blue}) + P(\text{pen}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{6} \times \frac{1}{4}}{0.375} = \frac{1/24}{0.375} = 0.111$
$P(\text{red}/\text{pen}) = \frac{P(\text{pen}/\text{red})\, P(\text{red})}{P(\text{pen}/\text{green})\, P(\text{green}) + P(\text{pen}/\text{blue})\, P(\text{blue}) + P(\text{pen}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{3} \times \frac{1}{4}}{0.375} = \frac{1/12}{0.375} = 0.222$
Since P(green/pen) has the highest value, pen belongs to class green.
$P(\text{green}/\text{paper}) = \frac{P(\text{paper}/\text{green})\, P(\text{green})}{P(\text{paper}/\text{green})\, P(\text{green}) + P(\text{paper}/\text{blue})\, P(\text{blue}) + P(\text{paper}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{6} \times \frac{1}{2}}{\frac{1}{6} \times \frac{1}{2} + \frac{1}{3} \times \frac{1}{4} + \frac{1}{2} \times \frac{1}{4}} = \frac{1/12}{1/12 + 1/12 + 1/8} = \frac{1/12}{0.292} = 0.286$
$P(\text{blue}/\text{paper}) = \frac{P(\text{paper}/\text{blue})\, P(\text{blue})}{P(\text{paper}/\text{green})\, P(\text{green}) + P(\text{paper}/\text{blue})\, P(\text{blue}) + P(\text{paper}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{3} \times \frac{1}{4}}{0.292} = \frac{1/12}{0.292} = 0.286$
$P(\text{red}/\text{paper}) = \frac{P(\text{paper}/\text{red})\, P(\text{red})}{P(\text{paper}/\text{green})\, P(\text{green}) + P(\text{paper}/\text{blue})\, P(\text{blue}) + P(\text{paper}/\text{red})\, P(\text{red})}$
$= \frac{\frac{1}{2} \times \frac{1}{4}}{0.292} = \frac{1/8}{0.292} = 0.429$
Since P(red/paper) has the highest value, paper belongs to class red.
Que 2.12. Explain Naive Bayes classifier.
Answer
1. Naive Bayes model is the most common Bayesian network model used in machine learning.
2. Here, the class variable C is the root which is to be predicted and the attribute variables Xi are the leaves.
3. The model is Naive because it assumes that the attributes are conditionally independent of each other, given the class.
[Fig. 2.12.1. The learning curve for Naive Bayes learning : proportion correct on test set against training set size (0 to 100), with one curve for a decision tree and one for Naive Bayes.]
4. Assuming Boolean variables, the parameters are :
θ = P(C = true), θi1 = P(Xi = true | C = true), θi2 = P(Xi = true | C = false)
5. Naive Bayes models can be viewed as Bayesian networks in which each Xi has C as the sole parent and C has no parents.
6. A Naive Bayes model with Gaussian P(Xi|C) is equivalent to a mixture of Gaussians with diagonal covariance matrices.
7. While mixtures of Gaussians are used for density estimation in continuous domains, Naive Bayes models are used in discrete and mixed domains.
8. Naive Bayes models allow for very efficient inference of marginal and conditional distributions.
9. Naive Bayes learning has no difficulty with noisy data and can give more appropriate probabilistic predictions.
Que 2.13. Consider a two-class (Tasty or non-Tasty) problem with the following training data. Use Naive Bayes classifier to classify the pattern :
"Cook = Asha, Health-Status = Bad, Cuisine = Continental".
Cook    Health-Status    Cuisine        Tasty
Asha    Bad              Indian         Yes
Asha    Good             Continental    Yes
Sita    Bad              Indian         No
Sita    Good             Indian         Yes
Usha    Bad              Indian         Yes
Usha    Bad              Continental    No
Sita    Bad              Continental    No
Sita    Good             Continental    Yes
Usha    Good             Indian         Yes
Usha    Good             Continental    No

Answer
Counts of each attribute value for Tasty = Yes and Tasty = No :

Cook    Yes    No
Asha    2      0
Sita    2      2
Usha    2      2

Health-Status    Yes    No
Bad              2      3
Good             4      1

Cuisine        Yes    No
Indian         4      1
Continental    2      3

Tasty    Yes    No
         6      4

Corresponding probabilities :

Cook    Yes    No
Asha    2/6    0
Sita    2/6    2/4
Usha    2/6    2/4

Health-Status    Yes    No
Bad              2/6    3/4
Good             4/6    1/4

Cuisine        Yes    No
Indian         4/6    1/4
Continental    2/6    3/4

Tasty    Yes     No
         6/10    4/10

Likelihood of yes = $\frac{2}{6} \times \frac{2}{6} \times \frac{2}{6} \times \frac{6}{10} = 0.022$
Likelihood of no = $0 \times \frac{3}{4} \times \frac{3}{4} \times \frac{4}{10} = 0$
Therefore, the prediction is tasty.
Que 2.14. Explain EM algorithm with steps.
Answer
1. The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model parameters when the data is incomplete, has missing data points or has some hidden variables.
2. EM chooses random values for the missing data points and estimates a new set of data.
3. These new values are then used recursively to estimate better values for the first data, by filling in the missing points, until the values become fixed.
4. These are the two basic steps of the EM algorithm :
a. Estimation Step :
i. Initialize μk, Σk and πk by random values, or by K-means clustering results, or by hierarchical clustering results.
ii. Then for those given parameter values, estimate the value of the latent variables (i.e., γk).
b. Maximization Step : Update the value of the parameters (i.e., μk, Σk and πk) calculated using the ML method :
i. Initialize the mean μk, the covariance matrix Σk and the mixing coefficients πk by random values (or other values).
ii. Compute the γk values for all k.
iii. Again estimate all the parameters using the current γk values.
iv. Compute the log-likelihood function.
v. Put some convergence criterion.
vi. If the log-likelihood value converges to some value (or if all the parameters converge to some values) then stop, else return to step ii.
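The EM steps above can be sketched for the simplest case, a one-dimensional mixture of Gaussians. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`gaussian_pdf`, `em_gmm_1d`), the scalar variances used in place of covariance matrices, the variance floor and the convergence tolerance are all choices made here for the example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, max_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters.

    Returns (means, variances, weights, log_likelihood_history).
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    history = []
    for _ in range(max_iter):
        # E-step: responsibility gamma of each component for each data point.
        gammas = []
        log_lik = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            gammas.append([p / total for p in joint])
        history.append(log_lik)
        # Convergence criterion: stop when the log-likelihood stops changing.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # M-step: maximum-likelihood updates given the responsibilities.
        for j in range(k):
            n_j = sum(g[j] for g in gammas)
            means[j] = sum(g[j] * x for g, x in zip(gammas, data)) / n_j
            var = sum(g[j] * (x - means[j]) ** 2 for g, x in zip(gammas, data)) / n_j
            variances[j] = max(var, 1e-6)  # assumed floor, avoids a collapsed component
            weights[j] = n_j / len(data)
    return means, variances, weights, history


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: two clusters centred near 0 and 6 (illustrative only).
    data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]
    means, variances, weights, history = em_gmm_1d(
        data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
    )
    print("means:", means)
    print("weights:", weights)
    print("iterations:", len(history))
```

The recorded log-likelihood history makes it easy to check that the value does not decrease between iterations, which is the property usually cited as the main guarantee of EM.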
Que 2.15. Describe the usage, advantages and disadvantages of EM algorithm.
Answer
Usage of EM algorithm :
1. It can be used to fill the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
4. It can be used for discovering the values of latent variables.
Advantages of EM algorithm are :
1. It is guaranteed that the likelihood will not decrease with any iteration.
2. The E-step and M-step are often easy to implement for many problems.
3. Solutions to the M-steps often exist in closed form.
Disadvantages of EM algorithm are :
1. It has slow convergence.
2. It converges to a local optimum only.
3. It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).
Que 2.16. Write a short note on Bayesian network.
OR
Explain Bayesian network by taking an example. How is the Bayesian network powerful representation for uncertainty knowledge ?
Answer
1. A Bayesian network is a directed acyclic graph in which each node is annotated with quantitative probability information.
2. The full specification is as follows :
i. A set of random variables makes up the nodes of the network. Variables may be discrete or continuous.
ii. A set of directed links or arrows connects pairs of nodes. If there is an arrow from node x to node y, x is said to be a parent of y.
iii. Each node xi has a conditional probability distribution P(xi|parent(xi)) that quantifies the effect of parents on the node.
iv. The graph has no directed cycles (and hence is a directed acyclic graph or DAG).
3. A Bayesian network provides a complete description of the domain. Every entry in the full joint probability distribution can be calculated from the information in the network.
4. Bayesian networks provide a concise way to represent conditional independence relationships in the domain.
5. A Bayesian network is often exponentially smaller than the full joint distribution.
For example :
1. Suppose we want to determine the possibility of grass getting wet or dry due to the occurrence of different seasons.
2. The weather has three states : Sunny, Cloudy, and Rainy. There are two possibilities for the grass : Wet or Dry.
3. The sprinkler can be on or off. If it is rainy, the grass gets wet, but if it is sunny, we can make the grass wet by pouring water from a sprinkler.
4. Suppose that the grass is wet. This could be due to one of two reasons : firstly, it is raining; secondly, the sprinklers are turned on.
5. Using Bayes' rule, we can deduce the most contributing factor towards the wet grass.
[Fig. 2.16.1. Bayesian network with the nodes Condition, Sprinkler, Rain and Wet grass.]
Bayesian network possesses the following merits in uncertainty knowledge representation :
1. Bayesian network can conveniently handle incomplete data.
2. Bayesian network can learn the causal relation of variables. In data analysis, causal relation is helpful for field knowledge understanding; it can also easily lead to precise prediction even under much interference.
3. The combination of Bayesian network and Bayesian statistics can take full advantage of field knowledge and information from data.
4. The combination of Bayesian network and other models can effectively avoid the over-fitting problem.
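A Bayesian network of the kind described above can be sketched in a few lines of code. The example below uses the weather, sprinkler, rain and wet grass variables and answers "given wet grass, how likely is rain, and how likely is the sprinkler?" by enumeration over the joint distribution. All conditional probability values, the table names and the function names are hypothetical, chosen only to make the example runnable; the text gives no numbers.

```python
from itertools import product

# Hypothetical conditional probability tables (CPTs); the numbers are
# illustrative assumptions, not values taken from the text.
P_WEATHER = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}
P_SPRINKLER_ON = {"sunny": 0.6, "cloudy": 0.2, "rainy": 0.01}  # P(sprinkler = on | weather)
P_RAIN = {"sunny": 0.0, "cloudy": 0.3, "rainy": 1.0}           # P(rain = yes | weather)
P_WET = {                                                      # P(wet | sprinkler, rain)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.9,
    (False, False): 0.0,
}


def joint(weather, sprinkler, rain, wet):
    """Joint probability: product of each node's CPT entry given its parents."""
    p = P_WEATHER[weather]
    p *= P_SPRINKLER_ON[weather] if sprinkler else 1.0 - P_SPRINKLER_ON[weather]
    p *= P_RAIN[weather] if rain else 1.0 - P_RAIN[weather]
    p *= P_WET[(sprinkler, rain)] if wet else 1.0 - P_WET[(sprinkler, rain)]
    return p


def posterior_given_wet(variable):
    """P(variable = True | grass is wet), summing the joint over the other variables."""
    numerator = 0.0
    evidence = 0.0
    for weather, sprinkler, rain in product(P_WEATHER, (True, False), (True, False)):
        p = joint(weather, sprinkler, rain, True)
        evidence += p
        if {"sprinkler": sprinkler, "rain": rain}[variable]:
            numerator += p
    return numerator / evidence


if __name__ == "__main__":
    print("P(rain | wet)      =", round(posterior_given_wet("rain"), 3))
    print("P(sprinkler | wet) =", round(posterior_given_wet("sprinkler"), 3))
```

The network stores only four small tables, yet any entry of the full joint distribution over the four variables can be recovered from them, which is the compactness property stated in the answer.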
Que 2.17. Explain the role of prior probability and posterior probability in Bayesian classification.
Answer
Role of prior probability :
1. The prior probability is used to compute the probability of the event before the collection of new data.
2. It is used to capture our assumptions / domain knowledge and is independent of the data.
3. It is the unconditional probability that is assigned before any relevant evidence is taken into account.
Role of posterior probability :
1. Posterior probability is used to compute the probability of an event after collection of data.
2. It is used to capture both the assumptions / domain knowledge and the pattern in observed data.
3. It is the conditional probability that is assigned after the relevant evidence or background is taken into account.
Que 2.18. Explain the method of handling approximate inference in Bayesian networks.
Answer
1. Approximate inference methods can be used when exact inference methods lead to unacceptable computation times because the network is very large or densely connected.
2. Methods handling approximate inference :
i. Simulation methods : These methods use the network to generate samples from the conditional probability distribution and estimate conditional probabilities of interest when the number of samples is sufficiently large.
ii. Variational methods : These methods express the inference task as a numerical optimization problem and then find upper and lower bounds of the probabilities of interest by solving a simplified version of this optimization problem.
PART-3
Support Vector Machine, Introduction, Types of Support Vector Kernel - (Linear Kernel, Polynomial Kernel, and Gaussian Kernel), Hyperplane : (Decision Surface), Properties of SVM, and Issues in SVM.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 2.19. Write short note on support vector machine.
Answer
Refer Q. 1.23, Page 1–23L, Unit-1.
Que 2.20. What are the types of support vector machine ?
Answer
Following are the types of support vector machine :
1. Linear SVM : Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
2. Non-linear SVM : Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Que 2.21. What is polynomial kernel ? Explain polynomial kernel using one dimensional and two dimensional.
Answer
1. The polynomial kernel is a kernel function used with Support Vector Machines (SVMs) and other kernelized models, that represents the
similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.
2. Polynomial kernel function is given by the equation :
$(a \cdot b + r)^d$
where, a and b are two different data points that we need to classify.
r determines the coefficients of the polynomial.
d determines the degree of the polynomial.
3. We perform the dot products of the data points, which gives us the high dimensional coordinates for the data.
4. When d = 1, the polynomial kernel computes the relationship between each pair of observations in 1-Dimension and these relationships help to find the support vector classifier.
5. When d = 2, the polynomial kernel computes the 2-Dimensional relationship between each pair of observations which help to find the support vector classifier.
Que 2.22. Describe Gaussian Kernel (Radial Basis Function).
Answer
1. RBF kernel is a function whose value depends on the distance from the origin or from some point.
2. Gaussian Kernel is of the following format :
$K(X_1, X_2) = \exp\left(-\gamma \lVert X_1 - X_2 \rVert^2\right)$
where $\lVert X_1 - X_2 \rVert$ = Euclidean distance between X1 and X2
Using the distance in the original space we calculate the dot product (similarity) of X1 and X2.
3. Following are the parameters used in Gaussian Kernel :
a. C : Inverse of the strength of regularization.
Behavior : As the value of C increases, the model overfits. As the value of C decreases, the model underfits.
b. γ : Gamma (used only for RBF kernel)
Behavior : As the value of γ increases, the model overfits. As the value of γ decreases, the model underfits.
Que 2.23. Write short note on hyperplane (Decision surface).
Answer
1. A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts.
2. For example, let us assume a line to be a one dimensional Euclidean space.
3. Now pick a point on the line; this point divides the line into two parts.
4. The line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the line.
5. For two dimensions we saw that the separating line was the hyperplane.
6. Similarly, for three dimensions a plane with two dimensions divides the 3D space into two parts and thus acts as a hyperplane.
7. Thus for a space of n dimensions we have a hyperplane of n-1 dimensions separating it into two parts.
Que 2.24. What are the advantages and disadvantages of SVM ?
Answer
Advantages of SVM are :
1. Guaranteed optimality : Owing to the nature of convex optimization, the solution will always be a global minimum, not a local minimum.
2. The abundance of implementations : We can access it conveniently.
3. SVM can be used for linearly separable as well as non-linearly separable data. Linearly separable data poses a hard margin whereas non-linearly separable data poses a soft margin.
4. SVMs provide compliance to the semi-supervised learning models. It can be used in areas where the data is labeled as well as unlabeled. It only requires a condition to the minimization problem which is known as the transductive SVM.
5. Feature mapping used to be quite a load on the computational complexity of the overall training performance of the model. However, with the help of the kernel trick, SVM can carry out the feature mapping using the simple dot product.
Disadvantages of SVM :
1. SVM does not give the best performance for handling text structures as compared to other algorithms that are used in handling text data. This leads to loss of sequential information and thereby to worse performance.
2. SVM cannot return a probabilistic confidence value of the kind that logistic regression provides. This limits the explanation it can offer, as the confidence of a prediction is important in several applications.
3. The choice of the kernel is perhaps the biggest limitation of the support vector machine. Considering so many kernels present, it becomes difficult to choose the right one for the data.
Que 2.25. Explain the properties of SVM.
Answer
Following are the properties of SVM :
1. Flexibility in choosing a similarity function.
2. Sparseness of solution when dealing with large data sets : only support vectors are used to specify the separating hyperplane.
3. Ability to handle large feature spaces : complexity does not depend on the dimensionality of the feature space.
4. Overfitting can be controlled by the soft margin approach.
5. A simple convex optimization problem, which is guaranteed to converge to a single global solution.
Que 2.26. What are the parameters used in support vector classifier ?
Answer
Parameters used in support vector classifier are :
1. Kernel :
a. Kernel is selected based on the type of data and also the type of transformation.
b. By default, the kernel is Radial Basis Function Kernel (RBF).
2. Gamma :
a. This parameter decides how far the influence of a single training example reaches during transformation, which in turn affects how tightly the decision boundaries end up surrounding points in the input space.
b. If there is a small value of gamma, points farther apart are considered similar.
c. So, more points are grouped together and have smoother decision boundaries (may be less accurate).
d. With larger values of gamma, points must be closer together to be considered similar (may cause overfitting).
3. The 'C' parameter :
a. This parameter controls the amount of regularization applied on the data.
b. Large values of C mean low regularization, which in turn causes the training data to fit very well (may cause overfitting).
c. Lower values of C mean higher regularization, which causes the model to be more tolerant of errors (may lead to lower accuracy).
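The polynomial and Gaussian (RBF) kernels discussed in this part can be written out directly. The sketch below is an illustration only: the function names and default parameter values are assumptions made here, and `explicit_features_degree2` is a hypothetical helper added to show what the kernel trick avoids computing for 2-dimensional inputs with d = 2 and r ≥ 0.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, r=1.0, d=2):
    """Polynomial kernel (a . b + r)^d."""
    return (dot(a, b) + r) ** d


def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x1 - x2||^2)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x1, x2))
    return math.exp(-gamma * sq_dist)


def explicit_features_degree2(x, r=1.0):
    """Explicit feature map for 2-D inputs whose dot product equals
    polynomial_kernel(a, b, r, d=2). Requires r >= 0."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, s * x1 * x2,
            s * math.sqrt(r) * x1, s * math.sqrt(r) * x2, r]


if __name__ == "__main__":
    a, b = (1.0, 2.0), (3.0, 4.0)
    # The kernel value and the dot product in the mapped space agree.
    print(polynomial_kernel(a, b))
    print(dot(explicit_features_degree2(a), explicit_features_degree2(b)))
    # A larger gamma makes the same pair of points look less similar.
    print(rbf_kernel(a, b, gamma=0.1), rbf_kernel(a, b, gamma=1.0))
```

The comparison between the kernel value and the explicit feature-space dot product is the kernel trick in miniature: the kernel gives the same similarity without ever building the six-dimensional feature vectors.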
Machine Learning Techniques 3–1 L (CS/IT-Sem-5) Decision Tree Learning 3–2 L (CS/IT-Sem-5)

3
Decision Tree Learning
CONTENTS
Part-1 : Decision Tree Learning, Decision Tree Learning Algorithm, Inductive Bias, Inductive Inference with Decision Trees
Part-2 : Entropy and Information Theory, Information Gain, ID-3 Algorithm, Issues in Decision Tree Learning
Part-3 : Instance-based Learning
Part-4 : K-Nearest Neighbour Learning, Locally Weighted Regression, Radial Basis Function Networks
Part-5 : Case-based Learning
PART-1
Decision Tree Learning, Decision Tree Learning Algorithm, Inductive Bias, Inductive Inference with Decision Trees.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 3.1. Describe the basic terminology used in decision tree.
Answer
Basic terminology used in decision trees are :
1. Root node : It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
2. Splitting : It is a process of dividing a node into two or more sub-nodes.
3. Decision node : When a sub-node splits into further sub-nodes, then it is called a decision node.
[Fig. 3.1.1. A decision tree showing the root node, splitting, decision nodes, terminal nodes and a branch/sub-tree.]
4. Leaf / Terminal node : Nodes that do not split are called leaf or terminal nodes.
5. Pruning : When we remove sub-nodes of a decision node, this process is called pruning. This process is the opposite of splitting.
6. Branch / sub-tree : A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and child node : A node which is divided into sub-nodes is called the parent node of the sub-nodes, whereas the sub-nodes are the children of the parent node.
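The terminology above maps naturally onto a small tree data structure. The following sketch is only an illustration of the terms; the class name `Node`, its methods and the helper `leaves` are assumptions introduced here and do not come from any particular library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; the node that has no parent is the root node."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self):
        """Leaf / terminal node: a node that does not split."""
        return not self.children

    def split(self, *names):
        """Splitting: divide this (parent) node into child sub-nodes."""
        self.children = [Node(n) for n in names]
        return self.children

    def prune(self):
        """Pruning: remove the sub-nodes, turning a decision node into a leaf."""
        self.children = []


def leaves(node):
    """Names of all terminal nodes in the sub-tree rooted at node."""
    if node.is_leaf():
        return [node.name]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


if __name__ == "__main__":
    root = Node("root")
    a, b = root.split("A", "B")   # root becomes a parent of A and B
    a.split("A1", "A2")           # A becomes a decision node
    print(leaves(root))
    a.prune()                     # A is a terminal node again
    print(leaves(root))
```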
Que 3.2. Why do we use decision tree ?
Answer
1. Decision trees can be visualized and are simple to understand and interpret.
2. They require less data preparation, whereas other techniques often require data normalization, the creation of dummy variables and removal of blank values.
3. The cost of using the tree (for predicting data) is logarithmic in the number of data points used to train the tree.
4. Decision trees can handle both categorical and numerical data, whereas other techniques are specialized for only one type of variable.
5. Decision trees can handle multi-output problems.
6. Decision tree is a white box model, i.e., the explanation for a condition can be given easily by Boolean logic because there are two outputs, for example yes or no.
7. Decision trees can be used even if assumptions are violated by the dataset from which the data is taken.
Que 3.3. How can we express decision trees ?
Answer
1. Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance.
2. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute as shown in Fig. 3.3.1.
3. This process is then repeated for the subtree rooted at the new node.
4. The decision tree in Fig. 3.3.1 classifies a particular morning according to whether it is suitable for playing tennis and returns the classification associated with the particular leaf.
5. For example, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance.
6. In other words, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances :
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
[Fig. 3.3.1. Decision tree for playing tennis : Outlook is the root node, with branches Sunny (tests Humidity : High gives No, Normal gives Yes), Overcast (Yes) and Rain (tests Wind : Strong gives No, Weak gives Yes).]
Que 3.4. Explain various decision tree learning algorithms.
Answer
Various decision tree learning algorithms are :
1. ID3 (Iterative Dichotomiser 3) :
i. ID3 is an algorithm used to generate a decision tree from a dataset.
ii. To construct a decision tree, ID3 uses a top-down, greedy search through the given sets, where each attribute at every tree node is tested to select the attribute that is best for classification of a given set.
iii. Therefore, the attribute with the highest information gain can be selected as the test attribute of the current node.
iv. In this algorithm, small decision trees are preferred over the larger ones. It is a heuristic algorithm because it does not construct the smallest tree.
v. For building a decision tree model, ID3 only accepts categorical attributes. Accurate results are not given by ID3 when there is noise and when it is serially implemented.
vi. Therefore data is preprocessed before constructing a decision tree.
vii. For constructing a decision tree, information gain is calculated for each and every attribute, and the attribute with the highest information gain becomes the root node. The remaining possible values are denoted by arcs.
viii. All the outcome instances that are possible are examined to see whether they belong to the same class or not. For the instances of the same class, a single name is used to denote the class; otherwise the instances are classified on the basis of the splitting attribute.
2. C4.5 :
i. C4.5 is an algorithm used to generate a decision tree. It is an extension of the ID3 algorithm.
ii. C4.5 generates decision trees which can be used for classification, and therefore C4.5 is referred to as a statistical classifier.
iii. It is better than the ID3 algorithm because it deals with both continuous and discrete attributes, and also with missing values and with pruning trees after construction.
iv. C5.0 is the commercial successor of C4.5 because it is faster, memory efficient and used for building smaller decision trees.
v. C4.5 performs a tree pruning process by default. This leads to the formation of smaller trees and simpler rules, and produces more intuitive interpretations.
3. CART (Classification And Regression Trees) :
i. CART algorithm builds both classification and regression trees.
ii. The classification tree is constructed by CART through binary splitting of the attribute.
iii. Gini Index is used for selecting the splitting attribute.
iv. The CART is also used for regression analysis with the help of regression tree.
v. The regression feature of CART can be used in forecasting a dependent variable given a set of predictor variables over a given period of time.
vi. CART has an average speed of processing and supports both continuous and nominal attribute data.
Que 3.5. What are the advantages and disadvantages of different decision tree learning algorithm ?
Answer
Advantages of ID3 algorithm :
1. The training data is used to create understandable prediction rules.
2. It builds short and fast tree.
3. ID3 searches the whole dataset to create the whole tree.
4. It finds the leaf nodes thus enabling the test data to be pruned and reducing the number of tests.
5. The calculation time of ID3 is the linear function of the product of the characteristic number and node number.
Disadvantages of ID3 algorithm :
1. For a small sample, data may be overfitted or overclassified.
2. For making a decision, only one attribute is tested at an instant thus consuming a lot of time.
3. Classifying the continuous data may prove to be expensive in terms of computation, as many trees have to be generated to see where to break the continuous sequence.
4. It is overly sensitive to features when given a large number of input values.
Advantages of C4.5 algorithm :
1. C4.5 is easy to implement.
2. C4.5 builds models that can be easily interpreted.
3. It can handle both categorical and continuous values.
4. It can deal with noise and missing value attributes.
Disadvantages of C4.5 algorithm :
1. A small variation in data can lead to different decision trees when using C4.5.
2. For a small training set, C4.5 does not work very well.
Advantages of CART algorithm :
1. CART can handle missing values automatically using proxy splits.
2. It uses combination of continuous/discrete variables.
3. CART automatically performs variable selection.
4. CART can establish interactions among variables.
5. CART does not vary according to the monotonic transformation of predictive variable.
Disadvantages of CART algorithm :
1. CART has unstable decision trees.
2. CART splits only by one variable.
3. It is non-parametric algorithm.
PART-2
Entropy and Information Theory, Information Gain, ID-3 Algorithm, Issues in Decision Tree Learning.
Questions-Answers
Long Answer Type and Medium Answer Type Questions
Que 3.6. Explain attribute selection measures used in decision tree.

Answer
Attribute selection measures used in decision tree are :
1. Entropy :
i. Entropy is a measure of uncertainty associated with a random variable.
ii. The entropy increases with an increase in uncertainty or randomness and decreases with a decrease in uncertainty or randomness.
iii. For two classes, the value of entropy ranges from 0 to 1; in general, for c classes it ranges from 0 to log2(c).
$$\mathrm{Entropy}(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$
where p_i is the non-zero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|.
iv. A log function of base 2 is used because the entropy is encoded in bits 0 and 1.
2. Information gain :
i. ID3 uses information gain as its attribute selection measure.
ii. Information gain is the difference between the original information requirement (i.e., based on the proportion of classes) and the new requirement (i.e., obtained after partitioning on A).
$$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Entropy}(D_j)$$
Where,
D : A given data partition
A : Attribute
v : Suppose we partition the tuples in D on some attribute A having v distinct values
iii. D is split into v partitions or subsets, {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A.
iv. The attribute that has the highest information gain is chosen.
3. Gain ratio :
i. The information gain measure is biased towards tests with many outcomes.
ii. That is, it prefers to select attributes having a large number of values.
iii. In the extreme case (for example, an attribute with a different value for every tuple), each partition is pure, so the information gain by partitioning is maximal. But such partitioning cannot be used for classification.
iv. C4.5 uses gain ratio as its attribute selection measure, which is an extension to the information gain.
v. Gain ratio differs from information gain, which measures the information with respect to a classification that is acquired based on some partitioning.
vi. Gain ratio applies a kind of normalization to information gain using a split information value defined as :
$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$
vii. The gain ratio is then defined as :
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{SplitInfo}_A(D)}$$
viii. The attribute having the maximum gain ratio is selected as the splitting attribute.

Que 3.7. Explain applications of decision tree in various areas of data mining.

Answer
The various decision tree applications in data mining are :
1. E-Commerce : Decision trees are widely used in the field of e-commerce, where they help to generate online catalogs, which are an important factor for the success of an e-commerce website.
2. Industry : Decision tree algorithm is useful for producing quality control (faults identification) systems.
3. Intelligent vehicles : An important task for the development of intelligent vehicles is to find the lane boundaries of the road, and decision trees are used for this task.
4. Medicine :
a. Decision tree is an important technique for medical research and practice. A decision tree is used for the diagnosis of various diseases.
b. Decision tree is also used for heart sound diagnosis.
5. Business : Decision trees find use in the field of business where they are used for visualization of probabilistic business models, in CRM (Customer Relationship Management), for credit scoring of credit card users and for predicting loan risks in banks.

Que 3.8. Explain procedure of ID3 algorithm.

Answer
ID3 (Examples, TargetAttribute, Attributes) :
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree root, with label = +

3. If all Examples are negative, return the single-node tree root, with label = -
4. If Attributes is empty, return the single-node tree root, with label = most common value of TargetAttribute in Examples.
5. Otherwise begin
a. A ← the attribute from Attributes that best classifies Examples
b. The decision attribute for Root ← A
c. For each possible value, vi, of A,
i. Add a new tree branch below root, corresponding to the test A = vi
ii. Let Examples_vi be the subset of Examples that have value vi for A
iii. If Examples_vi is empty
a. Then below this new branch add a leaf node with label = most common value of TargetAttribute in Examples
b. Else below this new branch add the sub-tree ID3 (Examples_vi, TargetAttribute, Attributes - {A})
6. End
7. Return root.

Que 3.9. Explain inductive bias with inductive system.

Answer
Inductive bias :
1. Inductive bias refers to the restrictions that are imposed by the assumptions made in the learning method.
2. For example, we may assume that the solution to the problem of road safety can be expressed as a conjunction of a set of eight concepts.
3. This does not allow for more complex expressions that cannot be expressed as a conjunction.
4. This inductive bias means that there are some potential solutions that we cannot explore, and that are not contained within the version space we examine.
5. In order to have an unbiased learner, the version space would have to contain every possible hypothesis that could possibly be expressed.
6. In that case, the solution that the learner produced could never be more general than the complete set of training data.
7. In other words, it would be able to classify data that it had previously seen (as the rote learner could) but would be unable to generalize in order to classify new, unseen data.
8. The inductive bias of the candidate elimination algorithm is that it is only able to classify a new piece of data if all the hypotheses contained within its version space give the data the same classification.
9. Hence, the inductive bias does impose a limitation on the learning method.
Inductive system :
Fig. 3.9.1. Inductive system : the training examples and a new instance are given to the candidate elimination algorithm using hypothesis space H, which outputs either a classification of the new instance or "do not know".

Que 3.10. Explain inductive learning algorithm.

Answer
Inductive learning algorithm :
Step 1 : Divide the table 'T' containing m examples into n sub-tables (t1, t2, ... tn), one table for each possible value of the class attribute (repeat steps 2-8 for each sub-table).
Step 2 : Initialize the attribute combination count j = 1.
Step 3 : For the sub-table being processed, divide the attribute list into distinct combinations, each combination with j distinct attributes.
Step 4 : For each combination of attributes, count the number of occurrences of attribute values that appear under the same combination of attributes in unmarked rows of the sub-table under consideration and, at the same time, do not appear under the same combination of attributes in other sub-tables. Call the first combination with the maximum number of occurrences the max-combination MAX.
Step 5 : If MAX == null, increase j by 1 and go to Step 3.
Step 6 : Mark all rows of the sub-table being processed, in which the values of MAX appear, as classified.
Step 7 : Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R (rule set) whose left-hand side will have attribute names of the MAX with their values separated by AND, and whose right-hand side contains the decision attribute value associated with the sub-table.
Step 8 : If all rows are marked as classified, then move on to process another sub-table and go to Step 2, else, go to Step 4. If no sub-tables are available, exit with the set of rules obtained till then.
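
The attribute selection measures of Que 3.6 and the ID3 procedure of Que 3.8 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the dictionary-based tree and the toy `rows`/`labels` data are all assumptions made for the example, and branches are created only for attribute values that actually occur in the examples, so the "empty subset" case of step 5(c)(iii) never arises here.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def partition(rows, labels, attr):
    """Group the class labels by the value each row takes for attribute attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def information_gain(rows, labels, attr):
    """Gain(D, A) = Entropy(D) - sum_j |D_j|/|D| * Entropy(D_j)."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g)
                    for g in partition(rows, labels, attr).values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """GainRatio(A) = Gain(D, A) / SplitInfo_A(D); 0 if the split is trivial."""
    total = len(labels)
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in partition(rows, labels, attr).values())
    if split_info == 0:
        return 0.0
    return information_gain(rows, labels, attr) / split_info


def id3(rows, labels, attributes):
    """Return a nested dict {attribute: {value: subtree_or_label}}."""
    # Steps 2-3: all examples share one class -> single-node tree.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 4: no attributes left -> most common class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 5a: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Step 5c: one branch per observed value of the chosen attribute.
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                remaining)
    return tree


# Hypothetical toy data: the class depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["+", "+", "-", "-"]

tree = id3(rows, labels, ["windy", "outlook"])
print(tree)
```

In this toy data "outlook" separates the classes perfectly while "windy" carries no information, so ID3 places "outlook" at the root. Replacing `information_gain` with `gain_ratio` in `id3` would give a C4.5-style selection rule.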

Que 3.11. Which learning algorithms are used in inductive bias ?

Answer
Learning algorithms used in inductive bias are :
1. Rote-learner :
a. Learning corresponds to storing each observed training example in memory.
b. Subsequent instances are classified by looking them up in memory.
c. If the instance is found in memory, the stored classification is returned.
d. Otherwise, the system refuses to classify the new instance.
e. Inductive bias : There is no inductive bias.
2. Candidate-elimination :
a. New instances are classified only in the case where all members of the current version space agree on the classification.
b. Otherwise, the system refuses to classify the new instance.
c. Inductive bias : The target concept can be represented in its hypothesis space.
3. FIND-S :
a. This algorithm finds the most specific hypothesis consistent with the training examples.
b. It then uses this hypothesis to classify all subsequent instances.
c. Inductive bias : The target concept can be represented in its hypothesis space, and all instances are negative instances unless the opposite is entailed by its other knowledge.

Que 3.12. Discuss the issues related to the applications of decision trees.

Answer
Issues related to the applications of decision trees are :
1. Missing data :
a. Values may have gone unrecorded, or they might be too expensive to obtain.
b. Two problems arise :
i. To classify an object that is missing one of the test attributes.
ii. To modify the information gain formula when examples have unknown values for the attribute.
2. Multi-valued attributes :
a. When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness.
b. In the extreme case, we could use an attribute that has a different value for every example.
c. Then each subset of examples would be a singleton with a unique classification, so the information gain measure would have its highest value for this attribute, even though the attribute could be irrelevant or useless.
d. One solution is to use the gain ratio.
3. Continuous and integer valued input attributes :
a. Attributes such as height and weight have an infinite set of possible values.
b. Rather than generating infinitely many branches, decision tree learning algorithms find the split point that gives the highest information gain.
c. Efficient dynamic programming methods exist for finding good split points, but it is still the most expensive part of real world decision tree learning applications.
4. Continuous-valued output attributes :
a. If we are trying to predict a numerical value, such as the price of a work of art, rather than discrete classifications, then we need a regression tree.
b. Such a tree has a linear function of some subset of numerical attributes, rather than a single value, at each leaf.
c. The learning algorithm must decide when to stop splitting and begin applying linear regression using the remaining attributes.

PART-3
Instance-based Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.13. Write short note on instance-based learning.

Answer
1. Instance-Based Learning (IBL) is an extension of nearest neighbour or K-NN classification algorithms.
2. IBL algorithms do not maintain a set of abstractions or models created from the instances.
3. The K-NN algorithms have a large space requirement.
4. IBL algorithms also extend K-NN with a significance test to work with noisy instances, since many real-life datasets have noisy training instances and K-NN algorithms do not work well with noise.
5. Instance-based learning is based on the memorization of the dataset.
6. The number of parameters is unbounded and grows with the size of the data.
7. The classification is obtained through memorized examples.
8. The cost of the learning process is 0; all the cost is in the computation of the prediction.
9. This kind of learning is also known as lazy learning.

Que 3.14. Explain instance-based learning representation.

Answer
Following are the instance-based learning representations :
Instance-based representation (1) :
1. The simplest form of learning is plain memorization.
2. This is a completely different way of representing the knowledge extracted from a set of instances : just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known.
3. Instead of creating rules, work directly from the examples themselves.
Instance-based representation (2) :
1. Instance-based learning is lazy, deferring the real work as long as possible.
2. In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is also called the nearest-neighbour classification method.
3. Sometimes more than one nearest neighbour is used, and the majority class of the closest k-nearest neighbours is assigned to the new instance. This is termed the k-nearest neighbour method.
Instance-based representation (3) :
1. When computing the distance between two examples, the standard Euclidean distance may be used.
2. For nominal attributes, a distance of 0 is assigned if the values are identical, otherwise the distance is 1.
3. Some attributes will be more important than others, so some kind of attribute weighting is needed. Getting suitable attribute weights from the training set is a key problem.
4. It may not be necessary, or desirable, to store all the training instances.
Instance-based representation (4) :
1. Generally some regions of attribute space are more stable with regard to class than others, and just a few examples are needed inside stable regions.
2. An apparent drawback of instance-based representations is that they do not make explicit the structures that are learned.
Fig. 3.14.1. (panels (a), (b) and (c))

Que 3.15. What are the performance dimensions used for instance-based learning algorithm ?

Answer
Performance dimensions used for instance-based learning algorithm are :
1. Generality :
a. This is the class of concepts that can be described by the representation of an algorithm.
b. IBL algorithms can PAC-learn any concept whose boundary is a union of a finite number of closed hyper-curves of finite size.
2. Accuracy : This concept describes the accuracy of classification.
3. Learning rate :
a. This is the speed at which classification accuracy increases during training.
b. It is a more useful indicator of the performance of the learning algorithm than accuracy for finite-sized training sets.
4. Incorporation costs :
a. These are incurred while updating the concept descriptions with a single training instance.
b. They include classification costs.

5. Storage requirement : This is the size of the concept description for IBL algorithms, which is defined as the number of saved instances used for classification decisions.

Que 3.16. What are the functions of instance-based learning ?

Answer
Functions of instance-based learning are :
1. Similarity function :
a. This computes the similarity between a training instance i and the instances in the concept description.
b. Similarities are numeric-valued.
2. Classification function :
a. This receives the similarity function's results and the classification performance records of the instances in the concept description.
b. It yields a classification for i.
3. Concept description updater :
a. This maintains records on classification performance and decides which instances to include in the concept description.
b. Inputs include i, the similarity results, the classification results, and a current concept description. It yields the modified concept description.

Que 3.17. What are the advantages and disadvantages of instance-based learning ?

Answer
Advantages of instance-based learning :
1. Learning is trivial.
2. Works efficiently.
3. Noise resistant.
4. Rich representation, arbitrary decision surfaces.
5. Easy to understand.

PART-4
K-Nearest Neighbour Learning, Locally Weighted Regression, Radial Basis Function Networks.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.18. Describe K-Nearest Neighbour algorithm with steps.

Answer
1. The KNN classification algorithm is used to decide which class a new instance should belong to.
2. When K = 1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
4. KNN classification does not have a training phase; all instances are stored. Indexing can be used to find neighbours quickly.
5. During testing, the KNN classification algorithm has to find the K nearest neighbours of a new instance. This is time consuming if we do exhaustive comparison.
6. K-nearest neighbours use the local neighbourhood to obtain a prediction.
Algorithm : Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array arr of data points. Each element of this array represents a tuple (x, y).
2. For i = 1 to m : calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
4. Return the majority label among S.
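
The four steps of the KNN algorithm above can be sketched directly in Python. This is only an illustrative sketch using an exhaustive search: the function name `knn_classify` and the toy `training` data are assumptions made for the example, and ties between labels are broken by whichever label `Counter` encounters first.

```python
import math
from collections import Counter


def knn_classify(training, p, k):
    """Classify point p from a list of (x, y) tuples, where x is a
    sequence of numbers and y is its class label."""
    # Step 2: Euclidean distance from p to every stored sample.
    distances = [(math.dist(x, p), y) for x, y in training]
    # Step 3: keep the K samples with the smallest distances.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Step 4: return the majority label among those K neighbours.
    return Counter(y for _, y in nearest).most_common(1)[0][0]


# Step 1: hypothetical training samples stored as (x, y) tuples.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

print(knn_classify(training, (1.1, 1.0), k=3))
print(knn_classify(training, (5.5, 5.0), k=3))
```

Because every prediction scans all m stored samples, the cost of a query grows with the size of the training set, which is the exhaustive comparison mentioned in point 5.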
Disadvantages of instance-based learning :
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ R^n.
4. Implicit weights of attributes (need normalization).
5. Need large space for storage i.e., require large memory.
6. Expensive application time.

Que 3.19. What are the advantages and disadvantages of K-nearest neighbour algorithm ?

Answer
Advantages of KNN algorithm :
1. No training period :
a. KNN is called lazy learner (Instance-based learning).

b. It does not learn anything in the training period. It does not derive any discriminative function from the training data.
c. In other words, there is no training period for it. It stores the training dataset and learns from it only at the time of making real time predictions.
d. This makes the KNN algorithm much faster to set up than algorithms that require training, for example, SVM, linear regression, etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly without impacting the accuracy of the algorithm.
3. KNN is very easy to implement. Only two parameters are required to implement KNN, i.e., the value of K and the distance function (for example, Euclidean).
Disadvantages of KNN :
1. Does not work well with large dataset : In large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions : The KNN algorithm does not work well with high dimensional data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate the distance in each dimension.
3. Need feature scaling : We need to do feature scaling (standardization and normalization) before applying the KNN algorithm to any dataset. If we do not do so, KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers : KNN is sensitive to noise in the dataset. We need to manually represent missing values and remove outliers.

Que 3.20. Explain locally weighted regression.

Answer
1. Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a parameterized model.
2. After training, the model is used for predictions and the data are generally discarded.
3. In contrast, memory-based methods are non-parametric approaches that explicitly retain the training data, and use it each time a prediction needs to be made.
4. Locally Weighted Regression (LWR) is a memory-based method that performs a regression around a point using only training data that are local to that point.
5. LWR was shown to be suitable for real-time control by constructing an LWR-based system that learned a difficult juggling task.
Fig. 3.20.1. (plot against X)
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs a linear regression on points in the data set, weighted by a kernel centered at x.
7. The kernel shape is a design parameter. The original LOESS model uses a tricubic kernel; the Gaussian kernel used here is :
$$h_i(x) = h(x - x_i) = \exp\left(-k (x - x_i)^2\right)$$
where k is a smoothing parameter.
8. For brevity, we will drop the argument x for h_i(x), and define n = \sum_i h_i. We can then write the estimated means and covariances as :
$$\mu_x = \frac{\sum_i h_i x_i}{n}, \quad \sigma_x^2 = \frac{\sum_i h_i (x_i - \mu_x)^2}{n}, \quad \sigma_{xy} = \frac{\sum_i h_i (x_i - \mu_x)(y_i - \mu_y)}{n}$$
$$\mu_y = \frac{\sum_i h_i y_i}{n}, \quad \sigma_y^2 = \frac{\sum_i h_i (y_i - \mu_y)^2}{n}, \quad \sigma_{y|x}^2 = \sigma_y^2 - \frac{\sigma_{xy}^2}{\sigma_x^2}$$
9. We use the data covariances to express the conditional expectations and their estimated variances :
$$\hat{y} = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x), \quad \sigma_{\hat{y}}^2 = \frac{\sigma_{y|x}^2}{n^2}\left(\sum_i h_i^2 + \frac{(x - \mu_x)^2}{\sigma_x^2} \sum_i h_i^2 \frac{(x_i - \mu_x)^2}{\sigma_x^2}\right)$$
Fig. 3.20.2. Effect of kernel width (plot against X) : kernel too wide - includes region; kernel just right; kernel too narrow - excludes some of linear region.

Que 3.21. Explain Radial Basis Function (RBF).

Answer
1. A Radial Basis Function (RBF) is a function that assigns a real value to each input from its domain (it is a real-value function), and the value produced by the RBF is always an absolute value i.e., it is a measure of distance and cannot be negative.

2. Euclidean distance (the straight-line distance) between two points in Euclidean space is used.
3. Radial basis functions are used to approximate functions, just as neural networks act as function approximators.
4. The following sum represents a radial basis function network :
$$y(x) = \sum_{i=1}^{N} w_i \, \phi(\lVert x - x_i \rVert)$$
5. The radial basis functions act as activation functions.
6. The approximant y(x) is differentiable with respect to the weights, which are learned using iterative update methods common among neural networks.

Que 3.22. Explain the architecture of a radial basis function network.

Answer
1. Radial Basis Function (RBF) networks have three layers : an input layer, a hidden layer with a non-linear RBF activation function and a linear output layer.
2. The input can be modeled as a vector of real numbers x ∈ R^n.
3. The output of the network is then a scalar function of the input vector, φ : R^n → R, and is given by
$$\varphi(x) = \sum_{i=1}^{N} a_i \, \rho(\lVert x - c_i \rVert)$$
where N is the number of neurons in the hidden layer, ρ is the radial basis function, c_i is the center vector for neuron i and a_i is the weight of neuron i in the linear output neuron.
4. Functions that depend only on the distance from a center vector are radially symmetric about that vector.
5. In the basic form all inputs are connected to each hidden neuron.
6. The radial basis function is taken to be Gaussian :
$$\rho(\lVert x - c_i \rVert) = \exp\left[-\beta \lVert x - c_i \rVert^2\right]$$
7. The Gaussian basis functions are local to the center vector in the sense that
$$\lim_{\lVert x \rVert \to \infty} \rho(\lVert x - c_i \rVert) = 0$$
i.e., changing parameters of one neuron has only a small effect for input values that are far away from the center of that neuron.
8. Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of R^n.
9. This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision.
10. The parameters a_i, c_i, and β are determined in a manner that optimizes the fit between φ and the data.

Fig. 3.22.1. Architecture of a radial basis function network (from bottom to top : input x, weights, radial basis functions, linear weights, output y). An input vector x is used as input to all radial basis functions, each with different parameters. The output of the network is a linear combination of the outputs from radial basis functions.

PART-5
Case-based Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.23. Write short note on case-based learning algorithm.

Answer
1. Case-Based Learning (CBL) algorithms contain an input as a sequence of training cases and an output concept description, which can be used to generate predictions of goal feature values for subsequently presented cases.

2. The primary component of the concept description is the case-base, but almost all CBL algorithms maintain additional related information for the purpose of generating accurate predictions (for example, settings for feature weights).
3. Current CBL algorithms assume that cases are described using a feature-value representation, where features are either predictor or goal features.
4. CBL algorithms are distinguished by their processing behaviour.
Disadvantages of case-based learning algorithm :
1. They are computationally expensive because they save and compute similarities to all training cases.
2. They are intolerant of noise and irrelevant features.
3. They are sensitive to the choice of the algorithm's similarity function.
4. There is no simple way for them to process symbolic-valued features.

Que 3.24. What are the functions of case-based learning algorithm ?

Answer
Functions of case-based learning algorithm are :
1. Pre-processor : This prepares the input for processing (for example, normalizing the range of numeric-valued features to ensure that they are treated with equal importance by the similarity function, formatting the raw input into a set of cases).
2. Similarity :
a. This function assesses the similarities of a given case with the previously stored cases in the concept description.
b. Assessment may involve explicit encoding and/or dynamic computation.
c. CBL similarity functions find a compromise along the continuum between these extremes.
3. Prediction : This function inputs the similarity assessments and generates a prediction for the value of the given case's goal feature (i.e., a classification when it is symbolic-valued).
4. Memory updating : This updates the stored case-base, such as by modifying or abstracting previously stored cases, forgetting cases presumed to be noisy, or updating a feature's relevance weight setting.

Que 3.25. Describe case-based learning cycle with different schemes of CBL.

Answer
Case-based learning algorithm processing stages are :
1. Case retrieval : After the problem situation has been assessed, the best matching case is searched in the case-base and an approximate solution is retrieved.
2. Case adaptation : The retrieved solution is adapted to fit better in the new problem.
Fig. 3.25.1. The CBL cycle (Problem, Retrieve, Reuse, Revise, Retain; proposed solution, confirmed solution).
3. Solution evaluation :
a. The adapted solution can be evaluated either before the solution is applied to the problem or after the solution has been applied.
b. In any case, if the accomplished result is not satisfactory, the retrieved solution must be adapted again or more cases should be retrieved.
4. Case-base updating : If the solution was verified as correct, the new case may be added to the case base.
Different schemes of the CBL working cycle are :
1. Retrieve the most similar case.
2. Reuse the case to attempt to solve the current problem.
3. Revise the proposed solution if necessary.

4. Retain the new solution as a part of a new case.

Que 3.26. What are the benefits of CBL as a lazy problem solving method ?

Answer
The benefits of CBL as a lazy problem solving method are :
1. Ease of knowledge elicitation :
a. Lazy methods can utilise easily available cases or problem instances instead of rules that are difficult to extract.
b. So, classical knowledge engineering is replaced by case acquisition and structuring.
2. Absence of problem-solving bias :
a. Cases can be used for multiple problem-solving purposes, because they are stored in a raw form.
b. This is in contrast to eager methods, which can be used merely for the purpose for which the knowledge has already been compiled.
3. Incremental learning :
a. A CBL system can be put into operation with a minimal set of solved cases furnishing the case base.
b. The case base will be filled with new cases, increasing the system's problem-solving ability.
c. Besides augmentation of the case base, new indexes and cluster categories can be created and the existing ones can be changed.
d. In contrast, eager methods require a special training period whenever information extraction (knowledge generalisation) is performed.
e. Hence, dynamic on-line adaptation to a non-rigid environment is possible.
4. Suitability for complex and not-fully formalised solution spaces :
a. CBL systems can be applied to an incomplete model of the problem domain; implementation involves both identifying relevant case features and furnishing a, possibly partial, case base with proper cases.
b. Lazy approaches are more appropriate for complex solution spaces than eager approaches, which replace the presented data with abstractions obtained by generalisation.
5. Suitability for sequential problem solving :
a. Sequential tasks, like those encountered in reinforcement learning problems, benefit from the storage of history in the form of a sequence of states or procedures.
b. Such storage is facilitated by lazy approaches.
6. Ease of explanation :
a. The results of a CBL system can be justified based upon the similarity of the current problem to the retrieved case.
b. Since CBL results are easily traceable to precedent cases, it is also easier to analyse failures of the system.
7. Ease of maintenance : This is particularly due to the fact that CBL systems can adapt to many changes in the problem domain and the relevant environment, merely by acquiring new cases.

Que 3.27. What are the limitations of CBL ?

Answer
Limitations of CBL are :
1. Handling large case bases :
a. High memory / storage requirements and time-consuming retrieval accompany CBL systems utilising large case bases.
b. Although the order of both is linear with the number of cases, these problems usually lead to increased construction costs and reduced system performance.
c. These problems become less significant as the hardware components become faster and cheaper.
2. Dynamic problem domains :
a. CBL systems may have difficulties in handling dynamic problem domains, where they may be unable to follow a shift in the way problems are solved, since they are strongly biased towards what has already worked.
b. This may result in an outdated case base.
3. Handling noisy data :
a. Parts of the problem situation may be irrelevant to the problem itself.

b. If such noise in a problem situation presented to a CBL system is not assessed successfully, the same problem may be stored unnecessarily many times in the case base because of differences due to the noise.
c. In turn, this implies inefficient storage and retrieval of cases.
4. Fully automatic operation :
a. In a CBL system, the problem domain is not fully covered.
b. Hence, some problem situations can occur for which the system has no solution.
c. In such situations, CBL systems expect input from the user.

Que 3.28. What are the applications of CBL ?

Answer
Applications of CBL :
1. Interpretation : It is a process of evaluating situations / problems in some context (For example, HYPO for interpretation of patent laws, KICS for interpretation of building regulations, LISSA for interpretation of non-destructive test measurements).
2. Classification : It is a process of explaining a number of encountered symptoms (For example, CASEY for classification of auditory impairments, CASCADE for classification of software failures, PAKAR for causal classification of building defects, ISFER for classification of facial expressions into user defined interpretation categories).
3. Design : It is a process of satisfying a number of posed constraints (For example, JULIA for meal planning, CLAVIER for design of optimal layouts of composite airplane parts, EADOCS for aircraft panels design).
4. Planning : It is a process of arranging a sequence of actions in time (For example, BOLERO for building diagnostic plans for medical patients, TOTLEC for manufacturing planning).
5. Advising : It is a process of resolving diagnosed problems (For example, DECIDER for advising students, HOMER).

Que 3.29. What are major paradigms of machine learning ?

Answer
Major paradigms of machine learning are :
1. Rote Learning :
a. There is one-to-one mapping from inputs to stored representation.
b. Learning by memorization.
c. There is association-based storage and retrieval.
2. Induction : Machine learning uses specific examples to reach general conclusions.
3. Clustering : Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
4. Analogy : Determine correspondence between two different representations.
5. Discovery : Unsupervised, i.e., a specific goal is not given.
6. Genetic algorithms :
a. Genetic algorithms are stochastic search algorithms which act on a population of possible solutions.
b. They are probabilistic search methods, which means that the states they explore are not determined solely by the properties of the problem.
7. Reinforcement :
a. In reinforcement learning, only feedback (positive or negative reward) is given at the end of a sequence of steps.
b. It requires assigning reward to steps by solving the credit assignment problem : which steps should receive credit or blame for the final result ?

Que 3.30. Briefly explain the inductive learning problem.

Answer
Inductive learning problems are :
1. Supervised versus unsupervised learning :
a. We want to learn an unknown function f(x) = y, where x is an input example and y is the desired output.
b. Supervised learning implies we are given a set of (x, y) pairs by a teacher.
c. Unsupervised learning means we are only given the xs.
d. In either case, the goal is to estimate f.
2. Concept learning :
a. Given a set of examples of some concept/class/category, determine if a given example is an instance of the concept or not.
b. If it is an instance, we call it a positive example.
c. If it is not, it is called a negative example.

3. Supervised concept learning by induction :
a. Given a training set of positive and negative examples of a concept, construct a description that will accurately classify whether future examples are positive or negative.
b. That is, learn some good estimate of function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi is either + (positive) or - (negative).

4
Artificial Neural Network and Deep Learning


CONTENTS
Part-1 : Artificial Neural Network, Perceptrons, Multilayer Perceptron, Gradient Descent and the Delta Rule
Part-2 : Multilayer Network, Derivation of Back Propagation Algorithm, Generalization
Part-3 : Unsupervised Learning, SOM Algorithm and its Variants
Part-4 : Deep Learning, Introduction, Concept of Convolutional Neural Network, Types of Layers (Convolutional Layers, Activation Function, Pooling, Fully Connected)
Part-5 : Concept of Convolution (1D and 2D) Layers, Training of Network, Case Study of CNN, e.g., on Diabetic Retinopathy, Building a Smart Speaker, Self Driving Car, etc.

PART-1
Artificial Neural Network, Perceptrons, Multilayer Perceptron, Gradient Descent and the Delta Rule.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Describe Artificial Neural Network (ANN) with different layers.

Answer
Artificial Neural Network : Refer Q. 1.13, Page 1–14L, Unit-1.
A neural network contains the following three layers :
a. Input layer : The activity of the input units represents the raw information that is fed into the network.
b. Hidden layer :
i. The hidden layer determines the activity of each hidden unit.
ii. The activity of each hidden unit depends on the activities of the input units and the weights on the connections between the input and the hidden units.
iii. There may be one or more hidden layers.
c. Output layer : The behaviour of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

Que 4.2. What are the advantages and disadvantages of Artificial Neural Network ?

Answer
Advantages of Artificial Neural Networks (ANN) :
1. Problems in ANN are represented by attribute-value pairs.
2. ANNs are used for problems in which the target function output may be discrete-valued, real-valued, or a vector of several real or discrete-valued attributes.
3. ANN learning methods are quite robust to noise in the training data. The training examples may contain errors, which do not affect the final output.
4. It is used where fast evaluation of the learned target function is required.
5. ANNs are appropriate where long training times are acceptable; training time depends on factors such as the number of weights in the network, the number of training examples considered, and the settings of various learning algorithm parameters.
Disadvantages of Artificial Neural Networks (ANN) :
1. Hardware dependence :
a. By their structure, artificial neural networks require processors with parallel processing power.
b. For this reason, their realization depends on the available equipment.
2. Unexplained functioning of the network :
a. This is the most important problem of ANN.
b. When an ANN produces a solution, it does not give a clue as to why and how.
c. This reduces trust in the network.
3. Assurance of proper network structure :
a. There is no specific rule for determining the structure of artificial neural networks.
b. The appropriate network structure is achieved through experience and trial and error.
4. The difficulty of showing the problem to the network :
a. ANNs can work only with numerical information.
b. Problems have to be translated into numerical values before being introduced to ANN.
c. The representation mechanism chosen will directly influence the performance of the network.
d. This is dependent on the user’s ability.
5. The duration of the network is unknown :
a. Reducing the network error on the training sample to a certain value is taken to mean that the training has been completed.
b. This value does not necessarily give optimum results.

Que 4.3. What are the characteristics of Artificial Neural Network ?

Answer
Characteristics of Artificial Neural Network are :
1. It is a neurally implemented mathematical model.
2. It contains a large number of interconnected processing elements called neurons to do all the operations.

3. Information stored in the neurons is basically the weighted linkage of neurons.
4. The input signals arrive at the processing elements through connections and connecting weights.
5. It has the ability to learn, recall and generalize from the given data by suitable assignment and adjustment of weights.
6. The collective behaviour of the neurons describes its computational power, and no single neuron carries specific information.

Que 4.4. Explain the application areas of artificial neural network.

Answer
Application areas of artificial neural network are :
1. Speech recognition :
a. Speech occupies a prominent role in human-human interaction.
b. Therefore, it is natural for people to expect speech interfaces with computers.
c. In the present era, for communication with machines, humans still need sophisticated languages which are difficult to learn and use.
d. To ease this communication barrier, a simple solution could be communication in a spoken language that the machine can understand.
e. Hence, ANN is playing a major role in speech recognition.
2. Character recognition :
a. It is a problem which falls under the general area of Pattern Recognition.
b. Many neural networks have been developed for automatic recognition of handwritten characters, either letters or digits.
3. Signature verification application :
a. Signatures are useful ways to authorize and authenticate a person in legal transactions.
b. Signature verification technique is a non-vision based technique.
c. For this application, the first step is to extract the feature set, or rather the geometrical feature set, representing the signature.
d. With these feature sets, we have to train the neural networks using an efficient neural network algorithm.
e. This trained neural network will classify the signature as being genuine or forged in the verification stage.
4. Human face recognition :
a. It is one of the biometric methods to identify the given face.
b. It is a difficult task because “non-face” images are hard to characterize.
c. However, if a neural network is well trained, it can divide images into two classes, namely images having faces and images that do not have faces.

Que 4.5. Explain different types of neuron connection with architecture.

Answer
Different types of neuron connection are :
1. Single-layer feed forward network :
a. In this type of network, we have only two layers, i.e., input layer and output layer, but the input layer does not count because no computation is performed in this layer.
b. The output layer is formed when different weights are applied on the input nodes and the cumulative effect per node is taken.
c. After this, the neurons of the output layer collectively compute the output signals.
[Figure : single-layer feed forward network; inputs x1, x2, ..., xn in the input layer are connected to outputs y1, y2, ..., ym in the output layer by weights w11, w12, ..., wnm.]
2. Multilayer feed forward network :
a. This network has a hidden layer which is internal to the network and has no direct contact with the external layer.
b. Existence of one or more hidden layers enables the network to be computationally stronger.
c. There are no feedback connections in which outputs of the model are fed back into itself.
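To make the feed forward computation concrete, the following is a minimal sketch of how output signals could be computed from weighted inputs, one layer at a time. The sigmoid activation, the bias terms and the names `layer_forward` and `feed_forward` are assumptions made for illustration; the text does not prescribe them.

```python
import math


def sigmoid(v):
    """Smooth squashing function, assumed here as the activation."""
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(inputs, weights, biases):
    """Compute one layer's outputs.

    weights[j][i] is the weight from input i to node j, so each node
    takes the cumulative (weighted) effect of all inputs plus its bias.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the signal forward through a list of (weights, biases) pairs.

    There are no feedback connections: each layer only feeds the next one.
    """
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Single-layer network: 2 inputs -> 2 outputs, all weights zero.
single_layer = [([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])]
single_out = feed_forward([1.0, 2.0], single_layer)

# Multilayer network: 2 inputs -> 3 hidden units -> 1 output.
multi_layer = [
    ([[1.0, -1.0], [0.5, 0.5], [0.0, 0.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 1.0]], [0.0]),
]
multi_out = feed_forward([1.0, 1.0], multi_layer)
```

A single-layer network is a list with one (weights, biases) pair; adding more pairs gives a multilayer network whose intermediate pairs play the role of hidden layers.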

[Figure : multilayer feed forward network; inputs x1, x2, ..., xn in the input layer, units y1, y2, ..., yk in the hidden layer and outputs z1, z2, ..., zm in the output layer, connected by weights w (input to hidden) and v (hidden to output).]
3. Single node with its own feedback :
a. When outputs can be directed back as inputs to the same layer or preceding layer nodes, the result is a feedback network.
b. Recurrent networks are feedback networks with a closed loop. Fig. 4.5.1 shows a single recurrent network having a single neuron with feedback to itself.
[Figure : a single neuron with an input, an output and a feedback connection from the output back to the neuron.]
Fig. 4.5.1.
4. Single-layer recurrent network :
a. This network is a single-layer network with feedback connections, in which a processing element’s output can be directed back to itself or to other processing elements or both.
b. Recurrent neural network is a class of artificial neural network where connections between nodes form a directed graph along a sequence.
c. This allows it to exhibit dynamic temporal behaviour for a time sequence. Unlike feed forward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.
[Figure : single-layer recurrent network with inputs x1, x2, ..., xn, outputs y1, y2, ..., ym and weights w11, w22, ..., wnm.]
5. Multilayer recurrent network :
a. In this type of network, a processing element’s output can be directed to processing elements in the same layer and in the preceding layer, forming a multilayer recurrent network.
b. They perform the same task for every element of a sequence, with the output depending on the previous computations. Inputs are not needed at each time step.
c. The main feature of a multilayer recurrent neural network is its hidden state, which captures information about a sequence.
[Figure : multilayer recurrent network with inputs x1, x2, ..., xn, hidden units y1, y2, ..., yk and outputs z1, z2, ..., zm, connected by weights w and v.]

Que 4.6. Discuss the benefits of artificial neural network.

Answer
1. Artificial neural networks are flexible and adaptive.
2. Artificial neural networks are used in sequence and pattern recognition systems, data processing, robotics, modeling, etc.
3. ANNs acquire knowledge from their surroundings by adapting to internal and external parameters, and they solve complex problems which are difficult to manage.

4. It generalizes knowledge to produce adequate responses to unknown situations.
5. Artificial neural networks are flexible and have the ability to learn, generalize and adapt to situations based on their findings.
6. This learning ability allows the network to acquire knowledge efficiently. This is a distinct advantage over a traditional linear model, which is inadequate when it comes to modelling non-linear data.
7. An artificial neural network is capable of greater fault tolerance than a traditional network. Without the loss of stored data, the network is able to recover from a fault in any of its components.
8. An artificial neural network is based on adaptive learning.

Que 4.7. Write short note on gradient descent.

Answer
1. Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
2. A gradient is the slope of a function : the degree of change of the function's value with the amount of change in a parameter.
3. Mathematically, the gradient is the vector of partial derivatives of a function with respect to its parameters. The larger the gradient, the steeper the slope.
4. With a suitably small learning rate, gradient descent converges to the global minimum when the cost function is convex; for a non-convex cost function it may settle in a local minimum.
5. Gradient descent can be described as an iterative method which is used to find the values of the parameters of a function that minimize the cost function as much as possible.
6. The parameters are initially assigned particular values, and from there gradient descent runs in an iterative fashion, using calculus, to find the parameter values that give the minimum possible value of the given cost function.

Que 4.8. Explain different types of gradient descent.

Answer
Different types of gradient descent are :
1. Batch gradient descent :
a. This is a type of gradient descent which processes all the training examples for each iteration of gradient descent.
b. When the number of training examples is large, batch gradient descent is computationally very expensive. So, it is not preferred.
c. Instead, we prefer to use stochastic gradient descent or mini-batch gradient descent.
2. Stochastic gradient descent :
a. This is a type of gradient descent which processes a single training example per iteration.
b. Hence, the parameters are updated after every iteration, in which only a single example has been processed.
c. Hence, each update is faster than in batch gradient descent. However, when the number of training examples is large, processing only one example at a time can add overhead for the system, because the number of iterations becomes large.
3. Mini-batch gradient descent :
a. This is a mixture of both stochastic and batch gradient descent.
b. The training set is divided into multiple groups called batches.
c. Each batch has a number of training samples in it.
d. At a time, a single batch is passed through the network, which computes the loss of every sample in the batch and uses their average to update the parameters of the neural network.

Que 4.9. What are the advantages and disadvantages of batch gradient descent ?

Answer
Advantages of batch gradient descent :
1. Fewer oscillations and less noisy steps are taken towards the global minimum of the loss function, because the parameters are updated using the average over all the training samples rather than the value of a single sample.
2. It can benefit from vectorization, which increases the speed of processing all training samples together.
3. It produces a more stable gradient descent convergence and a more stable error gradient than stochastic gradient descent.
4. It is computationally efficient, as computer resources are not used to process a single sample at a time but are used for all training samples together.
Disadvantages of batch gradient descent :
1. Sometimes a stable error gradient can lead to a local minimum and, unlike stochastic gradient descent, there are no noisy steps to help get out of the local minimum.
2. The entire training set can be too large to process in memory, due to which additional memory might be needed.
3. Depending on computer resources, it can take too long to process all the training samples as a batch.

Que 4.10. What are the advantages and disadvantages of stochastic gradient descent ?
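As a concrete reference point for comparing the batch, stochastic and mini-batch variants discussed in Que 4.8 to Que 4.10, the following is a minimal sketch in which the three differ only in how many training examples feed each parameter update. The one-parameter model y ≈ w·x, the squared-error loss, the toy data and the function name `gradient_descent` are all assumptions chosen for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x by minimising squared error.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x - y)^2 over the current batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # step against the gradient
    return w


# Toy, noise-free data generated from y = 2x (illustrative only).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]

w_batch = gradient_descent(xs, ys, batch_size=len(xs))  # one update per epoch
w_sgd = gradient_descent(xs, ys, batch_size=1)          # one update per example
w_mini = gradient_descent(xs, ys, batch_size=2)         # one update per batch of 2
```

On this noise-free data all three settings approach w = 2; with noisy data the stochastic and mini-batch runs would typically fluctuate around the minimum rather than settle exactly.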

Answer
Advantages of stochastic gradient descent :
1. It is easier to fit into memory, because a single training sample is processed by the network at a time.
2. It is computationally fast, as only one sample is processed at a time.
3. For larger datasets it can converge faster, as it updates the parameters more frequently.
4. Due to frequent updates, the steps taken towards the minimum of the loss function have oscillations, which can help in getting out of local minima of the loss function (in case the computed position turns out to be a local minimum).
Disadvantages of stochastic gradient descent :
1. Due to frequent updates, the steps taken towards the minimum are very noisy. This can often lead the gradient descent into other directions.
2. Also, due to noisy steps, it may take longer to achieve convergence to the minimum of the loss function.
3. Frequent updates are computationally expensive, because all resources are used for processing one training sample at a time.
4. It loses the advantage of vectorized operations, as it deals with only a single example at a time.

Que 4.11. Explain delta rule. Explain generalized delta learning rule (error backpropagation learning rule).

Answer
Delta rule :
1. The delta rule is a specialized version of backpropagation’s learning rule that applies to single-layer neural networks.
2. It calculates the error between the calculated output and the sample output data, and uses this to create a modification to the weights, thus implementing a form of gradient descent.
Generalized delta learning rule (Error backpropagation learning) :
In the generalized delta learning rule (error backpropagation learning), we are given the training set :
$$\{(x^1, y^1), \ldots, (x^K, y^K)\}$$
where $x^k = [x_1^k, \ldots, x_n^k]$ and $y^k \in \mathbb{R}$, $k = 1, \ldots, K$.
Step 1 : $\eta > 0$ and $E_{max} > 0$ are chosen.
Step 2 : Weights w are initialized at small random values, k := 1, and the running error E is set to 0.
Step 3 : Input $x^k$ is presented, $x := x^k$, $y := y^k$, and output O is computed as :
$$O = \frac{1}{1 + \exp(-W^T o)}$$
where o is the output vector of the hidden layer, with components :
$$o_l = \frac{1}{1 + \exp(-w_l^T x)}$$
Step 4 : Weights of the output unit are updated :
$$W := W + \eta \delta o$$
where $\delta = (y - O)\,O\,(1 - O)$.
Step 5 : Weights of the hidden units are updated :
$$w_l := w_l + \eta \delta W_l\, o_l (1 - o_l)\, x, \quad l = 1, \ldots, L$$
Step 6 : Cumulative cycle error is computed by adding the present error to E :
$$E := E + \frac{1}{2}(y - O)^2$$
Step 7 : If k < K then k := k + 1 and we continue the training by going back to step 3, otherwise we go to step 8.
Step 8 : The training cycle is completed. For $E < E_{max}$ terminate the training session. If $E > E_{max}$ then E := 0, k := 1 and we initiate a new training cycle by going back to step 3.

PART-2
Multilayer Network, Derivation of Back Propagation Algorithm, Generalization.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.12. Write short note on backpropagation algorithm.

Answer
1. Backpropagation is an algorithm used in the training of feedforward neural networks for supervised learning.
2. Backpropagation efficiently computes the gradient of the loss function with respect to the weights of the network for a single input-output example.
3. This makes it feasible to use gradient methods for training multi-layer networks; to update the weights so as to minimize the loss, we use gradient descent or variants such as stochastic gradient descent.
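The generalized delta rule of Que 4.11 (one hidden layer of sigmoid units feeding a single sigmoid output, trained by error backpropagation) can be sketched as below. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function name `train_generalized_delta`, the hidden-layer size, the learning rate, the cycle limit, the toy OR data and the constant +1 input used as a bias are all choices made here, and the hidden-unit update uses the output weights as they were before the output-unit update.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def train_generalized_delta(samples, hidden=3, eta=0.5, e_max=0.01,
                            max_cycles=5000, seed=1):
    """Train a one-hidden-layer network with a single sigmoid output."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    # Small random initial weights.
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(hidden)]
    W = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    errors = []  # cumulative error E of each training cycle
    for _ in range(max_cycles):
        E = 0.0
        for x, y in samples:
            # Forward pass: hidden outputs o_l, then network output O.
            o = [sigmoid(sum(wi * xi for wi, xi in zip(w_l, x))) for w_l in w]
            O = sigmoid(sum(Wl * ol for Wl, ol in zip(W, o)))
            delta = (y - O) * O * (1.0 - O)
            W_old = W[:]  # keep pre-update output weights for the hidden step
            # Output-unit update: W := W + eta * delta * o
            W = [Wl + eta * delta * ol for Wl, ol in zip(W, o)]
            # Hidden-unit update: w_l := w_l + eta*delta*W_l*o_l*(1-o_l)*x
            for l in range(hidden):
                g = eta * delta * W_old[l] * o[l] * (1.0 - o[l])
                w[l] = [wi + g * xi for wi, xi in zip(w[l], x)]
            # Accumulate the cycle error.
            E += 0.5 * (y - O) ** 2
        errors.append(E)
        if E < e_max:  # stop once the cycle error is small enough
            break
    return w, W, errors


# Toy data: logical OR, with a constant +1 first input acting as a bias.
or_samples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
hidden_w, output_W, cycle_errors = train_generalized_delta(or_samples)
```

The returned `cycle_errors` list records E for each training cycle, so the effect of the learning rate or the hidden-layer size can be inspected by comparing its first and last entries.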

4. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, iterating backwards one layer at a time from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.
5. The term backpropagation refers only to the algorithm for computing the gradient, but it is often used loosely to refer to the entire learning algorithm, also including how the gradient is used, such as by stochastic gradient descent.
6. Backpropagation generalizes the gradient computation in the delta rule, which is the single-layer version of backpropagation, and is in turn generalized by automatic differentiation, where backpropagation is a special case of reverse accumulation (reverse mode).

Que 4.13. Explain perceptron with signal flow graph.

Answer
1. The perceptron is the simplest form of a neural network used for classification of patterns said to be linearly separable.
2. It consists of a single neuron with adjustable synaptic weights and bias.
3. The perceptron built around a single neuron is limited to performing pattern classification with only two classes.
4. By expanding the output layer of the perceptron to include more than one neuron, more than two classes can be classified.
5. Suppose a perceptron has synaptic weights denoted by w1, w2, w3, ..., wm.
6. The inputs applied to the perceptron are denoted by x1, x2, ..., xm.
7. The externally applied bias is denoted by b.
[Figure : inputs x1, x2, ..., xm, weighted by w1, w2, ..., wm, are summed together with the bias b to give V, which passes through a hard limiter to produce the output y.]
Fig. 4.13.1. Signal flow graph of the perceptron.
8. From the model, we find the hard limiter input or induced local field of the neuron as
$$V = \sum_{i=1}^{m} w_i x_i + b$$
9. The goal of the perceptron is to correctly classify the set of externally applied inputs x1, x2, ..., xm into one of two classes G1 and G2.
10. The decision rule for classification is : if the output y is +1, assign the point represented by inputs x1, x2, ..., xm to class G1; if y is -1, assign it to class G2.
11. In Fig. 4.13.2, a point (x1, x2) that lies below the boundary line is assigned to class G2 and a point above the line is assigned to class G1. The decision boundary is calculated as :
w1x1 + w2x2 + b = 0
[Figure : decision boundary w1x1 + w2x2 + b = 0 in the (x1, x2) plane, separating class G1 from class G2.]
Fig. 4.13.2.
12. There are two decision regions separated by a hyperplane defined as :
$$\sum_{i=1}^{m} w_i x_i + b = 0$$
The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an iteration by iteration basis.
13. For the adaptation, an error-correction rule known as the perceptron convergence algorithm is used.
14. For a perceptron to function properly, the two classes G1 and G2 must be linearly separable.
15. Linearly separable means that the patterns or sets of inputs to be classified can be separated by a straight line.
16. Generalizing, a set of points in n-dimensional space is linearly separable if there is a hyperplane of (n - 1) dimensions that separates the sets.
[Figure : (a) A pair of linearly separable patterns; (b) A pair of non-linearly separable patterns.]
Fig. 4.13.3.
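The perceptron described in Que 4.13 can be sketched in a few lines: the induced local field V is the weighted sum of the inputs plus the bias, a hard limiter maps it to +1 (class G1) or -1 (class G2), and an error-correction rule adjusts the weights whenever a point is misclassified. The function names, the learning rate `eta`, the epoch limit and the AND-style toy data are assumptions made for illustration, and treating V = 0 as class G2 is a convention chosen here.

```python
def hard_limiter(v):
    """Return +1 (class G1) if v > 0, otherwise -1 (class G2)."""
    return 1 if v > 0 else -1


def predict(weights, bias, x):
    # Induced local field: V = sum_i w_i * x_i + b
    v = sum(w * xi for w, xi in zip(weights, x)) + bias
    return hard_limiter(v)


def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Error-correction learning: weights change only on mistakes."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in samples:  # d is the desired output, +1 or -1
            if predict(weights, bias, x) != d:
                # Move the decision boundary towards the correct side.
                weights = [w + eta * d * xi for w, xi in zip(weights, x)]
                bias += eta * d
                mistakes += 1
        if mistakes == 0:  # every point is classified correctly
            break
    return weights, bias


# Linearly separable toy data: only the point (1, 1) belongs to class G1.
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
trained_weights, trained_bias = train_perceptron(and_samples)
predictions = [predict(trained_weights, trained_bias, x) for x, _ in and_samples]
```

Because the two classes in this toy set can be separated by a straight line, the loop stops after a few passes with every point classified correctly; on data that is not linearly separable it would simply run until `max_epochs`.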

Que 4.14. State and prove perceptron convergence theorem.

Answer
Statement : The perceptron convergence theorem states that for any data set which is linearly separable, the perceptron learning rule is guaranteed to find a solution in a finite number of steps.
Proof :
1. We derive the error-correction learning algorithm for the perceptron.
2. The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an iteration by iteration basis.
3. The bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1.
$$x(n) = [+1, x_1(n), x_2(n), \ldots, x_m(n)]^T$$
where n denotes the iteration step in applying the algorithm.
4. Correspondingly, we define the weight vector as
$$w(n) = [b(n), w_1(n), w_2(n), \ldots, w_m(n)]^T$$
Accordingly, the linear combiner output is written in the compact form :
$$v(n) = \sum_{i=0}^{m} w_i(n)\, x_i(n) = w^T(n)\, x(n)$$
The algorithm for adapting the weight vector is stated as :
1. If the nth member of the input set, x(n), is correctly classified into linearly separable classes by the weight vector w(n) (that is, the output is correct), then no adjustment of the weights is done.
$w(n + 1) = w(n)$ if $w^T(n)\,x(n) > 0$ and x(n) belongs to class G1.
$w(n + 1) = w(n)$ if $w^T(n)\,x(n) \le 0$ and x(n) belongs to class G2.
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule :
$w(n + 1) = w(n) - \eta(n)\,x(n)$ if $w^T(n)\,x(n) > 0$ and x(n) belongs to class G2.
$w(n + 1) = w(n) + \eta(n)\,x(n)$ if $w^T(n)\,x(n) \le 0$ and x(n) belongs to class G1.
where $\eta(n)$ is the learning-rate parameter for controlling the adjustment applied to the weight vector at iteration n.
Also, a small $\eta$ leads to slow learning and a large $\eta$ leads to fast learning. For a constant $\eta$, the learning algorithm is termed the fixed increment algorithm.

Que 4.15. Explain multilayer perceptron with its architecture and characteristics.

Answer
Multilayer perceptron :
1. Perceptrons which are arranged in layers are called a multilayer perceptron. This model has three layers : an input layer, an output layer and a hidden layer.
2. For the perceptrons in the input layer, a linear transfer function is used; for the perceptrons in the hidden layer and output layer, the sigmoidal or squashed-S function is used.
3. The input signal propagates through the network in a forward direction.
4. On a layer by layer basis, in the multilayer perceptron the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1.
$$x(n) = [+1, x_1(n), x_2(n), \ldots, x_m(n)]^T$$
where n denotes the iteration step in applying the algorithm.
Correspondingly, we define the weight vector as :
$$w(n) = [b(n), w_1(n), w_2(n), \ldots, w_m(n)]^T$$
5. Accordingly, the linear combiner output is written in the compact form :
$$v(n) = \sum_{i=0}^{m} w_i(n)\, x_i(n) = w^T(n)\, x(n)$$
The algorithm for adapting the weight vector is stated as :
1. If the nth member of the input set, x(n), is correctly classified into linearly separable classes by the weight vector w(n) (that is, the output is correct), then no adjustment of the weights is done.
$w(n + 1) = w(n)$ if $w^T(n)\,x(n) > 0$ and x(n) belongs to class G1.
$w(n + 1) = w(n)$ if $w^T(n)\,x(n) \le 0$ and x(n) belongs to class G2.
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule.
Architecture of multilayer perceptron :
1. Fig. 4.15.1 shows the architectural graph of a multilayer perceptron with two hidden layers and an output layer.
2. Signal flow through the network progresses in a forward direction, from left to right and on a layer-by-layer basis.
3. Two kinds of signals are identified in this network :
a. Functional signals : A functional signal is an input signal that propagates forward and emerges at the output end of the network as an output signal.

b. Error signals : An error signal originates at an output neuron and propagates backward through the network.
[Figure : the input signal enters at the input layer, passes through the first hidden layer and the second hidden layer, and leaves the output layer as the output signal.]
Fig. 4.15.1.
4. Multilayer perceptrons have been applied successfully to solve some difficult and diverse problems by training them in a supervised manner with a highly popular algorithm known as the error backpropagation algorithm.
Characteristics of multilayer perceptron :
1. In this model, each neuron in the network includes a non-linear activation function (the non-linearity is smooth). The most commonly used non-linear function is defined by :
$$y_j = \frac{1}{1 + \exp(-v_j)}$$
where $v_j$ is the induced local field (i.e., the weighted sum of all inputs plus the bias) of neuron j and $y_j$ is the output of neuron j.
2. The network contains hidden neurons that are not a part of the input or output of the network. The hidden layers of neurons enable the network to learn complex tasks.
3. The network exhibits a high degree of connectivity.

Que 4.16. How do tuning parameters affect the backpropagation neural network ?

Answer
Effect of tuning parameters of the backpropagation neural network :
1. Momentum factor :
a. The momentum factor has a significant role in deciding the values of learning rate that will produce rapid learning.
b. It determines the size of change in weights or biases.
c. If the momentum factor is zero, the smoothening is minimum and the entire weight adjustment comes from the newly calculated change.
d. If the momentum factor is one, the new adjustment is ignored and the previous one is repeated.
e. Between 0 and 1 is a region where the weight adjustment is smoothened by an amount proportional to the momentum factor.
f. The momentum factor effectively increases the speed of learning without leading to oscillations and filters out high frequency variations of the error surface in the weight space.
2. Learning coefficient :
a. A formula to select the learning coefficient is :
$$\eta = \frac{1.5}{\sqrt{N_1^2 + N_2^2 + \cdots + N_m^2}}$$
where $N_1$ is the number of patterns of type 1 and m is the number of different pattern types.
b. A small value of the learning coefficient, less than 0.2, produces slower but stable training.
c. With a large value of the learning coefficient, i.e., greater than 0.5, the weights are changed drastically, but this may cause the optimum combination of weights to be overshot, resulting in oscillations about the optimum.
d. A learning rate of 0.6 is cited as the optimum value, producing fast learning without leading to oscillations.
3. Sigmoidal gain :
a. If the sigmoidal function is selected, the input-output relationship of the neuron can be set as
$$O = \frac{1}{1 + e^{-\lambda (I + \theta)}} \qquad ...(4.16.1)$$
where $\lambda$ is a scaling factor known as sigmoidal gain.
b. As the scaling factor increases, the input-output characteristic of the analog neuron approaches that of the two-state neuron, i.e., the activation function approaches a saturating (step-like) function.
c. It also affects the backpropagation. To get graded output as the sigmoidal gain factor is increased, the learning rate and momentum factor have to be decreased in order to prevent oscillations.
4. Threshold value :
a. $\theta$ in eq. (4.16.1) is called the threshold value, the bias or the noise factor.

b. A neuron fires or generates an output if the weighted sum of the input exceeds the threshold value.
c. One method is to simply assign a small value to it and not to change it during training.
d. The other method is to initially choose some random values and change them during training.

Que 4.17. Discuss selection of various parameters in Backpropagation Neural Network (BPN).

Answer
Selection of various parameters in BPN :
1. Number of hidden nodes :
a. The guiding criterion is to select the minimum number of nodes in the first and third layers, so that the memory demand for storing the weights can be kept minimum.
b. The number of separable regions in the input space, M, is a function of the number of hidden nodes H in BPN, and H = M - 1.
c. When the number of hidden nodes is equal to the number of training patterns, the learning could be fastest.
d. In such cases, BPN simply remembers the training patterns, losing all generalization capabilities.
e. Hence, as far as generalization is concerned, the number of hidden nodes should be small compared to the number of training patterns; this can be assessed with the help of the Vapnik Chervonenkis dimension (VCdim) of probability theory.
f. We can estimate the number of hidden nodes for a given number of training patterns from the number of weights, which is equal to I1 * I2 + I2 * I3, where I1 and I3 denote the numbers of input and output nodes and I2 denotes the number of hidden nodes.
g. Assume the number of training samples T to be greater than VCdim. The number of weights is $I_2 (I_1 + I_3)$, so if we accept a ratio of 10 : 1 between training samples and weights,
$$T = 10\, I_2 (I_1 + I_3)$$
$$I_2 = \frac{T}{10\,(I_1 + I_3)}$$
which yields the value for I2.
2. Momentum coefficient $\alpha$ :
a. To reduce the training time we use the momentum factor, because it enhances the training process.
b. The influence of momentum on weight change is
$$[\Delta W]^{n+1} = -\eta \frac{\partial E}{\partial W} + \alpha\, [\Delta W]^{n}$$
c. The momentum also overcomes the effect of local minima.
d. The use of the momentum term will carry a weight change process through one or more local minima and towards the global minimum.
[Figure : the weight change without momentum, $-\eta\, \partial E / \partial W$, added to the momentum term $\alpha [\Delta W]^n$ gives the new weight change $[\Delta W]^{n+1}$.]
Fig. 4.17.1. Influence of momentum term on weight change.
3. Sigmoidal gain :
a. When the weights become large and force the neuron to operate in a region where the sigmoidal function is very flat, a better method of coping with network paralysis is to adjust the sigmoidal gain.
b. By decreasing this scaling factor, we effectively spread out the sigmoidal function over a wide range so that training proceeds faster.
4. Local minima :
a. One of the most practical solutions involves the introduction of a shock which changes all weights by specific or random amounts.
b. If this fails, then the most practical solution is to rerandomize the weights and start the training all over.

PART-3
Unsupervised Learning, SOM Algorithm and its Variants.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.18. Write short note on unsupervised learning.

Answer
1. Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.

2. Here the task of machine is to group unsorted information according to similarities, patterns and differences without any prior training of data.
3. Unlike supervised learning, no teacher is provided, which means no training will be given to the machine.
4. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.

Que 4.19. Classify unsupervised learning into two categories of algorithm.

Answer
Classification of unsupervised learning algorithm into two categories :
1. Clustering : A clustering problem is where we want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
2. Association : An association rule learning problem is where we want to discover rules that describe large portions of our data, such as people that buy X also tend to buy Y.

Que 4.20. What are the applications of unsupervised learning ?

Answer
Following are the applications of unsupervised learning :
1. Unsupervised learning automatically splits the dataset into groups based on their similarities.
2. Anomaly detection can discover unusual data points in our dataset. It is useful for finding fraudulent transactions.
3. Association mining identifies sets of items which often occur together in our dataset.
4. Latent variable models are widely used for data preprocessing, like reducing the number of features in a dataset or decomposing the dataset into multiple components.

Que 4.21. What is Self-Organizing Map (SOM) ?

Answer
1. Self-Organizing Map (SOM) provides a data visualization technique which helps to understand high dimensional data by reducing the dimensions of data to a map.
2. SOM also represents clustering concept by grouping similar data together.
3. A Self-Organizing Map (SOM) or Self-Organizing Feature Map (SOFM) is a type of Artificial Neural Network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction.
4. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighborhood function to preserve the topological properties of the input space.

Que 4.22. Write the steps used in SOM algorithm.

Answer
Following are the steps used in SOM algorithm :
1. Each node’s weights are initialized.
2. A vector is chosen at random from the set of training data.
3. Every node is examined to calculate which one’s weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).
4. Then the neighbourhood of the BMU is calculated. The number of neighbours decreases over time.
5. The winning weight is rewarded with becoming more like the sample vector. The neighbours also become more like the sample vector. The closer a node is to the BMU, the more its weights get altered and the farther away the neighbour is from the BMU, the less it learns.
6. Repeat step 2 for N iterations.

Que 4.23. What are the basic processes used in SOM ? Also explain stages of SOM algorithm.

Answer
Basic processes used in SOM algorithm are :
1. Initialization : All the connection weights are initialized with small random values.
2. Competition : For each input pattern, the neurons compute their respective values of a discriminant function which provides the basis for competition. The particular neuron with the smallest value of the discriminant function is declared the winner.

3. Cooperation : The winning neuron determines the spatial location of a topological neighbourhood of excited neurons, thereby providing the basis for cooperation among neighbouring neurons.
4. Adaptation : The excited neurons decrease their individual values of the discriminant function in relation to the input pattern through suitable adjustment of the associated connection weights, such that the response of the winning neuron to the subsequent application of a similar input pattern is enhanced.
Stages of SOM algorithm are :
1. Initialization : Choose random values for the initial weight vectors $w_j$.
2. Sampling : Draw a sample training input vector x from the input space.
3. Matching : Find the winning neuron I(x) that has weight vector closest to the input vector, i.e., the minimum value of
$$d_j(x) = \sum_{i=1}^{D} (x_i - w_{ji})^2$$
4. Updating : Apply the weight update equation
$$\Delta w_{ji} = \eta(t)\, T_{j,I(x)}(t)\, (x_i - w_{ji})$$
where $T_{j,I(x)}(t)$ is a Gaussian neighbourhood and $\eta(t)$ is the learning rate.
5. Continuation : Keep returning to step 2 until the feature map stops changing.

PART-4
Deep Learning, Introduction, Concept of Convolutional Neural Network, Types of Layers, (Convolutional Layers, Activation Function, Pooling, Fully Connected).

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.24. What do you understand by deep learning ?

Answer
1. Deep learning is the subfield of artificial intelligence that focuses on creating large neural network models that are capable of making accurate data-driven decisions.
2. Deep learning is used where the data is complex and has large datasets.
3. Facebook uses deep learning to analyze text in online conversations. Google and Microsoft all use deep learning for image search and machine translation.
4. All modern smart phones have deep learning systems running on them. For example, deep learning is the standard technology for speech recognition, and also for face detection on digital cameras.
5. In the healthcare sector, deep learning is used to process medical images (X-rays, CT, and MRI scans) and diagnose health conditions.
6. Deep learning is also at the core of self-driving cars, where it is used for localization and mapping, motion planning and steering, and environment perception, as well as tracking driver state.

Que 4.25. Describe different architecture of deep learning.

Answer
Different architecture of deep learning are :
1. Deep Neural Network : It is a neural network with a certain level of complexity (having multiple hidden layers in between input and output layers). They are capable of modeling and processing non-linear relationships.
2. Deep Belief Network (DBN) : It is a class of Deep Neural Network. It is multi-layer belief networks. Steps for performing DBN are :
a. Learn a layer of features from visible units using Contrastive Divergence algorithm.
b. Treat activations of previously trained features as visible units and then learn features of features.
c. Finally, the whole DBN is trained when the learning for the final hidden layer is achieved.
3. Recurrent (perform same task for every element of a sequence) Neural Network : Allows for parallel and sequential computation. Similar to the human brain (large feedback network of connected neurons). They are able to remember important things about the input they received and hence enable them to be more precise.

Que 4.26. What are the advantages, disadvantages and limitation of deep learning ?

Answer
Advantages of deep learning :
1. Best-in-class performance on problems.
2. Reduces need for feature engineering.
3. Eliminates unnecessary costs.
4. Identifies defects easily that are difficult to detect.
Disadvantages of deep learning :
1. Large amount of data required.
2. Computationally expensive to train.
3. No strong theoretical foundation.
Limitations of deep learning :
1. Learning through observations only.
2. The issue of biases.

Que 4.27. What are the various applications of deep learning ?

Answer
Following are the applications of deep learning :
1. Automatic text generation : Corpus of text is learned and from this model new text is generated, word-by-word or character-by-character. Then this model is capable of learning how to spell, punctuate, form sentences, or it may even capture the style.
2. Healthcare : Helps in diagnosing various diseases and treating them.
3. Automatic machine translation : Certain words, sentences or phrases in one language are transformed into another language (deep learning is achieving top results in the areas of text and images).
4. Image recognition : Recognizes and identifies people and objects in images, and understands content and context. This area is already being used in gaming, retail, tourism, etc.
5. Predicting earthquakes : Teaches a computer to perform viscoelastic computations which are used in predicting earthquakes.

Que 4.28. Define convolutional networks.

Answer
1. Convolutional networks, also known as Convolutional Neural Networks (CNNs), are a specialized kind of neural network for processing data that has a known, grid-like topology.
2. The name convolutional neural network indicates that the network employs a mathematical operation called convolution.
3. Convolution is a specialized kind of linear operation.
4. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
5. CNNs (ConvNets) are quite similar to regular neural networks.
6. They are still made up of neurons with weights that can be learned from data. Each neuron receives some inputs and performs a dot product.
7. They still have a loss function on the last fully connected layer.
8. They can still use a non-linearity function. A regular neural network receives input data as a single vector and passes it through a series of hidden layers.
[Figure: network diagram with an input layer, hidden layer 1, hidden layer 2 and an output layer.]
Fig. 4.28.1. A regular three-layer neural network.
9. Every hidden layer consists of neurons, wherein every neuron is fully connected to all the other neurons in the previous layer.
10. Within a single layer, each neuron is completely independent and they do not share any connections.
11. The fully connected layer (the output layer) contains class scores in the case of an image classification problem. There are three main layers in a simple ConvNet.
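The statement in Que 4.28 that a convolutional layer replaces general matrix multiplication with convolution, where each output is a dot product over a small patch, can be illustrated with a minimal sketch. The function name `conv2d_valid`, the toy image and the 2 × 2 filter are assumptions made for illustration; they do not come from the text. Like most deep learning libraries, the sketch does not flip the kernel, so strictly it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # Dot product between the kernel and the image patch under it.
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4 x 4 image with a vertical edge between columns 1 and 2 (illustrative data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2 x 2 filter that responds to a dark-to-bright step.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four weights are reused at every position, so the non-zero column of the feature map marks where the edge lies in the input, whichever row it is found in.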

Que 4.29. Write short note on convolutional layer.

Answer
1. Convolutional layers are the major building blocks used in convolutional neural networks.
2. A convolution is the simple application of a filter to an input that results in an activation.
3. Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input, such as an image.
4. The innovation of convolutional neural networks is the ability to automatically learn a large number of filters in parallel specific to a training dataset under the constraints of a specific predictive modeling problem, such as image classification.
5. The result is highly specific features that can be detected anywhere on input images.

Que 4.30. Describe briefly activation function, pooling and fully connected layer.

Answer
Activation function :
1. An activation function is a function that is added into an artificial neural network in order to help the network learn complex patterns in the data.
2. When comparing with a neuron-based model that is in our brains, the activation function is at the end deciding what is to be fired to the next neuron.
3. That is exactly what an activation function does in an ANN as well.
4. It takes in the output signal from the previous cell and converts it into some form that can be taken as input to the next cell.
Pooling layer :
1. A pooling layer is a new layer added after the convolutional layer. Specifically, after a non-linearity (for example ReLU) has been applied to the feature maps output by a convolutional layer, for example, the layers in a model may look as follows :
a. Input image
b. Convolutional layer
c. Non-linearity
d. Pooling layer
2. The addition of a pooling layer after the convolutional layer is a common pattern used for ordering layers within a convolutional neural network that may be repeated one or more times in a given model.
3. The pooling layer operates upon each feature map separately to create a new set of the same number of pooled feature maps.
Fully connected layer :
1. Fully connected layers are an essential component of Convolutional Neural Networks (CNNs), which have been proven very successful in recognizing and classifying images for computer vision.
2. The CNN process begins with convolution and pooling, breaking down the image into features, and analyzing them independently.
3. The result of this process feeds into a fully connected neural network structure that drives the final classification decision.

PART-5
Concept of Convolution (1D and 2D) Layers, Training of Network, Case Study of CNN, e.g., on Diabetic Retinopathy, Building a Smart Speaker, Self-Driving Car etc.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.31. Explain 1D and 2D convolutional neural network.

Answer
1D convolutional neural network :
1. Convolutional Neural Network (CNN) models were developed for image classification, in which the model accepts a two-dimensional input representing an image's pixels and color channels, in a process called feature learning.

2. This same process can be applied to one-dimensional sequences of data.
3. The model extracts features from sequence data and maps the internal features of the sequence.
4. A 1D CNN is very effective for deriving features from a fixed-length segment of the overall dataset, where it is not so important where the feature is located in the segment.
5. 1D Convolutional Neural Networks work well for :
a. Analysis of a time series of sensor data.
b. Analysis of signal data over a fixed-length period, for example, an audio recording.
c. Natural Language Processing (NLP), although Recurrent Neural Networks which leverage Long Short Term Memory (LSTM) cells are more promising than CNN as they take into account the proximity of words to create trainable patterns.
2D convolutional neural network :
1. In a 2D convolutional network, each pixel within the image is represented by its x and y position as well as the depth, representing image channels (red, green, and blue).
2. It moves over the images both horizontally and vertically.

Que 4.32. How do we train a network ? Explain.

Answer
1. Once a network has been structured for a particular application, that network is ready to be trained.
2. To start this process the initial weights are chosen randomly. Then, the training, or learning, begins.
3. There are two approaches to training :
a. In supervised training, both the inputs and the outputs are provided. The network then processes the inputs and compares its resulting outputs against the desired outputs.
b. Errors are then propagated back through the system, causing the system to adjust the weights which control the network. This process occurs over and over as the weights are continually tweaked.
c. The set of data which enables the training is called the “training set.” During the training of a network the same set of data is processed many times as the connection weights are ever refined.
d. The other type of training is called unsupervised training. In unsupervised training, the network is provided with inputs but not with desired outputs.
e. The system itself must then decide what features it will use to group the input data. This is often referred to as self-organization or adaption.

Que 4.33. Describe diabetic retinopathy on the basis of deep learning.

Answer
1. Diabetic Retinopathy (DR) is one of the major causes of blindness in the western world. Increasing life expectancy, indulgent lifestyles and other contributing factors mean the number of people with diabetes is projected to continue rising.
2. Regular screening of diabetic patients for DR has been shown to be a cost-effective and important aspect of their care.
3. The accuracy and timing of this care is of significant importance to both the cost and effectiveness of treatment.
4. If detected early enough, effective treatment of DR is available, making this a vital process.
5. Classification of DR involves the weighting of numerous features and the location of such features. This is highly time consuming for clinicians.
6. Computers are able to obtain much quicker classifications once trained, giving the ability to aid clinicians in real-time classification.
7. The efficacy of automated grading for DR has been an active area of research in computer imaging with encouraging conclusions.
8. Significant work has been done on detecting the features of DR using automated methods such as support vector machines and k-NN classifiers.
9. The majority of these classification techniques are on two-class classification for DR or no DR.

Que 4.34. Using artificial neural network how do we recognize a speaker ?

Answer
1. With the technology advancements in smart home sector, voice control and automation are key components that can make a real difference in people’s lives.

2. The voice recognition technology market continues to evolve rapidly as almost all smart home devices are providing speaker recognition capability today.
3. However, most of them provide cloud-based solutions or use very deep Neural Networks for speaker recognition task, which are not suitable models to run on smart home devices.
4. Here, we compare relatively small Convolutional Neural Networks (CNN) and evaluate effectiveness of speaker recognition using these models on edge devices. In addition, we also apply transfer learning technique to deal with a problem of limited training data.
5. By developing a solution suitable for running inference locally on edge devices, we eliminate the well-known cloud computing issues, such as data privacy and network latency, etc.
6. The preliminary results proved that the chosen model adapts the benefit of computer vision task by using CNN and spectrograms to perform speaker classification with precision and recall ~ 84 % in time less than 60 ms on mobile device with Atom Cherry Trail processor.

Que 4.35. Artificial intelligence plays an important role in self-driving cars. Explain.

Answer
1. The rapid development of the Internet economy and Artificial Intelligence (AI) has promoted the progress of self-driving cars.
2. The market demand and economic value of self-driving cars are increasingly prominent. At present, more and more enterprises and scientific research institutions have invested in this field. Google, Tesla, Apple, Nissan, Audi, General Motors, BMW, Ford, Honda, Toyota, Mercedes, and Volkswagen have participated in the research and development of self-driving cars.
3. Google is an Internet company, which is one of the leaders in self-driving cars, based on its solid foundation in artificial intelligence.
4. In June 2015, two Google self-driving cars were tested on the road. So far, Google vehicles have accumulated more than 3.2 million km of tests, becoming the closest to the actual use.
5. Another company that has made great progress in the field of self-driving cars is Tesla. Tesla was the first company to devote self-driving technology to production.
6. Followed by the Tesla models series, its “auto-pilot” technology has made major breakthroughs in recent years.
7. Although Tesla's autopilot technology is only regarded as Level 2 stage by the National Highway Traffic Safety Administration (NHTSA), Tesla shows us that the car has basically realized automatic driving under certain conditions.
Machine Learning Techniques 5–1 L (CS/IT-Sem-5) Reinforcement Learning & Genetic Algorithm 5–2 L (CS/IT-Sem-5)

5
Reinforcement Learning and Genetic Algorithm

CONTENTS
Part-1 : Introduction to Reinforcement Learning
Part-2 : Learning Task, Example of Reinforcement Learning in Practice
Part-3 : Learning Models for Reinforcement (Markov Decision Process, Q Learning, Q Learning Function, Q Learning Algorithm), Application of Reinforcement Learning
Part-4 : Introduction to Deep Q Learning
Part-5 : Genetic Algorithm, Introduction, Components, GA Cycle of Reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Application.

PART-1
Introduction to Reinforcement Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.1. Describe reinforcement learning.

Answer
1. Reinforcement learning is the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms are related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour, and in neurobiological investigations of neuromodulation and addiction.
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment. An optimal policy is a policy that maximizes the expected total reward.
5. Without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make.
6. The agent needs to know that something good has happened when it wins and that something bad has happened when it loses.
7. This kind of feedback is called a reward or reinforcement.
8. Reinforcement learning is valuable in the field of robotics, where the tasks to be performed are frequently complex enough to defy encoding as programs and no training data is available.
9. In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels.
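The idea in points 4 to 7 of Que 5.1, that an agent uses observed rewards to settle on a good policy, can be sketched with a toy example. Everything in it is an assumption made for illustration: a hypothetical two-action game in which "right" always wins (+1) and "left" always loses (-1), an epsilon-greedy choice rule, and the names `REWARDS` and `run_agent`. It is not an algorithm taken from the text.

```python
import random

# Hypothetical two-action game: "right" wins (+1), "left" loses (-1).
REWARDS = {"left": -1.0, "right": 1.0}


def run_agent(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = {a: 0.0 for a in REWARDS}  # estimated reward per action
    counts = {a: 0 for a in REWARDS}       # how often each action was tried
    total = 0.0
    for _ in range(steps):
        untried = [a for a in REWARDS if counts[a] == 0]
        if untried:
            action = untried[0]                         # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(list(REWARDS))          # explore
        else:
            action = max(estimates, key=estimates.get)  # exploit best estimate
        reward = REWARDS[action]                        # feedback from the environment
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    policy = max(estimates, key=estimates.get)          # learned greedy policy
    return policy, estimates, total


policy, estimates, total = run_agent()
print(policy, estimates)
```

No labelled examples are supplied: the agent only sees the reward that follows each action, and the policy it ends with is simply the action whose observed rewards were best.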

[Figure: block diagram with blocks Environment, Critic and Learning system, and signals labelled state (input) vector, primary reinforcement signal, heuristic reinforcement signal and actions.]
Fig. 5.1.1. Block diagram of reinforcement learning.

Que 5.2. Differentiate between reinforcement and supervised learning.

Answer
S. No. | Reinforcement learning | Supervised learning
1. | Reinforcement learning is all about making decisions sequentially. In simple words we can say that the output depends on the state of the current input and the next input depends on the output of the previous input. | In supervised learning, the decision is made on the initial input or the input given at the start.
2. | In reinforcement learning decision is dependent. So, we give labels to sequences of dependent decisions. | Supervised learning decisions are independent of each other so labels are given to each decision.
3. | Example : Chess game. | Example : Object recognition.

Que 5.3. What is reinforcement learning ? Explain passive reinforcement learning and active reinforcement learning.

Answer
Reinforcement learning : Refer Q. 5.1, Page 5–2L, Unit-5.
Passive reinforcement learning :
1. In passive learning, the agent’s policy π is fixed. In state s, it always executes the action π(s).
2. Its goal is simply to learn how good the policy is – that is, to learn the utility function U(s).
3. Fig. 5.3.1 shows a policy for the world and the corresponding utilities.
4. In Fig. 5.3.1(a) the policy happens to be optimal with rewards of R(s) = – 0.04 in the non-terminal states and no discounting.
5. Passive learning agent does not know the transition model T(s, a, s’), which specifies the probability of reaching state s’ from state s after doing action a; nor does it know the reward function R(s) which specifies the reward for each state.
6. The agent executes a set of trials in the environment using its policy π.
7. In each trial, the agent starts in state (1, 1) and experiences a sequence of state transitions until it reaches one of the terminal states, (4, 2) or (4, 3).
8. Its percepts supply both the current state and the reward received in that state. Typical trials might look like this.
$(1,1)_{-0.04} \to (1,2)_{-0.04} \to (1,3)_{-0.04} \to (1,2)_{-0.04} \to (1,3)_{-0.04} \to (2,3)_{-0.04} \to (3,3)_{-0.04} \to (4,3)_{+1}$
$(1,1)_{-0.04} \to (1,2)_{-0.04} \to (1,3)_{-0.04} \to (2,3)_{-0.04} \to (3,3)_{-0.04} \to (3,2)_{-0.04} \to (3,3)_{-0.04} \to (4,3)_{+1}$
$(1,1)_{-0.04} \to (2,1)_{-0.04} \to (3,1)_{-0.04} \to (3,2)_{-0.04} \to (4,2)_{-1}$
[Figure: (a) 4 × 3 grid showing the policy, with terminal states +1 at (4, 3) and –1 at (4, 2); (b) the same grid with state utilities, listed below by row for columns 1 to 4.]
Row 3 : 0.812, 0.868, 0.918, +1
Row 2 : 0.762, (none), 0.660, –1
Row 1 : 0.705, 0.655, 0.611, 0.388
Fig. 5.3.1. (a) A policy π for the 4 × 3 world; (b) The utilities of the states in the 4 × 3 world, given policy π.
9. Each state percept is subscripted with the reward received. The object is to use the information about rewards to learn the expected utility U(s) associated with each non-terminal state s.
10. The utility is defined to be the expected sum of (discounted) rewards obtained if policy π is followed :
$$U(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\middle|\, \pi,\ s_0 = s\right]$$
where γ is a discount factor; for the 4 × 3 world we set γ = 1.
Active reinforcement learning :
1. An active agent must decide what actions to take.
2. First, the agent will need to learn a complete model with outcome probabilities for all actions, rather than just the model for the fixed policy.
3. We need to take into account the fact that the agent has a choice of actions.
4. The utilities it needs to learn are those defined by the optimal policy; they obey the Bellman equations :
$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')$$
5. These equations can be solved to obtain the utility function U using the value iteration or policy iteration algorithms.
6. Having obtained a utility function U that is optimal for the learned model, the agent can extract an optimal action by one-step look-ahead to maximize the expected utility.
7. Alternatively, if it uses policy iteration, the optimal policy is already available, so it should simply execute the action the optimal policy recommends.

Que 5.4. What are the different types of reinforcement learning ? Explain.

Answer
Types of reinforcement learning :
1. Positive reinforcement learning :
a. Positive reinforcement learning is defined as when an event, occurring due to a particular behaviour, increases the strength and the frequency of the behaviour.
b. In other words, it has a positive effect on the behaviour.
c. Advantages of positive reinforcement learning are :
i. Maximizes performance.
ii. Sustains change for a long period of time.
d. Disadvantages of positive reinforcement learning :
i. Too much reinforcement can lead to overload of states which can diminish the results.
2. Negative reinforcement learning :
a. Negative reinforcement is defined as strengthening of behaviour because a negative condition is stopped or avoided.
b. Advantages of negative reinforcement learning :
i. Increases behaviour.
ii. It provides defiance to minimum standard of performance.
c. Disadvantages of negative reinforcement learning :
i. It only provides enough to meet up the minimum behaviour.

Que 5.5. What are the elements of reinforcement learning ?

Answer
Elements of reinforcement learning :
1. Policy (π) :
a. It defines the behaviour of the agent, i.e., which action to take in a given state to maximize the received reward in the long term.
b. It consists of stimulus-response rules or associations.
c. It could be a simple lookup table or function, or need more extensive computation (for example, search).
d. It can be probabilistic.
2. Reward function (r) :
a. It defines the goal in a reinforcement learning problem, and maps a state or action to a scalar number, the reward (or reinforcement).
b. The RL agent’s objective is to maximize the total reward it receives in the long run.
c. It defines good and bad events.
d. It cannot be altered by the agent but may inform change of policy.
e. It can be probabilistic (expected reward).
3. Value function (V) :
a. It defines the total amount of reward an agent can expect to accumulate over the future, starting from that state.
b. A state may yield a low reward but have a high value (or the opposite). For example, immediate pain/pleasure vs. long term happiness.
4. Transition model (M) :
a. It defines the transitions in the environment : action a taken in state s1 will lead to state s2.
b. It can be probabilistic.

PART-2
Learning Task, Example of Reinforcement Learning in Practice.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.6. Describe briefly learning task used in machine learning.
Answer
1. A machine learning task is the type of prediction or inference being made, based on the problem or question that is being asked, and the available data.
2. For example, the classification task assigns data to categories, and the clustering task groups data according to similarity.
3. Machine learning tasks rely on patterns in the data rather than being explicitly programmed.
4. Binary classification is a supervised machine learning task that is used to predict which of two classes (categories) an instance of data belongs to.
5. The input of a binary classification algorithm is a set of labeled examples, where each label is an integer of either 0 or 1.
6. The output of a binary classification algorithm is a classifier, which we can use to predict the class of new unlabeled instances.
7. Clustering is an unsupervised machine learning task that is used to group instances of data into clusters that contain similar characteristics.
8. Clustering can also be used to identify relationships in a dataset that we might not logically derive by browsing or simple observation.
9. The inputs and outputs of a clustering algorithm depend on the methodology chosen.

Que 5.7. Explain different machine learning task.
Answer
Following are most common machine learning tasks :
1. Data preprocessing : Before training the models, it is important to prepare data appropriately. As part of data preprocessing, the following is done :
a. Data cleaning
b. Handling missing data
2. Exploratory data analysis : Once data is preprocessed, the next step is to perform exploratory data analysis to understand data distribution and relationship between / within the data.
3. Feature engineering / selection : Feature selection is one of the critical tasks when building machine learning models. Feature selection is important because selecting the right features not only helps build models of higher accuracy but also helps achieve objectives such as building simpler models and reducing overfitting.
4. Regression : Regression tasks deal with estimation of numerical values (continuous variables). Some examples include estimation of housing price, product price, stock price etc.
5. Classification : Classification tasks deal with predicting a category of data (discrete variables). Common examples are predicting whether an email is spam or not, whether a person is suffering from a particular disease or not, whether a transaction is fraud or not, etc.
6. Clustering : Clustering tasks are all about finding natural groupings of data and a label associated with each of these groupings (clusters). Common examples include customer segmentation and product feature identification for a product roadmap.
7. Multivariate querying : Multivariate querying is about querying or finding similar objects.
8. Density estimation : Density estimation problems are related with finding likelihood or frequency of objects.
9. Dimension reduction : Dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
10. Model algorithm / selection : Often, multiple models are trained using different algorithms. One important task is to select the most suitable models for deploying in production.
11. Testing and matching : Testing and matching tasks relate to comparing data sets.

Que 5.8. Explain reinforcement learning with the help of an example.
Answer
1. Reinforcement learning (RL) is learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
2. The software agent is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
For example,
Consider the scenario of teaching new tricks to a cat :
1. As the cat does not understand English or any other human language, we cannot tell her directly what to do. Instead, we follow a different strategy.

2. We emulate a situation, and the cat tries to respond in many different ways. If the cat's response is the desired way, we will give her fish.
3. Now whenever the cat is exposed to the same situation, the cat executes a similar action even more enthusiastically in expectation of getting more reward (food).
4. In this way, the cat learns "what to do" from positive experiences.
5. At the same time, the cat also learns what not to do when faced with negative experiences.
Working of reinforcement learning :
1. In this case, the cat is an agent that is exposed to the environment (in this case, it is your house). An example of a state could be our cat sitting, and we use a specific word to make the cat walk.
2. Our agent reacts by performing an action, a transition from one “state” to another “state.”
3. For example, the cat goes from sitting to walking.
4. The reaction of an agent is an action, and the policy is a method of selecting an action given a state in expectation of better outcomes.
5. After the transition, the agent may get a reward or penalty in return.

PART-3
Learning Models for Reinforcement (Markov Decision Process, Q Learning, Q Learning Function, Q Learning Algorithm), Application of Reinforcement Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.9. Describe important term used in reinforcement learning method.
Answer
Following are the terms used in reinforcement learning :
Agent : It is an assumed entity which performs actions in an environment to gain some reward.
i. Environment (e) : A scenario that an agent has to face.
ii. Reward (R) : An immediate return given to an agent when he or she performs specific action or task.
iii. State (s) : State refers to the current situation returned by the environment.
iv. Policy (π) : It is a strategy which is applied by the agent to decide the next action based on the current state.
v. Value (V) : It is expected long-term return with discount, as compared to the short-term reward.
vi. Value Function : It specifies the value of a state, that is, the total amount of reward an agent should expect to accumulate beginning from that state.
vii. Model of the environment : This mimics the behavior of the environment. It helps to make inferences and to determine how the environment will behave.
viii. Model based methods : These are methods for solving reinforcement learning problems which use a model of the environment.
ix. Q value or action value (Q) : Q value is quite similar to value. The only difference between the two is that it takes an additional parameter, the current action.

Que 5.10. Explain approaches used to implement reinforcement learning algorithm.
Answer
There are three approaches used to implement a reinforcement learning algorithm :
1. Value-Based :
a. In a value-based reinforcement learning method, we should try to maximize a value function V(s). In this method, the agent is expecting a long-term return of the current states under policy π.
2. Policy-based :
a. In a policy-based RL method, we try to come up with such a policy that the action performed in every state helps you to gain maximum reward in the future.
b. Two types of policy-based methods are :
i. Deterministic : For any state, the same action is produced by the policy π.
ii. Stochastic : Every action has a certain probability, which is determined by the following equation for a stochastic policy :
$\pi(a \mid s) = P[A_t = a \mid S_t = s]$
3. Model-Based :
a. In this Reinforcement Learning method, we need to create a virtual model for each environment.
b. The agent learns to perform in that specific environment.
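The difference between a deterministic and a stochastic policy can be sketched in a few lines of Python. This is only an illustrative sketch: the state names ("sitting", "walking"), the action names and the probabilities are assumptions chosen for the cat example, not values given in the text.

```python
import random

# Hypothetical actions for illustration only.
ACTIONS = ["left", "right"]

# Deterministic policy: a plain mapping state -> action.
deterministic_policy = {"sitting": "right", "walking": "left"}

# Stochastic policy: state -> probability of each action, i.e. pi(a | s).
stochastic_policy = {
    "sitting": {"left": 0.2, "right": 0.8},
    "walking": {"left": 0.5, "right": 0.5},
}

def act_deterministic(state):
    """The same state always produces the same action."""
    return deterministic_policy[state]

def act_stochastic(state, rng):
    """Sample an action a with probability pi(a | s)."""
    probs = stochastic_policy[state]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

rng = random.Random(0)
samples = [act_stochastic("sitting", rng) for _ in range(1000)]
print(act_deterministic("sitting"))
print(samples.count("right") / 1000)  # empirical estimate of pi(right | sitting)
```

With many samples, the fraction of "right" actions should approach the assumed probability 0.8, whereas the deterministic policy returns "right" every time.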

3. Reinforcement Learning also provides the learning agent with a reward


Que 5.11. Describe learning models of reinforcement learning. function.
4. It also allows us to figure out the best method for obtaining large rewards.
Answer
1. Reinforcement learning is defined by a specific type of problem, and all Que 5.13. When not to use reinforcement learning ? What are the
its solutions are classed as reinforcement learning algorithms.
challenges of reinforcement learning ?
2. In the problem, an agent is supposed to decide the best action to select
based on his current state. Answer
3. When this step is repeated, the problem is known as a Markov Decision Process.
4. A Markov Decision Process (MDP) model contains :
a. A State S is a set of tokens that represents every state that the agent can be in.
b. A Model (sometimes called Transition Model) gives an action's effect in a state. In particular, T(S, a, S') defines a transition T where being in state S and taking an action 'a' takes us to state S' (S and S' may be same).
We cannot apply a reinforcement learning model in all situations. Following are the conditions when we should not use a reinforcement learning model :
1. When we have enough data to solve the problem with a supervised learning method.
2. When the action space is large, reinforcement learning is computationally heavy and time-consuming.
Challenges we will face while doing reinforcement learning are :
1. Feature/reward design, which can be very involved.
2. Parameters may affect the speed of learning.
c. An Action A is set of all possible actions. A(s) defines the set of actions
3. Realistic environments can have partial observability.
that can be taken being in state S.
4. Too much reinforcement may lead to an overload of states which can
d. A Reward is a real-valued reward function. R(s) indicates the reward for
diminish the results.
simply being in the state S. R(S,a) indicates the reward for being in a
state S and taking an action 'a'. R(S,a,S') indicates the reward for being 5. Realistic environments can be non-stationary.
in a state S, taking an action 'a' and ending up in a state S'.
Que 5.14. Explain the term Q-learning.
e. A Policy is a solution to the Markov Decision Process. A policy is a
mapping from S to a. It indicates the action 'a' to be taken while in state
Answer
S.
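The five components of an MDP listed above (states, transition model, actions, reward and policy) can be written down as plain data structures. The following is a minimal sketch under stated assumptions: the two states, the action names, the transition probabilities and the reward rule are all hypothetical and serve only to show how T(S, a, S'), A(s), R(S, a, S') and a policy fit together.

```python
# Hypothetical two-state MDP, for illustration only.
STATES = ["sitting", "walking"]

# A(s): the set of actions that can be taken in state s.
ACTIONS = {"sitting": ["stay", "walk"], "walking": ["stay", "sit"]}

# Transition model: T[(s, a)] maps each next state s' to P(s' | s, a).
T = {
    ("sitting", "stay"): {"sitting": 1.0},
    ("sitting", "walk"): {"walking": 0.9, "sitting": 0.1},
    ("walking", "stay"): {"walking": 1.0},
    ("walking", "sit"): {"sitting": 1.0},
}

def R(s, a, s_next):
    """Reward R(S, a, S'): +1 for ending up walking after choosing 'walk'."""
    return 1.0 if (a == "walk" and s_next == "walking") else 0.0

# Policy: a mapping from each state S to the action 'a' to take there.
policy = {"sitting": "walk", "walking": "stay"}

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(p * R(s, a, s2) for s2, p in T[(s, a)].items())

for s in STATES:
    print(s, policy[s], expected_reward(s, policy[s]))
```

Because the transition model is probabilistic, the reward of an action is an expectation over the possible next states rather than a single fixed number.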
1. Q-learning is a model-free reinforcement learning algorithm.
Que 5.12. What are the applications of reinforcement learning and why do we use reinforcement learning ?
2. Q-learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (particularly the Bellman equation).
Answer 3. Whereas the other type, policy-based estimates the value function with
Following are the applications of reinforcement learning : a greedy policy obtained from the last policy improvement.
1. Robotics for industrial automation. 4. Q-learning is an off-policy learner i.e., it learns the value of the optimal
2. Business strategy planning. policy independently of the agent’s actions.
3. Machine learning and data processing. 5. On the other hand, an on-policy learner learns the value of the policy
4. It helps us to create training systems that provide custom instruction being carried out by the agent, including the exploration steps and it will
and materials according to the requirement of students. find a policy that is optimal, taking into account the exploration inherent
in the policy.
5. Aircraft control and robot motion control.
Following are the reasons for using reinforcement learning : Que 5.15. Describe Q-learning algorithm process.
1. It helps us to find which situation needs an action.
2. Helps us to discover which action yields the highest reward over the
longer period.

[Fig. 5.16.1. Q learning : a State and an Action are looked up in a Q table to give a Q-value. Deep Q learning : a State is mapped to Q-value action 1, Q-value action 2, ..., Q-value action N.]
4. On a higher level, Deep Q learning works as such :
i. Gather and store samples in a replay buffer with current policy.
ii. Random sample batches of experiences from the replay buffer.
iii. Use the sampled experiences to update the Q network.
iv. Repeat steps i to iii.

Answer
Step 1 : Initialize the Q-table : First the Q-table has to be built. There are n columns, where n = number of actions. There are m rows, where m = number of states.
In our example, the n actions are Go left, Go right, Go up and Go down, and the m states are Start, Idle, Correct path, Wrong path and End. First, let us initialize the values to 0.
Step 2 : Choose an action.
Step 3 : Perform an action : The combination of steps 2 and 3 is performed for an undefined amount of time. These steps run until training is stopped, or until the training loop stops as defined in the code.
a. First, an action (a) in the state (s) is chosen based on the Q-table. Note that, when the episode initially starts, every Q-value should be 0.
b. Then, update the Q-values for being at the start and moving right using the Bellman equation.
Step 4 : Measure reward : Now we have taken an action and observed an outcome and reward.
Step 5 : Evaluate : We need to update the function Q(s, a).
This process is repeated again and again until the learning is stopped. In this way the Q-table is updated and the value function Q is maximized. Here Q returns the expected future reward of that action at that state.
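Steps 1 to 5 of the Q-learning process can be sketched as a short tabular program. This is a minimal sketch, not the text's own example: the corridor environment with five positions, the two actions, and the values of the learning rate (ALPHA), discount factor (GAMMA) and exploration rate (EPSILON) are all assumptions made for illustration, and the action is chosen with an epsilon-greedy rule.

```python
import random

N_STATES = 5             # hypothetical corridor: positions 0..4, position 4 is the goal
ACTIONS = [0, 1]         # 0 = go left, 1 = go right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed learning rate, discount, exploration

def step(state, action):
    """Hypothetical environment: move along the corridor, reward 1 at the goal."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # Step 1: Q-table with m rows (states) and n columns (actions), all zeros.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Step 2: choose an action (epsilon-greedy on the Q-table).
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            # Steps 3 and 4: perform the action, observe the outcome and reward.
            next_state, reward, done = step(state, action)
            # Step 5: update Q(s, a) with the Bellman equation.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = train()
greedy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

After training, the greedy action read off the Q-table in each non-terminal state is expected to be "go right", since that is the only way to reach the rewarded goal in this assumed environment.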
PART-4
Introduction to Deep Q Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.16. Describe deep Q-learning.
Answer
1. In deep Q-learning, we use a neural network to approximate the Q-value function.
2. The state is given as the input and the Q-value of all possible actions is generated as the output.
3. The comparison between Q-learning and deep Q-learning is illustrated in Fig. 5.16.1.

Que 5.17. What are the steps involved in deep Q-learning network ?
Answer
Steps involved in reinforcement learning using deep Q-learning networks :
1. All the past experience is stored by the user in memory.
2. The next action is determined by the maximum output of the Q-network.
3. The loss function here is mean squared error of the predicted Q-value and the target Q-value – Q*. This is basically a regression problem.
4. However, we do not know the target or actual value here as we are dealing with a reinforcement learning problem. Going back to the Q-value update equation derived from the Bellman equation, we have :
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$

Que 5.18. Write pseudocode for deep Q-learning.
Answer
Start with $Q_0(s, a)$ for all s, a.
Get initial state s
For k = 1, 2, … till convergence
    Sample action a, get next state s'

    If s' is terminal :
        target = R(s, a, s')
        Sample new initial state s'
    else target = $R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
    $\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta E_{s' \sim P(s' \mid s, a)} [(Q_\theta(s, a) - \mathrm{target}(s'))^2] \big|_{\theta = \theta_k}$
    $s \leftarrow s'$

PART-5
Genetic Algorithm, Introduction, Components, GA Cycle of Reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Application.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.19. Write short note on Genetic algorithm.
Answer
1. Genetic algorithms are computerized search and optimization algorithms based on the mechanics of natural genetics and natural selection.
2. These algorithms mimic the principle of natural genetics and natural selection to construct search and optimization procedures.
3. Genetic algorithms convert the design space into genetic space. Design space is a set of feasible solutions.
4. Genetic algorithms work with a coding of variables.
5. The advantage of working with a coding of variables is that coding discretizes the search space even though the function may be continuous.
6. Search space is the space for all possible feasible solutions of particular problem.
7. Following are the benefits of Genetic algorithm :
a. They are robust.
b. They provide optimization over large state space.
c. They do not break on slight change in input or presence of noise.
8. Following are the applications of Genetic algorithm :
a. Recurrent neural network
b. Mutation testing
c. Code breaking
d. Filtering and signal processing
e. Learning fuzzy rule base

Que 5.20. Write procedure of Genetic algorithm with advantages and disadvantages.
Answer
Procedure of Genetic algorithm :
1. Generate a set of individuals as the initial population.
2. Use genetic operators such as selection or cross over.
3. Apply mutation or digital reverse if necessary.
4. Evaluate the fitness function of the new population.
5. Use the fitness function for determining the best individuals and replace predefined members from the original population.
6. Iterate steps 2–5 and terminate when some predefined population threshold is met.
Advantages of genetic algorithm :
1. Genetic algorithms can be executed in parallel. Hence, genetic algorithms are faster.
2. It is useful for solving optimization problems.
Disadvantages of Genetic algorithm :
1. Identification of the fitness function is difficult as it depends on the problem.
2. The selection of suitable genetic operators is difficult.

Que 5.21. Explain different phases of genetic algorithm.
Answer
Different phases of genetic algorithm are :
1. Initial population :
a. The process begins with a set of individuals which is called a population.
b. Each individual is a solution to the problem we want to solve.
c. An individual is characterized by a set of parameters (variables) known as genes.
d. Genes are joined into a string to form a chromosome (solution).
e. In a genetic algorithm, the set of genes of an individual is represented using a string.

f. Usually, binary values are used (string of 1s and 0s).
A1  0 0 0 0 0 0   (each bit is a gene)
A2  1 1 1 1 1 1   (each row is a chromosome)
A3  1 0 1 0 1 1
A4  1 1 0 1 1 0   (A1 to A4 together form the population)
2. Fitness function :
a. The fitness function determines how fit an individual is (the ability of an individual to compete with other individuals).
b. It gives a fitness score to each individual.
c. The probability that an individual will be selected for reproduction is based on its fitness score.
3. Selection :
a. The idea of selection phase is to select the fittest individuals and let them pass their genes to the next generation.
b. Two pairs of individuals (parents) are selected based on their fitness scores.
c. Individuals with high fitness have more chance to be selected for reproduction.
4. Crossover :
a. Crossover is the most significant phase in a genetic algorithm.
b. For each pair of parents to be mated, a crossover point is chosen at random from within the genes.
c. For example, consider the crossover point to be 3 as shown :
A1  0 0 0 | 0 0 0
A2  1 1 1 | 1 1 1
(| marks the crossover point)
d. Offspring are created by exchanging the genes of parents among themselves until the crossover point is reached.
A1  0 0 0 0 0 0
A2  1 1 1 1 1 1
e. The new offspring are added to the population.
A5  1 1 1 0 0 0
A6  0 0 0 1 1 1
5. Mutation :
a. When new offspring are formed, some of their genes can be subjected to a mutation with a low random probability.
b. This implies that some of the bits in the bit string can be flipped.
Before mutation
A5  1 1 1 0 0 0
After mutation
A5  1 1 0 1 1 0
c. Mutation occurs to maintain diversity within the population and prevent premature convergence.
6. Termination :
a. The algorithm terminates if the population has converged (does not produce offspring which are significantly different from the previous generation).
b. Then it is said that the genetic algorithm has provided a set of solutions to our problem.

Que 5.22. Draw a flowchart of GA and explain the working principle.
Answer
Genetic algorithm : Refer Q. 1.24, Page 1–23L, Unit-1.
Working principle :
1. To illustrate the working principle of GA, we consider an unconstrained optimization problem.
2. Let us consider the following maximization problem :
maximize $f(X)$
$X_i^{(L)} \le X_i \le X_i^{(U)}$ for $i = 1, 2, \ldots, N$,

3. If we want to minimize f(X), for f(X) > 0, then we can write the objective function as :
maximize $\frac{1}{1 + f(X)}$
4. If f(X) < 0, instead of minimizing f(X), maximize {–f(X)}. Hence, both maximization and minimization problems can be handled by GA.
2. Definition of representation for the problem.
3. Premature convergence occurs.
4. The problem of choosing the various parameters like the size of the population, mutation rate, crossover rate, the selection method and its strength.
5. Cannot use gradients.
6. Cannot easily incorporate problem specific information.
Que 5.23. Write short notes on procedures of GA.
7. Not good at identifying local optima.
Answer 8. No effective terminator.
1. Start : Generate random population of n chromosomes. 9. Not effective for smooth unimodal functions.
2. Fitness : Evaluate the fitness f(x) of each chromosome x in the 10. Needs to be coupled with a local search technique.
population.
Que 5.25. Write short notes of genetic representations.
3. New population : Create a new population by repeating following
steps until the new population is complete.
Answer
a. Selection : Select two parent chromosomes from a population
according to their fitness. 1. Genetic representation is a way of representing solutions/individuals in
evolutionary computation methods.
b. Crossover : With a crossover probability crossover the parents
to form new offspring (children). If no crossover was performed, 2. Genetic representation can encode appearance, behavior, physical
offspring is the exact copy of parents. qualities of individuals.

c. Mutation : With a mutation probability mutate new offspring at 3. All the individuals of a population are represented by using binary
each locus (position in chromosome). encoding, permutational encoding, encoding by tree.

d. Accepting : Place new offspring in the new population. 4. Genetic algorithms use linear binary representations. The most standard
method of representation is an array of bits.
4. Replace : Use new generated population for a further run of the
algorithm. 5. These genetic representations are convenient because parts of individual
are easily aligned due to their fixed size which makes simple crossover
5. Test : If the end condition is satisfied, stop, and return the best solution operation.
in current population.
6. Go to step 2 Que 5.26. Give the detail of genetic representation (Encoding).
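The procedure of Que 5.23 (start, fitness, new population by selection, crossover, mutation and accepting, replace, test, loop) can be sketched as a small program. This is a minimal sketch under stated assumptions: the objective (counting the 1 bits of a binary chromosome), the population size, the probabilities p_cross and p_mut, and the generation limit are all hypothetical choices, and the best chromosome seen so far is simply remembered rather than re-inserted into the population.

```python
import random

def fitness(chrom):
    """Hypothetical objective: number of 1 bits, so the optimum is all ones."""
    return sum(chrom)

def select(population, rng):
    """Selection: fitness-proportional choice (+1 keeps every weight positive)."""
    weights = [fitness(c) + 1 for c in population]
    return rng.choices(population, weights=weights, k=1)[0]

def crossover(p1, p2, p_cross, rng):
    """Crossover with probability p_cross, otherwise offspring are exact copies."""
    if rng.random() < p_cross:
        point = rng.randint(1, len(p1) - 1)
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1[:], p2[:]

def mutate(chrom, p_mut, rng):
    """Mutation: flip each locus independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in chrom]

def run_ga(n=20, length=16, p_cross=0.8, p_mut=0.02, max_gen=200, seed=1):
    rng = random.Random(seed)
    # 1. Start: random population of n chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    best = max(population, key=fitness)
    for _ in range(max_gen):
        # 5. Test: stop when the end condition is satisfied.
        if fitness(best) == length:
            break
        # 3. New population: selection, crossover, mutation, accepting.
        new_population = []
        while len(new_population) < n:
            p1, p2 = select(population, rng), select(population, rng)
            for child in crossover(p1, p2, p_cross, rng):
                new_population.append(mutate(child, p_mut, rng))
        # 4. Replace: use the new population for the next run of the loop.
        population = new_population[:n]
        # 2. Fitness: evaluate and remember the best solution found so far.
        best = max(population + [best], key=fitness)
    return best

best = run_ga()
print(best, fitness(best))
```

The end condition used here (all bits equal to 1, or a generation limit) is problem specific; for a real problem it would be replaced by whatever stopping rule fits the fitness function.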
Que 5.24. What are the benefits of using GA ? What are its OR
Explain different types of encoding in genetic algorithm.
limitations ?
Answer
Answer
Genetic representations :
Benefits of using GA :
1. Encoding :
1. It is easy to understand.
a. Encoding is a process of representing individual genes.
2. It is modular and separate from application.
b. The process can be performed using bits, numbers, trees, arrays,
3. It supports multi-objective optimization. lists or any other objects.
4. It is good for noisy environment. c. The encoding depends mainly on solving the problem.
Limitations of genetic algorithm are : 2. Binary encoding :
1. The problem of identifying fitness function. a. Binary encoding is the most commonly used method of genetic
representation because GA uses this type of encoding.

b. In binary encoding, every chromosome is a string of bits, 0 or 1.
Chromosome A    101100101100101011100101
Chromosome B    111111100000110000011111
c. Binary encoding gives many possible chromosomes.
3. Octal or Hexadecimal encoding :
a. The encoding is done using octal or hexadecimal numbers.
Chromosome      Octal       Hexadecimal
Chromosome A    54545345    B2CAE5
Chromosome B    77406037    FE0C1F
4. Permutation encoding (real number encoding) :
a. Permutation encoding can be used in ordering problems, such as Travelling Salesman Problem (TSP).
b. In permutation encoding, every chromosome is a string of numbers, which represents number in a sequence.
Chromosome A    1 5 3 2 6 4 7 9 8
Chromosome B    8 5 6 7 2 3 1 4 9
5. Value encoding :
a. Direct value encoding can be used in problems, where some complicated values, such as real numbers, are used.
b. In value encoding, every chromosome is a string of some values.
c. Values can be anything connected to problem, real numbers or chars to some complicated objects.
Chromosome A    1.2324 5.3243 0.4556 2.3293 2.4545
Chromosome B    ABDJEIFJDHDIERJFDLDFLFEGT
Chromosome C    (back), (back), (right), (forward), (left)
[Figure : tree encoding. Chromosome A is the tree for (+ x (/ 5 y)); Chromosome B is the tree for (do_until step wall).]

Que 5.27. Explain different methods of selection in genetic algorithm in order to select a population for next generation.
Answer
The various methods of selecting chromosomes for parents to cross over are :
a. Roulette-wheel selection :
i. Roulette-wheel selection is the proportionate reproductive method where a string is selected from the mating pool with a probability proportional to the fitness.
ii. Thus, ith string in the population is selected with a probability proportional to Fi where Fi is the fitness value for that string.
iii. Since the population size is usually kept fixed in Genetic Algorithm, the sum of the probabilities of each string being selected for the mating pool must be one.
iv. The probability of the ith selected string is
$p_i = \frac{F_i}{\sum_{j=1}^{n} F_j}$
where ‘n’ is the population size.
v. The average fitness is
$\bar{F} = \sum_{j=1}^{n} F_j / n$ ...(5.27.1)
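The selection probability p_i and the average fitness of equation (5.27.1) can be computed directly, and one spin of the wheel can be simulated with a cumulative sum. This is an illustrative sketch: the population labels and fitness values are hypothetical, and the function names are chosen here for readability.

```python
import random

def selection_probabilities(fitnesses):
    """p_i = F_i / sum_j F_j for every string in the population."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def average_fitness(fitnesses):
    """Average fitness: sum_j F_j / n, as in equation (5.27.1)."""
    return sum(fitnesses) / len(fitnesses)

def roulette_select(population, fitnesses, rng):
    """Spin the wheel once: pick an individual with probability p_i."""
    spin = rng.uniform(0, sum(fitnesses))
    cumulative = 0.0
    for individual, f in zip(population, fitnesses):
        cumulative += f
        if spin <= cumulative:
            return individual
    return population[-1]   # guard against floating-point round-off

population = ["A", "B", "C", "D"]      # hypothetical strings
fitnesses = [1.0, 2.0, 3.0, 4.0]       # hypothetical fitness values F_i

probs = selection_probabilities(fitnesses)
rng = random.Random(0)
picks = [roulette_select(population, fitnesses, rng) for _ in range(10000)]
counts = {name: picks.count(name) for name in population}
print(probs, average_fitness(fitnesses), counts)
```

The probabilities sum to one, as required when the population size is fixed, and strings with larger fitness are drawn more often over many spins.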
6. Tree encoding :
a. Tree encoding is used for evolving programs or expressions, for genetic programming.
b. In tree encoding, every chromosome is a tree of some objects, such as functions or commands in programming language.
c. Programming language LISP is often used for this, because programs in it are represented in this form and can be easily parsed as a tree, so the cross-over and mutation can be done relatively easily.
b. Boltzmann selection :
i. Boltzmann selection uses the concept of simulated annealing.
ii. Simulated annealing is a method of functional minimization or maximization.
iii. This method simulates the process of slow cooling of molten metal to achieve the minimum function value in a minimization problem.
iv. The cooling phenomenon is simulated by controlling a temperature so that a system in thermal equilibrium at a temperature T has its energy distributed probabilistically according to
$P(E) = \exp\left(-\frac{E}{kT}\right)$ ...(5.27.2)
where ‘k’ is Boltzmann constant.

v. This expression suggests that a system at a high temperature has almost uniform probability of being at any energy state, but at a low temperature it has a small probability of being at a high energy state.
vi. Therefore, by controlling the temperature T and assuming search process follows Boltzmann probability distribution, the convergence of the algorithm is controlled.
c. Tournament selection :
i. GA uses a strategy to select the individuals from population and insert them into a mating pool.
ii. A selection strategy in GA is a process that favours the selection of better individuals in the population for the mating pool.
iii. There are two important issues in the evolution process of genetic search.
1. Population diversity : Population diversity means that the genes from the already discovered good individuals are exploited.
2. Selective pressure : Selective pressure is the degree to which the better individuals are favoured.
iv. The higher the selective pressure the better individuals are favoured.
d. Rank selection :
i. Rank selection first ranks the population, and then every chromosome receives fitness from this ranking.
ii. The worst will have fitness 1, the next 2, ..., and the best will have fitness N (N is the number of chromosomes in the population).
iii. The method can lead to slow convergence because the best chromosome does not differ so much from the other.
e. Steady-state selection :
i. The main idea of the selection is that a big part of the chromosomes should survive to the next generation.
ii. GA works in the following way :
1. In every generation a few chromosomes are selected for creating new offspring.
2. Then, some chromosomes are removed and new offspring is placed in that place.
3. The rest of population survives to the new generation.

Que 5.28. Differentiate between Roulette-wheel based on fitness and Roulette-wheel based on rank with suitable example.
Answer
Difference :
S. No. | Roulette-wheel based on fitness | Roulette-wheel based on rank
1. | Population is selected with a probability that is directly proportional to their fitness values. | Probability of a population being selected is based on its fitness rank.
2. | It computes selection probabilities according to their fitness values but does not sort the individuals in the population. | It first sorts individuals in the population according to their fitness and then computes selection probabilities according to their ranks rather than fitness values.
3. | It gives a chance to all the individuals in the population to be selected. | It selects the individuals with highest rank in the population.
4. | Diversity in the population is preserved. | Diversity in the population is not preserved.
Example :
1. Imagine a Roulette-wheel where all chromosomes in the population are placed, each chromosome has its place accordingly to its fitness function :
[Fig. 5.28.1. Roulette-wheel selection (sectors : Chromosomes 1, Chromosomes 2, Chromosomes 3, Chromosomes 4).]
2. When the wheel is spun, the wheel will finally stop and the pointer attached to it will point to one of the chromosomes; chromosomes with bigger fitness value are more likely to be chosen.
3. The difference between roulette-wheel selection based on fitness and rank is shown in Fig. 5.28.1 and Fig. 5.28.3.
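The two ways of sizing the wheel can be compared numerically. The sketch below follows the rank selection rule stated earlier (the worst chromosome receives fitness 1 and the best receives fitness N) and contrasts it with fitness-proportional sizing. The four fitness values are hypothetical, chosen so that one chromosome dominates before ranking.

```python
def fitness_probabilities(fitnesses):
    """Wheel based on fitness: share of the wheel is F_i / sum_j F_j."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def rank_probabilities(fitnesses):
    """Wheel based on rank: worst gets rank 1, best gets rank N; ranks replace fitness."""
    order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
    ranks = [0] * len(fitnesses)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = sum(ranks)
    return [r / total for r in ranks]

# Hypothetical fitness values for Chromosomes 1 to 4.
fitnesses = [90.0, 5.0, 3.0, 2.0]

print(fitness_probabilities(fitnesses))  # situation before ranking
print(rank_probabilities(fitnesses))     # situation after ranking
```

Before ranking, the first chromosome occupies 90 per cent of the wheel; after ranking, the sectors depend only on the order of the chromosomes (4, 3, 2 and 1 parts out of 10), not on how far apart their fitness values are.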

[Fig. 5.28.2. Situation before ranking (graph of fitnesses) : sectors Chromosomes 1, Chromosomes 2, Chromosomes 3, Chromosomes 4.]
[Fig. 5.28.3. Situation after ranking (graph of order numbers) : sectors Chromosomes 1, Chromosomes 2, Chromosomes 3, Chromosomes 4.]

Que 5.29. Draw genetics cycle for genetic algorithm.
Answer
Generational cycle of GA :
[Fig. 5.29.1. The GA cycle : Population (Chromosomes), Evaluation (Fitness), Selection (Mating pool), Parents, Mate, Reproduction, Genetic operator, Manipulation, Offsprings, Decoded string, New generation.]
Components of generational cycle in GA :
1. Population (Chromosomes) : A population is collection of individuals. A population consists of a number of individuals being tested, the phenotype parameters defining the individuals and some information about search space.
2. Evaluation (Fitness) : A fitness function is a particular type of objective function that quantifies the optimality of a solution (i.e., a chromosome) in a genetic algorithm so that particular chromosome may be ranked against all the other chromosomes.
3. Selection : During each successive generation, a proportion of the existing population is selected to breed a new generation. Individual solutions are selected through a fitness-based process.
4. Genetic operator : A genetic operator is an operator used in genetic algorithm to guide the algorithm towards a solution to a given problem.

Que 5.30. Why mutation is done in genetic algorithm ? Explain types of mutation.
Answer
Mutation is done in genetic algorithm because :
1. It maintains genetic diversity from one generation of a population of genetic algorithm chromosomes to the next.
2. GA can give better solution of the problem by using mutation.
Types of mutation :
1. Bit string mutation : The mutation of bit strings occurs through bit flips at random positions.
Example : 1 0 1 0 0 1 0 becomes 1 0 1 0 1 1 0
The probability of a mutation of a bit is 1 / l, where l is the length of the binary vector. Thus, a mutation rate of 1 per mutation and individual selected for mutation is reached.
2. Flip bit : This mutation operator takes the chosen genome and inverts the bits (i.e., if the genome bit is 1, it is changed to 0 and vice versa).
3. Boundary : This mutation operator replaces the genome with either lower or upper bound randomly. This can be used for integer and float genes.
4. Non-uniform : The probability that amount of mutation will go to 0 with the next generation is increased by using non-uniform mutation operator. It keeps the population from stagnating in the early stages of the evolution.
5. Uniform : This operator replaces the value of the chosen gene with a uniform random value selected between the user-specified upper and lower bounds for that gene.
6. Gaussian : This operator adds a unit Gaussian distributed random value to the chosen gene. If it falls outside of the user-specified lower or upper bounds for that gene, the new gene value is clipped.
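Several of the mutation operators described above are short enough to sketch directly. This is an illustrative sketch only: the function names and the bounds used in the usage lines are assumptions, and the non-uniform operator is omitted because the text does not give its schedule.

```python
import random

def bit_string_mutation(bits, rng):
    """Flip each bit independently with probability 1 / l (l = length of the vector)."""
    p = 1.0 / len(bits)
    return [1 - b if rng.random() < p else b for b in bits]

def flip_bit(bits):
    """Invert every bit of the chosen genome (1 becomes 0 and vice versa)."""
    return [1 - b for b in bits]

def boundary(lower, upper, rng):
    """Replace the gene with either the lower or the upper bound, at random."""
    return rng.choice([lower, upper])

def uniform(lower, upper, rng):
    """Replace the gene with a uniform random value between the bounds."""
    return rng.uniform(lower, upper)

def gaussian(gene, lower, upper, rng):
    """Add a unit Gaussian random value, then clip the result to the bounds."""
    return min(upper, max(lower, gene + rng.gauss(0.0, 1.0)))

rng = random.Random(0)
print(bit_string_mutation([1, 0, 1, 0, 0, 1, 0], rng))
print(flip_bit([1, 0, 1, 0, 0, 1, 0]))
print(boundary(-5.0, 5.0, rng), uniform(-5.0, 5.0, rng), gaussian(4.8, -5.0, 5.0, rng))
```

With a per-bit probability of 1 / l, one bit is flipped on average per chromosome, which is the mutation rate mentioned for bit string mutation.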
Machine Learning Techniques 5–27 L (CS/IT-Sem-5) Reinforcement Learning & Genetic Algorithm 5–28 L (CS/IT-Sem-5)

Que 5.31. What is the main function of crossover operation in genetic algorithm ?

Answer
1. Crossover is the basic operator of genetic algorithm. Performance of genetic algorithm depends on crossover operator.
2. Type of crossover operator used for a problem depends on the type of encoding used.
3. The basic principle of crossover process is to exchange genetic material of two parents beyond the crossover points.
Function of crossover operation/operator in genetic algorithm :
1. The main function of crossover operator is to introduce diversity in the population.
2. Specific crossover made for a specific problem can improve performance of the genetic algorithm.
3. Crossover combines parental solutions to form offspring with a hope to produce better solutions.
4. Crossover operators are critical in ensuring good mixing of building blocks.
5. Crossover is used to maintain balance between exploitation and exploration. The exploitation and exploration techniques are responsible for the performance of genetic algorithms. Exploitation means using the already existing information to find a better solution, and exploration means investigating new and unknown solutions in the search space.

Que 5.32. Discuss the different applications of genetic algorithms.

Answer
Application of GA :
1. Optimization : Genetic Algorithms are most commonly used in optimization problems wherein we have to maximize or minimize a given objective function value under a given set of constraints.
2. Economics : GAs are also used to characterize various economic models like the cobweb model, game theory equilibrium resolution, asset pricing, etc.
3. Neural networks : GAs are also used to train neural networks, particularly recurrent neural networks.
4. Parallelization : GAs also have very good parallel capabilities, prove to be a very effective means of solving certain problems, and also provide a good area for research.
5. Image processing : GAs are used for various digital image processing (DIP) tasks as well, like dense pixel matching.
6. Machine learning : Genetics based machine learning (GBML) is a niche area in machine learning.
7. Robot trajectory generation : GAs have been used to plan the path which a robot arm takes by moving from one point to another.

Que 5.33. Explain optimization of travelling salesman problem using genetic algorithm and give a suitable example too.

Answer
1. The TSP consists of a number of cities, where each pair of cities has a corresponding distance.
[Flowchart : Start, Set GA parameters, Generate initial random population, Evaluate fitness of each chromosome in the population, Are optimization termination criteria met ? If Yes : Best chromosome, End. If No : Parents selection for next generation, Crossover of parents chromosome, Mutation of chromosome, New population, and back to fitness evaluation.]
Fig. 5.33.1. Genetic algorithm procedure for TSP.

2. The aim is to visit all the cities such that the total distance travelled is minimized.
3. A solution, and therefore a chromosome which represents that solution to the TSP, can be given as an order, that is, a path, of the cities.
4. The procedure for solving TSP can be viewed as a process flow given in Fig. 5.33.1.
5. The GA process starts by supplying important information such as location of the cities, maximum number of generations, population size, probability of crossover and probability of mutation.
6. An initial random population of chromosomes is generated and the fitness of each chromosome is evaluated.
7. The population is then transformed into a new population (the next generation) using three genetic operators : selection, crossover and mutation.
8. The selection operator is used to choose two parents from the current generation in order to create a new child by crossover and/or mutation.
9. The new generation contains a higher proportion of the characteristics possessed by the good members of the previous generation and in this way good characteristics are spread over the population and mixed with other good characteristics.
10. After each generation, a new set of chromosomes whose size is equal to the initial population size is evolved.
11. This transformation process from one generation to the next continues until the population converges to the optimal solution, which usually occurs when a certain percentage of the population (for example 90 %) has the same optimal chromosome; the best individual is then taken as the optimal solution.

Que 5.34. Write short notes on convergence of genetic algorithm.

Answer
1. A genetic algorithm is usually said to converge when there is no significant improvement in the values of fitness of the population from one generation to the next.
2. One criterion for convergence may be such that when a fixed percentage of columns and rows in population matrix becomes the same, it can be assumed that convergence is attained. The fixed percentage may be 80% or 85%.
3. In genetic algorithms as we proceed with more generations, there may not be much improvement in the population fitness and the best individual may not change for subsequent populations.
4. As the generation progresses, the population gets filled with more fit individuals with only slight deviation from the fitness of best individuals so far found, and the average fitness comes very close to the fitness of the best individuals.
5. The convergence criteria can be explained from schema point of view.
6. A schema is a similarity template describing a subset of strings with similarities at certain positions. A schema represents a subset of all possible strings that have the same bits at certain string positions.
7. Since a schema represents a subset of strings, we can associate a fitness value with a schema, i.e., the average fitness of the schema.
8. One can visualize GA's search for the optimal strings as a simultaneous competition among schemata to increase the number of their instances in the population.
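The TSP procedure of Que 5.33 and the convergence test of Que 5.34 can be sketched together in Python. The sketch below is a minimal illustration under stated assumptions, not the book's own implementation: tournament selection, order crossover, swap mutation and keeping the best tour unchanged (elitism) are choices made for the example, and all function and parameter names are hypothetical. A chromosome is a permutation of city indices, and the run stops either at `max_generations` or when a fraction `same_fraction` of the population equals the best tour.

```python
import math
import random

def tour_length(tour, cities):
    """Total length of the closed tour (fitness to be minimized)."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def tournament(population, cities, rng, k=3):
    """Selection: pick k random tours and return the shortest one."""
    return min(rng.sample(population, k), key=lambda t: tour_length(t, cities))

def order_crossover(p1, p2, rng):
    """Copy a slice from p1, then fill the gaps in the order cities appear in p2."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = p1[a:b + 1]
    kept = set(child[a:b + 1])
    rest = iter(c for c in p2 if c not in kept)
    return [gene if gene is not None else next(rest) for gene in child]

def swap_mutation(tour, rng):
    """Exchange two randomly chosen cities."""
    i, j = rng.sample(range(len(tour)), 2)
    out = list(tour)
    out[i], out[j] = out[j], out[i]
    return out

def solve_tsp(cities, pop_size=60, max_generations=300, p_crossover=0.9,
              p_mutation=0.2, same_fraction=0.9, seed=0):
    rng = random.Random(seed)
    n = len(cities)
    # Initial random population of permutations.
    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(max_generations):
        best = min(population, key=lambda t: tour_length(t, cities))
        # Convergence: a fixed share of the population equals the best tour.
        if sum(1 for t in population if t == best) >= same_fraction * pop_size:
            break
        new_population = [best]  # elitism (an assumption of this sketch)
        while len(new_population) < pop_size:
            p1 = tournament(population, cities, rng)
            p2 = tournament(population, cities, rng)
            child = order_crossover(p1, p2, rng) if rng.random() < p_crossover else list(p1)
            if rng.random() < p_mutation:
                child = swap_mutation(child, rng)
            new_population.append(child)
        population = new_population
    return min(population, key=lambda t: tour_length(t, cities))
```

Order crossover is used here because a permutation encoding cannot use plain one-point crossover without producing tours that repeat or omit cities, which illustrates the point that the crossover operator depends on the encoding.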
Machine Learning Techniques SQ–1 L (CS/IT-Sem-5) 2 Marks Questions SQ–2 L (CS/IT-Sem-5)

1
Introduction
(2 Marks Questions)

1.1. Define machine learning.
Ans. Machine learning is an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

1.2. What are the different types of machine learning algorithm ?
Ans. Different types of machine learning algorithm are :
1. Supervised machine learning algorithm
2. Unsupervised machine learning algorithm
3. Semi-supervised machine learning algorithm
4. Reinforcement machine learning algorithm

1.3. What are the applications of machine learning ?
Ans. Applications of machine learning are :
1. Image recognition
2. Speech recognition
3. Medical diagnosis
4. Statistical arbitrage
5. Learning association

1.4. What are the advantages of machine learning ?
Ans. Advantages of machine learning :
1. Easily identifies trends and patterns.
2. No human intervention is needed.
3. Continuous improvement.
4. Handling multi-dimensional and multi-variety data.

1.5. What are the disadvantages of machine learning ?
Ans. Disadvantages of machine learning :
1. Data acquisition
2. Time and resources
3. Interpretation of results
4. High error-susceptibility

1.6. What is the role of machine learning in human life ?
Ans. Role of machine learning in human life :
1. Learning
2. Reasoning
3. Problem solving
4. Language understanding

1.7. What are the components of machine learning system ?
Ans. Components of machine learning system are :
1. Sensing
2. Segmentation
3. Feature extraction
4. Classification
5. Post processing

1.8. What are the classes of problem in machine learning ?
Ans. Classes of problem in machine learning are :
1. Classification
2. Regression
3. Clustering
4. Rule extraction

1.9. What are the issues related with machine learning ?
Ans. Issues related with machine learning are :
1. Data quality
2. Transparency
3. Traceability
4. Reproduction of results

1.10. Define supervised learning.
Ans. Supervised learning is also known as associative learning, in which the network is trained by providing it with input and matching output patterns.

1.11. Define unsupervised learning.
Ans. Unsupervised learning is also known as self-organization, in which an output unit is trained to respond to clusters of pattern within the input.

1.12. Define well defined learning problem.
Ans. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

1.13. What are the features of learning problems ?
Ans. Features of learning problems are :
1. The class of tasks (T).
2. The measure of performance to be improved (P).
3. The source of experience (E).

1.14. Define decision tree learning.
Ans. Decision tree learning is one of the predictive modeling approaches used in statistics, data mining and machine learning. It uses a decision tree to go from observations about an item to conclusions about the item's target values.

1.15. What is decision tree ?
Ans. A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs and utility.

1.16. What are the types of decision tree ?
Ans. There are two types of decision tree :
1. Classification tree
2. Regression tree

1.17. Define classification tree and regression tree.
Ans. Classification tree : A classification tree is an algorithm where the target variable is fixed (categorical). This algorithm is used to identify the class within which a target variable would fall.
Regression tree : A regression tree is an algorithm where the target variable is not fixed (continuous) and this algorithm is used to predict its value.

1.18. Name different decision tree algorithms.
Ans. Different decision tree algorithms are :
1. ID3
2. C4.5
3. CART

1.19. What are the issues related with the decision tree ?
Ans. Issues related with decision tree are :
1. Missing data
2. Multi-valued attribute
3. Continuous and integer valued input attributes
4. Continuous-valued output attributes

1.20. What are the attribute selection measures used in decision tree ?
Ans. Attribute selection measures used in decision tree are :
1. Entropy
2. Information gain
3. Gain ratio
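The attribute selection measures named in Que 1.20 (entropy, information gain and gain ratio) can be computed as in the short Python sketch below. It assumes a single categorical attribute and categorical class labels held in plain lists; the function names are illustrative and not taken from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(values, labels):
    """Group the class labels by the value of the attribute."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return groups

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g)
                    for g in partition(values, labels).values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0
```

For example, an attribute that separates the two classes perfectly, such as values `["a", "a", "b", "b"]` against labels `["yes", "yes", "no", "no"]`, has an information gain of 1 bit. ID3 selects attributes by information gain, while C4.5 uses gain ratio to reduce the bias towards attributes with many values.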

2
Regression
(2 Marks Questions)

2.1. Define the term regression.
Ans. Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable and a series of other variables (known as independent variables).

2.2. What are the types of regression ?
Ans. Following are the types of regression :
1. Linear regression
2. Logistic regression

2.3. Define logistic regression.
Ans. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of target or dependent variable is dichotomous, which means there would be only two possible classes.

2.4. What are the types of logistic regression ?
Ans. Following are the types of logistic regression :
1. Binary or Binomial logistic regression
2. Multinomial logistic regression
3. Ordinal logistic regression

2.5. Define Bayesian decision theory.
Ans. Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on quantifying the tradeoffs between various classification decisions using probability and costs that accompany such decisions.

2.6. Define Bayesian belief network.
Ans. Bayesian belief networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.

2.7. Define EM algorithm.
Ans. The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for model parameters when the data is incomplete or has some missing data points or has some hidden variables.

2.8. What are the usages of EM algorithm ?
Ans. Usages of EM algorithm are :
1. It can be used to fill the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
4. It can be used for discovering the values of latent variables.

2.9. What are the advantages of EM algorithm ?
Ans. Advantages of EM algorithm are :
1. The likelihood is guaranteed not to decrease with each iteration.
2. The E-step and M-step are easy to implement.
3. Solutions to the M-steps exist in the closed form.

2.10. What are the disadvantages of EM algorithm ?
Ans. Disadvantages of EM algorithm are :
1. It has slow convergence.
2. It converges only to a local optimum.
3. It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).

2.11. Define support vector machine.
Ans. A support vector machine is a supervised machine learning algorithm that analyzes data for classification and regression analysis.

2.12. What are the types of support vector machine ?
Ans. Types of support vector machine are :
1. Linear support vector machine
2. Non-linear support vector machine

2.13. What are the applications of SVM ?
Ans. Applications of SVM :
1. Text and hypertext classification
2. Image classification
3. Recognizing handwritten characters
4. Biological sciences, including protein classification

3
Decision Tree Learning
(2 Marks Questions)

3.1. What is instance-based learning ?
Ans. Instance-Based Learning (IBL) is an extension of nearest neighbour or KNN classification algorithms. IBL algorithms do not maintain a set of abstractions of the model created from the instances.

3.2. What are the advantages of KNN algorithm ?
Ans. Advantages of KNN algorithm are :
1. No training period.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly, which will not impact the accuracy of the algorithm.
3. KNN is easy to implement.

3.3. What are the disadvantages of KNN algorithm ?
Ans. Disadvantages of KNN algorithm are :
1. It does not work well with large datasets.
2. It does not work well with high dimensions.
3. It needs feature scaling.
4. It is sensitive to noisy data, missing values and outliers.

3.4. Define locally weighted regression.
Ans. Locally Weighted Regression (LWR) is a memory-based method that performs a regression around a point of interest using training data that are local to that point.

3.5. Define radial basis function.
Ans. A Radial Basis Function (RBF) is a real-valued function whose value depends only on the distance between the input and a fixed centre. That distance is an absolute value, i.e., it cannot be negative.

3.6. Define case-based learning algorithms.
Ans. Case-based learning algorithms take as input a sequence of training cases and produce as output a concept description, which can be
used to generate predictions of goal feature values for subsequently presented cases.

3.7. What are the disadvantages of CBL (Case-Based Learning) ?
Ans. Disadvantages of case-based learning algorithms :
1. They are computationally expensive because they save and compute similarities to all training cases.
2. They are intolerant of noise and irrelevant features.
3. They are sensitive to the choice of the algorithm's similarity function.
4. There is no simple way for them to process symbolic-valued features.

3.8. What are the functions of CBL ?
Ans. Functions of case-based learning algorithm are :
1. Pre-processor
2. Similarity
3. Prediction
4. Memory updating

3.9. What are the processing stages of CBL ?
Ans. Case-based learning algorithm processing stages are :
1. Case retrieval
2. Case adaptation
3. Solution evaluation
4. Case-base updating

3.10. What are the benefits of CBL as lazy problem solving method ?
Ans. The benefits of CBL as a lazy problem solving method are :
1. Ease of knowledge elicitation.
2. Absence of problem-solving bias.
3. Incremental learning.
4. Suitability for complex and not-fully formalized solution spaces.
5. Suitability for sequential problem solving.
6. Ease of explanation.
7. Ease of maintenance.

3.11. What are the applications of CBL ?
Ans. Applications of CBL :
1. Interpretation
2. Classification
3. Design
4. Planning
5. Advising

3.12. What are the advantages of instance-based learning ?
Ans. Advantages of instance-based learning :
1. Learning is trivial
2. Works efficiently
3. Noise resistant
4. Rich representation, arbitrary decision surfaces
5. Easy to understand

3.13. What are the disadvantages of instance-based learning ?
Ans. Disadvantages of instance-based learning :
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ R^n.
4. Implicit weights of attributes (need normalization).
5. Need large space for storage i.e., require large memory.
6. Expensive application time.
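Instance-based learning with KNN, as discussed in this unit, can be illustrated with the Python sketch below. It is a minimal example under assumptions: Euclidean distance, majority voting and min-max scaling are choices made here (the text only says that feature scaling or normalization is needed), and the function names are hypothetical. The sketch also shows why application time is expensive: every prediction scans the whole stored training set.

```python
import math
from collections import Counter

def min_max_scale(rows):
    """Scale every feature to [0, 1]; also return a function that scales new rows."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]

    def scale(row):
        return [(x - l) / (h - l) if h > l else 0.0
                for x, l, h in zip(row, lo, hi)]

    return [scale(r) for r in rows], scale

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k stored instances nearest to query."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

There is no training step: the "model" is just the stored, scaled instances, so new data can be added by appending to `train_x` and `train_y`.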

4
Artificial Neural Network
(2 Marks Questions)

4.1. What are neurons ?
Ans. A neuron is a small cell that receives electro-chemical signals from its various sources and in return responds by transmitting electrical impulses to other neurons.

4.2. What is artificial neural network ?
Ans. Artificial neural networks are computational algorithms that are intended to simulate the behaviour of biological systems composed of neurons.

4.3. Give the difference between supervised and unsupervised learning in artificial neural network.
Ans.
S. No. | Supervised learning | Unsupervised learning
1. | It uses known and labeled data as input. | It uses unknown data as input.
2. | It uses offline analysis. | It uses real time analysis of data.
3. | Number of classes is known. | Number of classes is not known.
4. | Accurate and reliable results. | Moderately accurate and reliable results.

4.4. Define activation function.
Ans. An activation function is the basic element in neural model. It is used for limiting the amplitude of the output of a neuron. It is also called squashing function.

4.5. Give types of activation function.
Ans. Types of activation function :
1. Signum function
2. Sigmoidal function
3. Identity function
4. Binary step function
5. Bipolar step function

4.6. Give advantages of neural network.
Ans. Advantages of neural network :
1. A neural network can perform tasks that a linear program cannot.
2. It can be implemented in any application.
3. A neural network learns and does not need to be reprogrammed.

4.7. What are disadvantages of neural network (NN) ?
Ans. Disadvantages of neural network :
1. The neural network needs training to operate.
2. It requires high processing time for large NN.

4.8. List the various types of soft computing techniques and mention some application areas for neural network.
Ans. Types of soft computing techniques :
1. Fuzzy logic control
2. Neural network
3. Genetic algorithms
4. Support vector machine
Application areas for neural network :
1. Speech recognition
2. Character recognition
3. Signature verification application
4. Human face recognition

4.9. Draw a biological NN and explain the parts.
Ans.
1. Biological neural networks are made up of real biological neurons that are connected in the nervous system.
2. In general a biological neural network is composed of a group of chemically connected or functionally associated neurons.

[Figure : biological neuron with the labels Dendrites, Cell body (Soma) and Axon.]
Fig. 4.9.1.
A biological neuron has three major parts :
1. Soma or cell body : It contains the cell's nucleus and other vital components called organelles which perform specialized tasks.
2. A set of dendrites : It forms a tree like structure that spreads out from the cell. The neuron receives its input electrical signal along this set of dendrites.
3. Axon : It is a tubular extension from the cell (Soma) that carries an electrical signal away from Soma to another neuron for processing.

4.10. What is single layer feed forward network ?
Ans. Single layer feed forward network is the simplest form of a layered network, in which an input layer of source nodes projects onto an output layer of neurons, but not vice versa.

4.11. Write different applications of neural networks (NN).
Ans. Applications of NN are :
1. Image recognition
2. Data mining
3. Machine translation
4. Spell checking
5. Stock and sport bet prediction
6. Statistical modeling

4.12. Draw an artificial neural network.
Ans. An Artificial Neural Network (ANN) is a computational model based on the structure and functions of biological neural networks.
[Figure : network with the labels Input, Hidden and Output.]
Fig. 4.12.1.

4.13. What do you mean by neural network architecture ?
Ans. Neural network architecture refers to the arrangement of neurons into layers and the connection patterns between layers, activation functions, and learning methods. The neural network model and the architecture of a neural network determine how a network transforms its input into an output.

4.14. What are the types of neuron connection ?
Ans. Following are the types of neuron connection :
1. Single-layer feed forward network
2. Multilayer feed forward network
3. Single node with its own feedback
4. Single-layer recurrent network
5. Multilayer recurrent network

4.15. What is gradient descent ?
Ans. Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.

4.16. What are the types of gradient descent ?
Ans. Types of gradient descent are :
1. Batch gradient descent
2. Stochastic gradient descent
3. Mini-batch gradient descent
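The three types of gradient descent in Que 4.16 differ only in how many training examples are used for each update. The Python sketch below illustrates this on a hypothetical problem, fitting a line y = w*x + b by minimizing mean squared error; the model, the learning rate and the function names are assumptions made for the example.

```python
import random

def gradient(w, b, batch):
    """Gradient of the mean squared error of y = w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def gradient_descent(data, batch_size, lr=0.05, epochs=500, seed=0):
    """Move (w, b) against the gradient, one step per batch.

    batch_size == len(data) gives batch gradient descent,
    batch_size == 1 gives stochastic gradient descent,
    anything in between gives mini-batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[start:start + batch_size])
            w -= lr * gw  # step in the direction of steepest descent
            b -= lr * gb
    return w, b
```

A possible use is `gradient_descent(points, batch_size=len(points))` for the batch variant and `gradient_descent(points, batch_size=1)` for the stochastic variant, where `points` is a list of (x, y) pairs.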

4.17. What is self organizing map (SOM) ?
Ans.
1. Self Organizing Map (SOM) provides a data visualization technique which helps to understand high dimensional data by reducing the dimensions of data to a map.
2. SOM also represents clustering concept by grouping similar data together.

5
Reinforcement Learning
(2 Marks Questions)

5.1. Define genetic algorithm.
Ans. Genetic algorithms are computerized search and optimization algorithms based on the mechanics of natural genetics and natural selection. These algorithms mimic the principle of natural genetics and natural selection to construct search and optimization procedures.

5.2. Give the benefits of genetic algorithm.
Ans. Benefits of genetic algorithm are :
1. They are robust.
2. They provide optimization over a large state space.
3. They do not break on slight change in input or presence of noise.

5.3. What are the applications of genetic algorithm ?
Ans. Following are the applications of genetic algorithms :
1. Recurrent neural network
2. Mutation testing
3. Code breaking
4. Filtering and signal processing
5. Learning fuzzy rule base

5.4. What are the disadvantages of genetic algorithm ?
Ans. Disadvantages of genetic algorithm :
1. Identification of the fitness function is difficult as it depends on the problem.
2. The selection of suitable genetic operators is difficult.

5.5. Define genetic programming.

Ans. Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems that humans do not know how to solve.

5.6. What are the advantages of genetic programming ?
Ans. Advantages of genetic programming are :
1. It does not impose any fixed length of solution, so the maximum length can be extended up to hardware limits.
2. In genetic programming it is not necessary for an individual to have maximum knowledge of the problem and its solutions.

5.7. What are the disadvantages of genetic programming ?
Ans. Disadvantages of genetic programming are :
1. In GP, the number of possible programs that can be constructed by the algorithm is immense.
2. Although GP can use machine code, which helps in providing results very fast, a high level language that needs to be compiled can generate errors and can make the program slow.
3. There is a high probability that even a very small variation has a disastrous effect on fitness of the solution generated.

5.8. What are different types of genetic programming ?
Ans. Different types of genetic programming are :
1. Tree-based genetic programming
2. Stack-based genetic programming
3. Linear genetic programming
4. Grammatical evolution
5. Cartesian Genetic Programming (CGP)
6. Genetic Improvement Programming (GIP)

5.9. What are the functions of learning in evolution ?
Ans. Function of learning in evolution :
1. It allows individuals to adapt to changes in the environment that occur in the life span of an individual or across few generations.
2. It allows evolution to use information extracted from the environment thereby channeling evolutionary search.
3. It can help and guide evolution.

5.10. What are the disadvantages of learning in evolution ?
Ans. Disadvantages of learning in evolution are :
1. A delay in the ability to acquire fitness.
2. Increased unreliability.

5.11. Define learnable evolution model.
Ans. Learnable Evolution Model (LEM) is a non-Darwinian methodology for evolutionary computation that employs machine learning to guide the generation of new individuals (candidate problem solutions).

5.12. What are different phases of genetic algorithm ?
Ans. Different phases of genetic algorithm are :
1. Initial population
2. Fitness function
3. Selection
4. Crossover
5. Mutation
6. Termination

5.13. Define sequential covering algorithm.
Ans. Sequential covering algorithm is a general procedure that repeatedly learns a single rule to create a decision list (or set) that covers the entire dataset rule by rule.

5.14. Define Beam search.
Ans. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set.

5.15. What are the properties of heuristic search ?
Ans. Properties of heuristic search are :
1. Admissibility condition
2. Completeness condition
3. Dominance properties
4. Optimality property

5.16. What are different types of reinforcement learning ?

Ans. Different types of reinforcement learning are :


1. Positive reinforcement learning
2. Negative reinforcement learning

5.17. What are the elements of reinforcement learning ?


Ans. Elements of reinforcement learning are :
1. Policy (π)
2. Reward function (r)
3. Value function (V)
4. Transition model (M)

5.18. Define Q-learning.


Ans. Reinforcement learning is the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment. Q-learning is a model-free reinforcement learning algorithm that learns the value Q(s, a) of taking an action a in a state s, and it is typically easy to implement.
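A minimal sketch of tabular Q-learning in Python is given below. It uses the standard update Q(s, a) ← Q(s, a) + α [r + γ max Q(s', a') − Q(s, a)] with an ε-greedy choice of action. The environment is a hypothetical one invented for the example: a row of states in which the agent moves left or right and receives a reward of 1 on reaching the rightmost state. The parameter names and values are assumptions.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Learn Q-values for a row of states; action 0 = left, action 1 = right."""
    rng = random.Random(seed)
    moves = (-1, +1)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(goal, max(0, state + moves[action]))
            reward = 1.0 if next_state == goal else 0.0
            # Model-free update: only the observed transition is used.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

The agent never uses a transition model of the environment; it learns only from the rewards it observes, which is what model-free means here. After training, the greedy policy picks the action with the larger Q-value in each state.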

5.19. Define positive and negative reinforcement learning.


Ans. Positive reinforcement learning : Positive reinforcement occurs when an event that follows a particular behaviour increases the strength and the frequency of that behaviour.
Negative reinforcement learning : Negative reinforcement is
defined as strengthening of a behaviour because a negative condition
is stopped or avoided.


