QUANTUM SERIES
For
B.Tech Students of Third Year
of All Engineering Colleges Affiliated to
Dr. A.P.J. Abdul Kalam Technical University,
Uttar Pradesh, Lucknow
(Formerly Uttar Pradesh Technical University)
Kanika Dhama
Course Outcomes :
CO 1 : To understand the need for machine learning for various problem solving (K1, K2)
CO 2 : To understand a wide variety of learning algorithms and how to evaluate models generated from data (K1, K3)
CO 3 : To understand the latest trends in machine learning (K2, K3)
CO 4 : To design appropriate machine learning algorithms and apply the algorithms to real-world problems (K4, K6)
CO 5 : To optimize the models learned and report on the expected accuracy that can be achieved by applying the models (K4, K5)
DETAILED SYLLABUS (3-0-0)

Unit I (08 lectures) :
INTRODUCTION - Learning, Types of Learning, Well defined learning problems, Designing a Learning System, History of ML, Introduction of Machine Learning Approaches - (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian networks, Support Vector Machine, Genetic Algorithm), Issues in Machine Learning and Data Science Vs Machine Learning.
REGRESSION : Linear Regression and Logistic Regression.

Unit II (08 lectures) :
BAYESIAN LEARNING - Bayes theorem, Concept learning, Bayes Optimal Classifier, Naive Bayes classifier, Bayesian belief networks, EM algorithm.
SUPPORT VECTOR MACHINE : Introduction, Types of support vector kernel - (Linear kernel, polynomial kernel, and Gaussian kernel), Hyperplane - (Decision surface), Properties of SVM, and Issues in SVM.

Unit III (08 lectures) :
DECISION TREE LEARNING - Decision tree learning algorithm, Inductive bias, Inductive inference with decision trees, Entropy and information theory, Information gain, ID-3 Algorithm, Issues in Decision tree learning.
INSTANCE-BASED LEARNING - k-Nearest Neighbour Learning, Locally Weighted Regression, Radial basis function networks, Case-based learning.

Unit IV (08 lectures) :
ARTIFICIAL NEURAL NETWORKS - Perceptrons, Multilayer perceptron, Gradient descent and the Delta rule, Multilayer networks, Derivation of Backpropagation Algorithm, Generalization, Unsupervised Learning - SOM Algorithm and its variant.
DEEP LEARNING - Introduction, concept of convolutional neural network, Types of layers - (Convolutional layers, Activation function, Pooling, Fully connected), Concept of Convolution (1D and 2D) layers, Training of network, Case study of CNN, for example on Diabetic Retinopathy, Building a smart speaker, Self-driving car, etc.

Unit V (08 lectures) :
REINFORCEMENT LEARNING - Introduction to Reinforcement Learning, Learning Task, Example of Reinforcement Learning in Practice, Learning Models for Reinforcement - (Markov Decision process, Q Learning - Q Learning function, Q Learning Algorithm), Application of Reinforcement Learning, Introduction to Deep Q Learning.
GENETIC ALGORITHMS : Introduction, Components, GA cycle of reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Applications.
Text books :
1. Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (India) Private Limited, 2013.
2. Ethem Alpaydin, "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", The MIT Press, 2004.
3. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
4. Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer-Verlag, Berlin.
UNIT-1 : Introduction
PART-1
Learning, Types of Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.1. Define the term learning. What are the components of a learning system ?

Answer
1. Learning refers to the change in a subject's behaviour to a given situation brought about by repeated experiences in that situation, provided that the behaviour changes cannot be explained on the basis of native response tendencies, maturation, or temporary states of the subject.
2. A learning agent can be thought of as containing a performance element that decides what actions to take and a learning element that modifies the performance element so that it makes better decisions.
3. The design of a learning element is affected by three major issues :
   a. Components of the performance element.
   b. Feedback of components.
   c. Representation of the components.
The important components of learning are shown in the general learning model : the environment or teacher supplies stimuli and examples to the learner component, which updates the knowledge base; the performance component uses that knowledge to produce responses to tasks, and a critic / performance evaluator feeds the results back to the learner.
[Fig. 1.1.1 : General learning model.]
1. Acquisition of new knowledge :
a. One component of learning is the acquisition of new knowledge.
   b. Simple data acquisition is easy for computers, even though it is difficult for people.
2. Problem solving :
The other component of learning is the problem solving that is required both to integrate new knowledge presented to the system and to deduce new information when the required facts have not been presented.

Que 1.2. Write down the performance measures for learning.

Answer
Following are the performance measures for learning :
1. Generality :
   a. The most important performance measure for learning methods is the generality or scope of the method.
   b. Generality is a measure of the ease with which the method can be adapted to different domains of application.
   c. A completely general algorithm is one which is a fixed or self-adjusting configuration that can learn or adapt in any environment or application domain.
2. Efficiency :
   a. The efficiency of a method is a measure of the average time required to construct the target knowledge structures from some specified initial structures.
   b. Since this measure is often difficult to determine and is meaningless without some standard comparison time, a relative efficiency index can be used instead.
3. Robustness :
   a. Robustness is the ability of a learning system to function with unreliable feedback and with a variety of training examples, including noisy ones.
   b. A robust system must be able to build tentative structures which are subjected to modification or withdrawal if later found to be inconsistent with statistically sound structures.
4. Efficacy :
   a. The efficacy of a system is a measure of the overall power of the system. It is a combination of the factors generality, efficiency, and robustness.
5. Ease of implementation :
   a. Ease of implementation relates to the complexity of the programs and data structures, and the resources required to develop the given learning system.
   b. Lacking good complexity metrics, this measure will often be somewhat subjective.

Que 1.3. Discuss supervised and unsupervised learning.

Answer
Supervised learning :
1. Supervised learning is also known as associative learning, in which the network is trained by providing it with input and matching output patterns.
2. Supervised training requires the pairing of each input vector with a target vector representing the desired output.
3. The input vector together with the corresponding target vector is called a training pair.
[Fig. 1.3.1 : Supervised learning - the neural network maps the input feature vector to an output, which is matched against the target feature vector; the resulting error vector drives weight / threshold adjustment by the supervised learning algorithm.]
4. During the training session an input vector is applied to the network, and it results in an output vector.
5. This response is compared with the target response.
6. If the actual response differs from the target response, the network will generate an error signal.
7. This error signal is then used to calculate the adjustment that should be made in the synaptic weights so that the actual output matches the target output.
8. The error minimization in this kind of training requires a supervisor or teacher.
9. These input-output pairs can be provided by an external teacher, or by the system which contains the neural network (self-supervised).
10. Supervised training methods are used to perform non-linear mapping in pattern classification networks, pattern association networks and multilayer neural networks.
11. Supervised learning generates a global model that maps input objects to desired outputs.
12. In some cases, the map is implemented as a set of local models such as in case-based reasoning or the nearest neighbour algorithm.
13. In order to solve a problem of supervised learning, the following steps are considered :
    i. Determine the type of training examples.
    ii. Gather a training set.
    iii. Determine the input feature representation of the learned function.
    iv. Determine the structure of the learned function and the corresponding learning algorithm.
    v. Complete the design.
Unsupervised learning :
1. It is a learning in which an output unit is trained to respond to clusters of patterns within the input.
2. Unsupervised training is employed in self-organizing neural networks.
3. This training does not require a teacher.
4. In this method of training, the input vectors of similar types are grouped without the use of training data to specify how a typical member of each group looks or to which group a member belongs.
5. During training the neural network receives input patterns and organizes these patterns into categories.
6. When a new input pattern is applied, the neural network provides an output response indicating the class to which the input pattern belongs.
7. If a class cannot be found for the input pattern, a new class is generated.
8. Though unsupervised training does not require a teacher, it requires certain guidelines to form groups.
9. Grouping can be done based on colour, shape or any other property of the object.
10. It is a method of machine learning where a model is fit to observations.
11. It is distinguished from supervised learning by the fact that there is no a priori output.
12. In this, a data set of input objects is gathered.
13. It treats input objects as a set of random variables. It can be used in conjunction with Bayesian inference to produce conditional probabilities.
14. Unsupervised learning is useful for data compression and clustering.
[Fig. 1.3.2 : Block diagram of unsupervised learning - the environment supplies a vector describing its state directly to the learning system, with no teacher in the loop.]
15. In unsupervised learning, the system is supposed to discover statistically salient features of the input population.
16. Unlike the supervised learning paradigm, there is no a priori set of categories into which the patterns are to be classified; rather the system must develop its own representation of the input stimuli.

Que 1.4. Describe briefly reinforcement learning.

Answer
1. Reinforcement learning is the study of how artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms have been developed that are closely related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour, and in neurobiological investigations of neuromodulation and addiction.
[Fig. 1.4.1 : Block diagram of reinforcement learning - the environment sends a state (input) vector to the learning system and a primary reinforcement signal to the critic; the critic turns it into a heuristic reinforcement signal for the learning system, which sends actions back to the environment.]
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment.
5. An optimal policy is a policy that maximizes the expected total reward.
6. Without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make.
7. The agent needs to know that something good has happened when it wins and that something bad has happened when it loses.
8. This kind of feedback is called a reward or reinforcement.
10. The robot's task consists of finding out, through trial and error (or success), which actions are good in a certain situation and which are not.
11. In many cases humans learn in a very similar way.
12. For example, when a child learns to walk, this usually happens without instruction, rather simply through reinforcement.
13. Successful attempts at walking are rewarded by forward progress, and unsuccessful attempts are penalized by often painful falls.
14. Positive and negative reinforcement are also important factors in successful learning in school and in many sports.
15. In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels.

Que 1.5. What are the steps used to design a learning system ?

Answer
Steps used to design a learning system are :
1. Specify the learning task.
2. Choose a suitable set of training data to serve as the training experience.
3. Divide the training data into groups or classes and label accordingly.
4. Determine the type of knowledge representation to be learned from the training experience.
5. Choose a learner classifier that can generate general hypotheses from the training data.
6. Apply the learner classifier to test data.
7. Compare the performance of the system with that of an expert human.
[Fig. 1.5.1 : Design of a learning system - the environment / experience feeds the learner, the learner builds up knowledge, and the knowledge drives the performance element.]

PART-2
Well Defined Learning Problems, Designing a Learning System.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.6. Write short note on well defined learning problem with example.

Answer
Well defined learning problem : A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Three features in learning problems :
1. The class of tasks (T)
2. The measure of performance to be improved (P)
3. The source of experience (E)
For example :
1. A checkers learning problem :
   a. Task (T) : Playing checkers.
   b. Performance measure (P) : Percent of games won against opponents.
   c. Training experience (E) : Playing practice games against itself.
2. A handwriting recognition learning problem :
   a. Task (T) : Recognizing and classifying handwritten words within images.
   b. Performance measure (P) : Percent of words correctly classified.
   c. Training experience (E) : A database of handwritten words with given classifications.
3. A robot driving learning problem :
   a. Task (T) : Driving on public four-lane highways using vision sensors.
   b. Performance measure (P) : Average distance travelled before an error (as judged by a human overseer).
   c. Training experience (E) : A sequence of images and steering commands recorded while observing a human driver.

Que 1.7. Describe the role of well defined learning problems in machine learning.

Answer
Role of well defined learning problems in machine learning :
1. Learning to recognize spoken words :
   a. Successful speech recognition systems employ machine learning in some form.
   b. For example, the SPHINX system learns speaker-specific strategies for recognizing the primitive sounds (phonemes) and words from the observed speech signal.
   c. Neural network learning methods and methods for learning hidden Markov models are effective for automatically customizing to individual speakers, vocabularies, microphone characteristics, background noise, etc.
2. Learning to drive an autonomous vehicle :
   a. Machine learning methods have been used to train computer controlled vehicles to steer correctly when driving on a variety of road types.
   b. For example, the ALVINN system has used its learned strategies to drive unassisted at 70 miles per hour for 90 miles on public highways among other cars.
3. Learning to classify new astronomical structures :
   a. Machine learning methods have been applied to a variety of large databases to learn general regularities implicit in the data.
   b. For example, decision tree learning algorithms have been used by NASA to learn how to classify celestial objects from the second Palomar Observatory Sky Survey.
   c. This system is used to automatically classify all objects in the Sky Survey, which consists of three terabytes of image data.
4. Learning to play world class backgammon :
   a. The most successful computer programs for playing games such as backgammon are based on machine learning algorithms.
   b. For example, the world's top computer program for backgammon, TD-GAMMON, learned its strategy by playing over one million practice games against itself.

PART-3
History of ML, Introduction of Machine Learning Approaches - (Artificial Neural Network, Clustering, Reinforcement Learning, Decision Tree Learning, Bayesian Network, Support Vector Machine, Genetic Algorithm).

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.8. Describe briefly the history of machine learning.

Answer
A. Early history of machine learning :
1. In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper about neurons, and how they work. They created a model of neurons using an electrical circuit, and thus the neural network was created.
2. In 1952, Arthur Samuel created the first computer program which could learn as it ran.
3. Frank Rosenblatt designed the first artificial neural network in 1958, called the Perceptron. The main goal of this was pattern and shape recognition.
4. In 1959, Bernard Widrow and Marcian Hoff created two models of neural network. The first was called ADALINE, and it could detect binary patterns. For example, in a stream of bits, it could predict what the next one would be. The second was called MADALINE, and it could eliminate echo on phone lines.
B. 1980s and 1990s :
1. In 1982, John Hopfield suggested creating a network which had bidirectional lines, similar to how neurons actually work.
2. Use of back propagation in neural networks came in 1986, when researchers from the Stanford psychology department decided to extend an algorithm created by Widrow and Hoff in 1962. This allowed multiple layers to be used in a neural network, creating what are known as 'slow learners', which learn over a long period of time.
3. In 1997, the IBM computer Deep Blue, which was a chess-playing computer, beat the world chess champion.
4. In 1998, research at AT&T Bell Laboratories on digit recognition resulted in good accuracy in detecting handwritten postcodes from the US Postal Service.
C. 21st Century :
1. Since the start of the 21st century, many businesses have realised that machine learning will increase calculation potential. This is why they are researching more heavily in it, in order to stay ahead of the competition.
2. Some large projects include :
   i. GoogleBrain (2012)
   ii. AlexNet (2012)
   iii. DeepFace (2014)
   iv. DeepMind (2014)
   v. OpenAI (2015)
   vi. ResNet (2015)
   vii. U-net (2015)

Que 1.9. Explain briefly the term machine learning.

Answer
1. Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
2. Machine learning focuses on the development of computer programs that can access data.
3. The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
4. Machine learning enables analysis of massive quantities of data.
5. It generally delivers faster and more accurate results in order to identify profitable opportunities or dangerous risks.
6. Combining machine learning with AI and cognitive technologies can make it even more effective in processing large volumes of information.

Que 1.10. What are the applications of machine learning ?

Answer
Following are the applications of machine learning :
1. Image recognition :
   a. Image recognition is the process of identifying and detecting an object or a feature in a digital image or video.
   b. This is used in many applications like systems for factory automation, toll booth monitoring, and security surveillance.
2. Speech recognition :
   a. Speech Recognition (SR) is the translation of spoken words into text.
   b. It is also known as Automatic Speech Recognition (ASR), computer speech recognition, or Speech To Text (STT).
   c. In speech recognition, a software application recognizes spoken words.
3. Medical diagnosis :
   a. ML provides methods, techniques, and tools that can help in solving diagnostic and prognostic problems in a variety of medical domains.
   b. It is being used for the analysis of the importance of clinical parameters and their combinations for prognosis.
4. Statistical arbitrage :
   a. In finance, statistical arbitrage refers to automated trading strategies that are typically short-term and involve a large number of securities.
   b. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of quantities such as historical correlations and general economic variables.
5. Learning associations : Learning association is the process of discovering relations between variables in large databases.
6. Extraction :
   a. Information Extraction (IE) is another application of machine learning.
   b. It is the process of extracting structured information from unstructured data.

Que 1.11. What are the advantages and disadvantages of machine learning ?

Answer
Advantages of machine learning are :
1. Easily identifies trends and patterns :
   a. Machine learning can review large volumes of data and discover specific trends and patterns that would not be apparent to humans.
   b. For an e-commerce website like Flipkart, it serves to understand the browsing behaviours and purchase histories of its users to help cater to the right products, deals, and reminders relevant to them.
   c. It uses the results to reveal relevant advertisements to them.
2. No human intervention needed (automation) : Machine learning does not require physical force, i.e., no human intervention is needed.
3. Continuous improvement :
   a. As ML algorithms gain experience, they keep improving in accuracy and efficiency.
   b. As the amount of data keeps growing, algorithms learn to make accurate predictions faster.
4. Handling multi-dimensional and multi-variety data :
   a. Machine learning algorithms are good at handling data that are multi-dimensional and multi-variety, and they can do this in dynamic or uncertain environments.
Disadvantages of machine learning are :
1. Data acquisition :
   a. Machine learning requires massive data sets to train on, and these should be inclusive/unbiased, and of good quality.
2. Time and resources :
   a. ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a considerable amount of accuracy and relevancy.
   b. It also needs massive resources to function.
3. Interpretation of results :
   a. To accurately interpret results generated by the algorithms, we must carefully choose the algorithms for our purpose.
4. High error-susceptibility :
   a. Machine learning is autonomous but highly susceptible to errors.
   b. It takes time to recognize the source of the issue, and even longer to correct it.

Que 1.12. What are the advantages and disadvantages of different types of machine learning algorithm ?

Answer
Advantages of supervised machine learning algorithm :
1. Classes represent the features on the ground.
2. Training data is reusable unless features change.
Disadvantages of supervised machine learning algorithm :
1. Classes may not match spectral classes.
2. Varying consistency in classes.
3. Cost and time are involved in selecting training data.
Advantages of unsupervised machine learning algorithm :
1. No previous knowledge of the image area is required.
2. The opportunity for human error is minimised.
3. It produces unique spectral classes.
4. Relatively easy and fast to carry out.
Disadvantages of unsupervised machine learning algorithm :
1. The spectral classes do not necessarily represent the features on the ground.
2. It does not consider spatial relationships in the data.
3. It can take time to interpret the spectral classes.
Advantages of semi-supervised machine learning algorithm :
1. It is easy to understand.
2. It reduces the amount of annotated data used.
3. It is stable and fast convergent.
4. It is simple.
5. It has high efficiency.
Disadvantages of semi-supervised machine learning algorithm :
1. Iteration results are not stable.
2. It is not applicable to network level data.
3. It has low accuracy.
Advantages of reinforcement learning algorithm :
1. Reinforcement learning is used to solve complex problems that cannot be solved by conventional techniques.
2. This technique is preferred to achieve long-term results which are very difficult to achieve.
3. This learning model is very similar to the learning of human beings. Hence, it is close to achieving perfection.
Disadvantages of reinforcement learning algorithm :
1. Too much reinforcement learning can lead to an overload of states which can diminish the results.
2. Reinforcement learning is not preferable for solving simple problems.
3. Reinforcement learning needs a lot of data and a lot of computation.
4. The curse of dimensionality limits reinforcement learning for real physical systems.

Que 1.13. Write short note on Artificial Neural Network (ANN).

Answer
1. Artificial Neural Networks (ANN) or neural networks are computational algorithms that are intended to simulate the behaviour of biological systems composed of neurons.
2. ANNs are computational models inspired by an animal's central nervous system.
3. It is capable of machine learning as well as pattern recognition.
4. A neural network is an oriented graph. It consists of nodes which in the biological analogy represent neurons, connected by arcs.
5. It corresponds to dendrites and synapses. Each arc is associated with a weight at each node.
6. A neural network is a machine learning algorithm based on the model of a human neuron. The human brain consists of millions of neurons.
7. It sends and processes signals in the form of electrical and chemical signals.
8. These neurons are connected with a special structure known as synapses. Synapses allow neurons to pass signals.
9. An Artificial Neural Network is an information processing technique. It works the way the human brain processes information.
10. ANN includes a large number of connected processing units that work together to process information. They also generate meaningful results from it.
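The node-and-weighted-arc idea described above can be made concrete with a small sketch. The following Python snippet is illustrative only (the numbers and the function name are made up, not part of the original text) : one artificial neuron computes a weighted sum of its inputs and passes it through a sigmoid activation.

import math

def neuron(inputs, weights, bias):
    # One node of the oriented graph : a weighted sum of the incoming
    # signals followed by a sigmoid activation that squashes the result
    # into the range (0, 1).
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Arbitrary example : three input signals arriving over three weighted arcs.
print(neuron([0.5, 0.1, 0.9], [0.4, -0.6, 0.2], bias=0.1))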
Que 1.14. Write short note on clustering.

Answer
1. Clustering is a division of data into groups of similar objects.
2. Each group or cluster consists of objects that are similar among themselves and dissimilar to objects of other groups, as shown in Fig. 1.14.1.
[Fig. 1.14.1 : Clusters - data points grouped into separate clusters.]
3. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in the other clusters.
4. Clusters may be described as connected regions of a multidimensional space containing relatively high density points, separated from each other by a region containing relatively low density points.
5. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts.
6. Clustering analyzes data objects without the help of known class labels.
7. In clustering, the class labels are not present in the training data simply because they are not known when clustering the data objects.
8. Hence, it is a type of unsupervised learning.
9. For this reason, clustering is a form of learning by observation rather than learning by examples.
10. There are certain situations where clustering is useful. These include :
   a. The collection and classification of training data can be costly and time consuming, therefore it is difficult to collect a training data set, and a large number of training samples are not all labelled. Then it is useful to train a supervised classifier with a small portion of training data and then use clustering procedures to tune the classifier based on the large, unclassified dataset.
   b. For data mining, it can be useful to search for groupings among the data and then recognize the clusters.
   c. The properties of feature vectors can change over time. Then supervised classification is not reasonable, because the test feature vectors may have completely different properties.
   d. Clustering can be useful when it is required to search for good parametric families for the class conditional densities, in the case of supervised classification.

Que 1.15. What are the applications of clustering ?

Answer
Following are the applications of clustering :
1. Data reduction :
   a. In many cases, the amount of available data is very large and its processing becomes complicated.
   b. Cluster analysis can be used to group the data into a number of clusters and then process each cluster as a single entity.
   c. In this way, data compression is achieved.
2. Hypothesis generation :
   a. In this case, cluster analysis is applied to a data set to infer hypotheses that concern the nature of the data.
   b. Clustering is used here to suggest hypotheses that must be verified using other data sets.
3. Hypothesis testing : In this context, cluster analysis is used for the verification of the validity of a specific hypothesis.
4. Prediction based on groups :
   a. In this case, cluster analysis is applied to the available data set and then the resulting clusters are characterized based on the characteristics of the patterns by which they are formed.
2. Calculate the uncertainty of our dataset, i.e., the Gini impurity, or how much our data is mixed up, etc.
3. Generate a list of all questions which need to be asked at that node.
4. Partition rows into True rows and False rows based on each question asked.
5. Calculate the information gain based on the Gini impurity and the partition of data from the previous step.
6. Update the highest information gain based on each question asked.
7. Update the question based on information gain (higher information gain).
8. Divide the node on the question. Repeat again from step 1 until we get pure nodes (leaf nodes).
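A minimal Python sketch of steps 2 and 5 above is given below. It assumes the data is a list of (features, label) rows and that a question simply splits the rows into True and False partitions; the row format, the question and the toy data are illustrative assumptions, not part of the original text.

from collections import Counter

def gini(rows):
    # Gini impurity of a list of (features, label) rows : 1 - sum(p_k^2).
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def info_gain(parent, true_rows, false_rows):
    # Impurity of the parent minus the weighted impurity of the two partitions.
    p = len(true_rows) / len(parent)
    return gini(parent) - p * gini(true_rows) - (1 - p) * gini(false_rows)

# Toy usage : rows are (feature_dict, label) pairs and the question is "colour == 'red' ?".
rows = [({"colour": "red"}, "apple"), ({"colour": "red"}, "apple"),
        ({"colour": "yellow"}, "banana"), ({"colour": "yellow"}, "banana")]
true_rows = [r for r in rows if r[0]["colour"] == "red"]
false_rows = [r for r in rows if r[0]["colour"] != "red"]
print(gini(rows), info_gain(rows, true_rows, false_rows))   # 0.5 and 0.5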
Que 1.21. What are the advantages and disadvantages of decision tree method ?

Answer
Advantages of decision tree method are :
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical variables.
4. Decision trees provide a clear indication of the fields that are important for prediction or classification.
Disadvantages of decision tree method are :
1. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
2. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
3. Decision trees are computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found.
4. In decision tree algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared.

Que 1.22. Write short note on Bayesian belief networks.

Answer
1. Bayesian belief networks specify joint conditional probability distributions.
2. They are also known as belief networks, Bayesian networks, or probabilistic networks.
3. A belief network allows class conditional independencies to be defined between subsets of variables.
4. It provides a graphical model of causal relationships on which learning can be performed.
5. We can use a trained Bayesian network for classification.
6. There are two components that define a Bayesian belief network :
   a. Directed acyclic graph :
      i. Each node in a directed acyclic graph represents a random variable.
      ii. These variables may be discrete or continuous valued.
      iii. These variables may correspond to the actual attributes given in the data.
   Directed acyclic graph representation : The following diagram shows a directed acyclic graph for six Boolean variables.
   [Fig. : Directed acyclic graph over the six Boolean variables Family History, Smoker, Lung Cancer, Emphysema, Positive X-ray and Dyspnea, with Family History and Smoker as parents of Lung Cancer.]
      i. The arcs in the diagram allow representation of causal knowledge.
      ii. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker.
      iii. It is worth noting that the variable Positive X-ray is independent of whether the patient has a family history of lung cancer or whether the patient is a smoker, given that we know the patient has lung cancer.
   b. Conditional probability table :
      The conditional probability table for the values of the variable LungCancer (LC), showing each possible combination of the values of its parent nodes FamilyHistory (FH) and Smoker (S), is as follows :
             FH, S    FH, -S    -FH, S    -FH, -S
      LC     0.8      0.5       0.7       0.1
      -LC    0.2      0.5       0.3       0.9
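The conditional probability table above can be queried mechanically. The sketch below is illustrative (the dictionary layout and function name are assumptions, not a standard API) : it stores P(LC | FH, S) and returns either the table entry or its complement.

# P(LungCancer = True | FamilyHistory, Smoker), keyed by (FH, S) as in the table.
cpt_lc = {
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_lung_cancer(lc, fh, s):
    # Return P(LC = lc | FamilyHistory = fh, Smoker = s) from the table;
    # P(-LC | FH, S) is simply the complement, as in the second row.
    p_true = cpt_lc[(fh, s)]
    return p_true if lc else 1.0 - p_true

# Example query : probability of no lung cancer for a non-smoker with family history.
print(p_lung_cancer(False, True, False))   # 0.5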
2. SVM is a supervised learning method that looks at data and sorts it into one of two categories.
3. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible.
4. Applications of SVM :
   i. Text and hypertext classification
   ii. Image classification
   iii. Recognizing handwritten characters
   iv. Biological sciences, including protein classification

Que 1.24. Explain genetic algorithm with flow chart.

Answer
Genetic algorithm (GA) :
1. The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection.
2. The genetic algorithm repeatedly modifies a population of individual solutions.
3. At each step, the genetic algorithm selects individuals at random from the current population to be parents and uses them to produce the children for the next generation.
4. Over successive generations, the population evolves toward an optimal solution.
Flow chart : The genetic algorithm uses three main types of rules at each step to create the next generation from the current population :
a. Selection rule : Selection rules select the individuals, called parents, that contribute to the population at the next generation.
b. Crossover rule : Crossover rules combine two parents to form children for the next generation.
c. Mutation rule : Mutation rules apply random changes to individual parents to form children.
[Fig. 1.24.1 : Flow chart of the genetic algorithm - an initial population undergoes selection; the new population replaces the old population; if the quit condition is met the algorithm ends, otherwise crossover and mutation are applied and the cycle repeats.]
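A minimal Python sketch of the selection-crossover-mutation cycle shown in the flow chart is given below. The fitness function (count of 1-bits), population size, rates and generation count are arbitrary toy choices made for illustration; they are not part of the original text.

import random

def fitness(bits):
    # Toy objective : maximize the number of 1-bits in the string.
    return sum(bits)

def select(pop):
    # Selection rule : pick the fitter of two randomly chosen individuals (tournament).
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Crossover rule : single-point crossover of two parents.
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:]

def mutate(child, rate=0.05):
    # Mutation rule : flip each bit with a small probability.
    return [1 - bit if random.random() < rate else bit for bit in child]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):   # GA cycle of reproduction
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(len(population))]
best = max(population, key=fitness)
print(fitness(best), best)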
PART-4
Issues in Machine Learning and Data Science Vs. Machine Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 1.25. Briefly explain the issues related with machine learning.
Answer
Issues related with machine learning are :
1. Data quality :
   a. It is essential to have good quality data to produce quality ML algorithms and models.
   b. To get high-quality data, we must implement data evaluation, integration, exploration, and governance techniques prior to developing ML models.
   c. Accuracy of ML is driven by the quality of the data.
2. Transparency :
   a. It is difficult to make definitive statements on how well a model is going to generalize in new environments.
3. Manpower :
   a. Manpower means having data and being able to use it. This does not introduce bias into the model.
   b. There should be enough skill sets in the organization for software development and data collection.
4. Other :
   a. The most common issue with ML is people using it where it does not belong.
   b. Every time there is some new innovation in ML, we see overzealous engineers trying to use it where it's not really necessary.
   c. This used to happen a lot with deep learning and neural networks.
   d. Traceability and reproduction of results are two main issues.

Que 1.26. What are the classes of problem in machine learning ?

Answer
Common classes of problem in machine learning :
1. Classification :
   a. In classification, data is labelled, i.e., it is assigned a class, for example, spam/non-spam or fraud/non-fraud.
   b. The decision being modelled is to assign labels to new unlabelled pieces of data.
   c. This can be thought of as a discrimination problem, modelling the differences or similarities between groups.
2. Regression :
   a. Regression data is labelled with a real value rather than a label.
   b. The decision being modelled is what value to predict for new unpredicted data.
3. Clustering :
   a. In clustering, data is not labelled, but can be divided into groups based on similarity and other measures of natural structure in the data.
   b. For example, organising pictures by faces without names, where the human user has to assign names to groups, like iPhoto on the Mac.
4. Rule extraction :
   a. In rule extraction, data is used as the basis for the extraction of propositional rules.
   b. These rules discover statistically supportable relationships between attributes in the data.

Que 1.27. Differentiate between data science and machine learning.

Answer
1. Data science is a concept used to tackle big data and includes data cleansing, preparation, and analysis. Machine learning is defined as the practice of using algorithms to use data, learn from it and then forecast future trends for that topic.
2. Data science includes various data operations. Machine learning is a subset of Artificial Intelligence.
3. Data science works by sourcing, cleaning, and processing data to extract meaning out of it for analytical purposes. Machine learning uses efficient programs that can use data without being explicitly told to do so.
4. SAS, Tableau, Apache Spark and MATLAB are the tools used in data science. Amazon Lex, IBM Watson Studio and Microsoft Azure ML Studio are the tools used in ML.
5. Data science deals with structured and unstructured data. Machine learning uses statistical models.
6. Fraud detection and healthcare analysis are examples of data science. Recommendation systems such as Spotify and facial recognition are examples of machine learning.
UNIT-2 : Regression & Bayesian Learning
PART-1
Regression, Linear Regression and Logistic Regression.

Que 2.1. What is regression ? Explain the types of regression.

Answer
1. Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
2. Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.
There are two types of regression :
a. Simple linear regression : It uses one independent variable to explain or predict the outcome of the dependent variable Y.
   Y = a + bX + u
b. Multiple linear regression : It uses two or more independent variables to predict outcomes.
   Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u
Where :
   Y = the variable we are trying to predict (dependent variable).
   X = the variable that we are using to predict Y (independent variable).
   a = the intercept.
   b = the slope.
   u = the regression residual.
Que 2.2. Describe briefly linear regression.
Answer
1. Linear regression is a supervised machine learning algorithm where
the predicted output is continuous and has a constant slope.
2. It is used to predict values within a continuous range, (for example :
sales, price) rather than trying to classify them into categories (for
example : cat, dog).
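A minimal sketch of simple linear regression fitted by ordinary least squares is shown below, assuming NumPy is available; the training data is made up purely for illustration.

import numpy as np

# Hypothetical training data : X = years of experience, Y = salary (in lakhs).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Ordinary least squares estimates of the slope b and intercept a in Y = a + bX.
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

print(f"Y = {a:.2f} + {b:.2f} X")   # fitted line
print(a + b * 6.0)                  # predicted continuous value for X = 6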
1. Let ω1, ω2 be the two classes of the patterns. It is assumed that the a priori probabilities P(ω1) and P(ω2) are known.
2. Even if they are not known, they can easily be estimated from the available training feature vectors.
3. If N is the total number of available training patterns and N1, N2 of them belong to ω1 and ω2 respectively, then P(ω1) ≈ N1/N and P(ω2) ≈ N2/N.
4. The conditional probability density functions p(x|ωi), i = 1, 2, are also assumed to be known; they describe the distribution of the feature vectors in each of the classes.
5. The feature vectors can take any value in the l-dimensional feature space.
6. Density functions p(x|ωi) become probabilities and will be denoted by P(x|ωi) when the feature vectors can take only discrete values.
7. Consider the conditional probability,

   p(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}    ...(2.6.1)

   where p(x) is the probability density function of x, for which we have

   p(x) = \sum_{i=1}^{2} p(x \mid \omega_i)\, P(\omega_i)    ...(2.6.2)

14. From Fig. 2.6.1, it is obvious that errors are unavoidable. There is a finite probability for an x to lie in the R2 region and at the same time to belong to class ω1. Then there is an error in the decision.
[Fig. 2.6.1 : Bayesian classifier for the case of two equiprobable classes - the densities p(x|ω1) and p(x|ω2) cross at the threshold x0, which splits the feature axis into regions R1 and R2; the shaded overlap under the curves is the error region.]
15. The total probability, Pe, of committing a decision error for two equiprobable classes is given by,

   P_e = \frac{1}{2} \int_{-\infty}^{x_0} p(x \mid \omega_2)\, dx + \frac{1}{2} \int_{x_0}^{\infty} p(x \mid \omega_1)\, dx
which is equal to the total shaded area under the curves in Fig. 2.6.1.

Que 2.7. Explain how the decision error for Bayesian classification can be minimized.

Answer
1. The Bayesian classifier can be made optimal by minimizing the classification error probability.
2. In Fig. 2.7.1, it is observed that when the threshold is moved away from x0, the corresponding shaded area under the curves always increases.
3. Hence, we have to decrease this shaded area to minimize the error.
4. Let R1 be the region of the feature space for ω1 and R2 be the corresponding region for ω2.
5. Then an error will occur if x ∈ R1 although it belongs to ω2, or if x ∈ R2 although it belongs to ω1, i.e.,

   P_e = p(x \in R_2, \omega_1) + p(x \in R_1, \omega_2)    ...(2.7.1)

6. Pe can be written as,

   P_e = p(x \in R_2 \mid \omega_1) P(\omega_1) + p(x \in R_1 \mid \omega_2) P(\omega_2)
       = P(\omega_1) \int_{R_2} p(x \mid \omega_1)\, dx + P(\omega_2) \int_{R_1} p(x \mid \omega_2)\, dx    ...(2.7.2)

7. Using the Bayes rule,

   P_e = \int_{R_2} p(\omega_1 \mid x)\, p(x)\, dx + \int_{R_1} p(\omega_2 \mid x)\, p(x)\, dx    ...(2.7.3)

8. The error will be minimized if the partitioning regions R1 and R2 of the feature space are chosen so that

   R_1 : p(\omega_1 \mid x) > p(\omega_2 \mid x)
   R_2 : p(\omega_2 \mid x) > p(\omega_1 \mid x)    ...(2.7.4)

9. Since the union of the regions R1, R2 covers all the space, we have

   \int_{R_1} p(\omega_1 \mid x)\, p(x)\, dx + \int_{R_2} p(\omega_1 \mid x)\, p(x)\, dx = P(\omega_1)    ...(2.7.5)

10. Combining equations (2.7.3) and (2.7.5), we get,

   P_e = P(\omega_1) - \int_{R_1} \left( p(\omega_1 \mid x) - p(\omega_2 \mid x) \right) p(x)\, dx    ...(2.7.6)

11. Thus, the probability of error is minimized if R1 is the region of space in which p(ω1|x) > p(ω2|x). Then R2 becomes the region where the reverse is true.
12. In a classification task with M classes, ω1, ω2, ..., ωM, an unknown pattern, represented by the feature vector x, is assigned to class ωi if p(ωi|x) > p(ωj|x) for all j ≠ i.

Que 2.8. Consider the Bayesian classifier for the uniformly distributed classes, where :

   P(x \mid \omega_1) = \begin{cases} \dfrac{1}{a_2 - a_1}, & x \in [a_1, a_2] \\ 0, & \text{otherwise} \end{cases}

   P(x \mid \omega_2) = \begin{cases} \dfrac{1}{b_2 - b_1}, & x \in [b_1, b_2] \\ 0, & \text{otherwise} \end{cases}

Show the classification results for some values of a and b.

Answer
Typical cases are presented in Fig. 2.8.1.
[Fig. 2.8.1 : Four typical arrangements (a)-(d) of the two uniform densities, with heights 1/(a2 - a1) on [a1, a2] and 1/(b2 - b1) on [b1, b2], ranging from disjoint to overlapping intervals.]

Que 2.9. Define Bayes classifier. Explain how classification is done by using Bayes classifier.

Answer
1. A Bayes classifier is a simple probabilistic classifier based on applying Bayes theorem (from Bayesian statistics) with strong (Naive) independence assumptions.
The average risk, R, incurred by the classifier is

   R = C_{11} P_1 \int_{H_1} p_x(x \mid C_1)\, dx + C_{22} P_2 \int_{H_2} p_x(x \mid C_2)\, dx + C_{21} P_1 \int_{H_2} p_x(x \mid C_1)\, dx + C_{12} P_2 \int_{H_1} p_x(x \mid C_2)\, dx

where the various terms are defined as follows :
   Pi = prior probability that the observation vector x is drawn from subspace Hi, with i = 1, 2, and P1 + P2 = 1.
   Cij = cost of deciding in favour of class Ci represented by subspace Hi when class Cj is true, with i, j = 1, 2.
   px(x|Ci) = conditional probability density function of the random vector X, given that class Ci is true.
8. Fig. 2.9.1(a) depicts a block diagram representation of the Bayes classifier. The important points in this block diagram are as follows :
   a. The data processing in designing the Bayes classifier is confined entirely to the computation of the likelihood ratio Λ(x).
   b. This computation is completely invariant to the values assigned to the prior probabilities and costs involved in the decision-making process. These quantities merely affect the value of the threshold ξ.
   c. From a computational point of view, we find it more convenient to work with the logarithm of the likelihood ratio rather than the likelihood ratio itself.
For example :
1. Let D be a training set of features and their associated class labels. Each feature is represented by an n-dimensional attribute vector X = (x1, x2, ...., xn), depicting n measurements made on the feature from n attributes, respectively A1, A2, ....., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a feature X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the classifier predicts that X belongs to class Ci if and only if,

   p(C_i \mid X) > p(C_j \mid X) \quad \text{for } 1 \le j \le m,\; j \ne i

   Thus, we maximize p(Ci|X). The class Ci for which p(Ci|X) is maximized is called the maximum posterior hypothesis. By Bayes theorem,

   p(C_i \mid X) = \frac{p(X \mid C_i)\, p(C_i)}{p(X)}

3. As p(X) is constant for all classes, only p(X|Ci) p(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e., p(C1) = p(C2) = .... = p(Cm), and therefore p(X|Ci) is maximized. Otherwise p(X|Ci) p(Ci) is maximized.
4. i. Given data sets with many attributes, the computation of p(X|Ci) will be extremely expensive.
   ii. To reduce computation in evaluating p(X|Ci), the assumption of class conditional independence is made.
   iii. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the feature. Thus,

      p(X \mid C_i) = \prod_{k=1}^{n} p(x_k \mid C_i) = p(x_1 \mid C_i) \times p(x_2 \mid C_i) \times \dots \times p(x_n \mid C_i)

   iv. The probabilities p(x1|Ci), p(x2|Ci), ...., p(xn|Ci) are easily estimated from the training features. Here xk refers to the value of attribute Ak; for each attribute, it is checked whether the attribute is categorical or continuous valued.
   v. For example, to compute p(X|Ci) we consider,
      a. If Ak is categorical, then p(xk|Ci) is the number of features of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of features of class Ci in D.
      b. If Ak is continuous valued, then the continuous valued attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, defined by,

         g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

         so that p(xk|Ci) = g(xk, μCi, σCi).
   vi. There is a need to compute the mean μ and the standard deviation σ of the values of attribute Ak for the training set of class Ci. These values are used to estimate p(xk|Ci).
   vii. For example, let X = (35, Rs. 40,000) where A1 and A2 are the attributes age and income, respectively. Let the class label attribute be buys-computer.
   viii. The associated class label for X is yes (i.e., buys-computer = yes). Let's suppose that age has not been discretized and therefore exists as a continuous valued attribute.
   ix. Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 and σ = 12.
5. In order to predict the class label of X, p(X|Ci) p(Ci) is evaluated for each class Ci. The classifier predicts that the class label of X is the class Ci if and only if
   p(X|Ci) p(Ci) > p(X|Cj) p(Cj) for 1 ≤ j ≤ m, j ≠ i.
   The predicted class label is the class Ci for which p(X|Ci) p(Ci) is the maximum.

Que 2.11. Let blue, green, and red be three classes of objects with prior probabilities given by P(blue) = 1/4, P(green) = 1/2, P(red) = 1/4. Let there be three types of objects : pencils, pens, and paper. Let the class-conditional probabilities of these objects be given as follows. Use the Bayes classifier to classify pencil, pen and paper.
   P(pencil/green) = 1/3   P(pen/green) = 1/2   P(paper/green) = 1/6
   P(pencil/blue) = 1/2    P(pen/blue) = 1/6    P(paper/blue) = 1/3
   P(pencil/red) = 1/6     P(pen/red) = 1/3     P(paper/red) = 1/2

Answer
As per Bayes rule :
P(green/pencil) = P(pencil/green) P(green) / [P(pencil/green) P(green) + P(pencil/blue) P(blue) + P(pencil/red) P(red)]
                = (1/3 × 1/2) / (1/3 × 1/2 + 1/2 × 1/4 + 1/6 × 1/4) = (1/6) / (1/3) = 0.5
P(blue/pencil) = P(pencil/blue) P(blue) / [P(pencil/green) P(green) + P(pencil/blue) P(blue) + P(pencil/red) P(red)]
               = (1/2 × 1/4) / (1/3) = 0.375
P(red/pencil) = P(pencil/red) P(red) / [P(pencil/green) P(green) + P(pencil/blue) P(blue) + P(pencil/red) P(red)]
              = (1/6 × 1/4) / (1/3) = 0.125
Since P(green/pencil) has the highest value, pencil belongs to class green.
P(green/pen) = P(pen/green) P(green) / [P(pen/green) P(green) + P(pen/blue) P(blue) + P(pen/red) P(red)]
             = (1/2 × 1/2) / (1/2 × 1/2 + 1/6 × 1/4 + 1/3 × 1/4) = 0.25 / 0.375 = 0.667
P(blue/pen) = P(pen/blue) P(blue) / [P(pen/green) P(green) + P(pen/blue) P(blue) + P(pen/red) P(red)]
            = (1/6 × 1/4) / 0.375 = 0.111
P(red/pen) = P(pen/red) P(red) / [P(pen/green) P(green) + P(pen/blue) P(blue) + P(pen/red) P(red)]
           = (1/3 × 1/4) / 0.375 = 0.222
Since P(green/pen) has the highest value, pen belongs to class green.
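The calculation in Que 2.11 can be reproduced with a few lines of Python. The sketch below is illustrative (the dictionary and function names are assumptions) : it applies Bayes rule to the given priors and class-conditional probabilities and picks the class with the highest posterior for each object.

from fractions import Fraction as F

priors = {"blue": F(1, 4), "green": F(1, 2), "red": F(1, 4)}
likelihood = {                      # P(object | class) from the question
    "green": {"pencil": F(1, 3), "pen": F(1, 2), "paper": F(1, 6)},
    "blue":  {"pencil": F(1, 2), "pen": F(1, 6), "paper": F(1, 3)},
    "red":   {"pencil": F(1, 6), "pen": F(1, 3), "paper": F(1, 2)},
}

def posterior(obj):
    # P(class | object) for every class, via Bayes rule.
    evidence = sum(likelihood[c][obj] * priors[c] for c in priors)
    return {c: likelihood[c][obj] * priors[c] / evidence for c in priors}

for obj in ("pencil", "pen", "paper"):
    post = posterior(obj)
    winner = max(post, key=post.get)
    print(obj, {c: float(p) for c, p in post.items()}, "->", winner)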
Naive Bayes model :
1. The Naive Bayes model is the most common Bayesian network model used in machine learning.
2. Here, the class variable C is the root which is to be predicted and the attribute variables Xi are the leaves.
3. The model is Naive because it assumes that the attributes are conditionally independent of each other, given the class.

Que 2.16. Write a short note on Bayesian network.
OR
Explain Bayesian network by taking an example. How is the Bayesian network a powerful representation for uncertainty knowledge ?

Answer
1. A Bayesian network is a directed acyclic graph in which each node is annotated with quantitative probability information.
2. The full specification is as follows :
   i. A set of random variables makes up the nodes of the network; variables may be discrete or continuous.
   ii. A set of directed links or arrows connects pairs of nodes. If there is an arrow from node x to node y, x is said to be a parent of y.
   iii. Each node xi has a conditional probability distribution P(xi | parents(xi)) that quantifies the effect of the parents on the node.
   iv. The graph has no directed cycles (and hence is a directed acyclic graph or DAG).
[Fig. 2.16.1 : Example Bayesian network over the variables Sprinkler, Rain and Wet grass.]
Bayesian network possesses the following merits in uncertainty knowledge representation :
1. Bayesian network can conveniently handle incomplete data.
2. Bayesian network can learn the causal relation of variables. In data analysis, causal relation is helpful for field knowledge understanding; it can also easily lead to precise prediction even under much interference.
3. The combination of Bayesian network and Bayesian statistics can take full advantage of field knowledge and information from data.
4. The combination of Bayesian network and other models can effectively avoid the over-fitting problem.
Que 2.17. Explain the role of prior probability and posterior probability in Bayesian classification.

Answer
Role of prior probability :
1. The prior probability is used to compute the probability of the event before the collection of new data.
2. It is used to capture our assumptions / domain knowledge and is independent of the data.
3. It is the unconditional probability that is assigned before any relevant evidence is taken into account.

PART-3
Support Vector Machine, Introduction, Types of Support Vector Kernel - (Linear Kernel, Polynomial Kernel, and Gaussian Kernel), Hyperplane - (Decision Surface), Properties of SVM, and Issues in SVM.

Questions-Answers
Long Answer Type and Medium Answer Type Questions
5. When d = 2, the polynomial kernel computes the 2-dimensional relationship between each pair of observations, which helps to find the support vector classifier.

Que 2.22. Describe Gaussian Kernel (Radial Basis Function).

Answer
1. The RBF kernel is a function whose value depends on the distance from the origin or from some point.
2. The Gaussian kernel is of the following format :

   K(X_1, X_2) = \exp\!\left(-\gamma\, \lVert X_1 - X_2 \rVert^2\right)

   where ||X1 - X2|| is the Euclidean distance between X1 and X2.
   Using the distance in the original space, we calculate the dot product (similarity) of X1 and X2.
3. Following are the parameters used in the Gaussian kernel :
   a. C : Inverse of the strength of regularization.
      Behaviour : As the value of C increases the model overfits; as the value of C decreases the model underfits.
   b. γ (gamma, used only for the RBF kernel) :
      Behaviour : As the value of γ increases the model overfits; as the value of γ decreases the model underfits.
As the value of ‘’ decreases the model underfits. compared to other algorithms that are used in handling text data. This
leads to loss of sequential information and thereby, leading to worse
Que 2.23. Write short note on hyperplane (Decision surface). performance.
2. SVM cannot return the probabilistic confidence value that is similar to 3. The 'C' parameter :
logistic regression. This does not provide much explanation as the a. This parameter controls the amount of regularization applied on
confidence of prediction is important in several applications. the data.
3. The choice of the kernel is perhaps the biggest limitation of the support b. Large values of C mean low regularization which in turn causes
vector machine. Considering so many kernels present, it becomes difficult the training data to fit very well (may cause overfitting).
to choose the right one for the data.
c. Lower values of C mean higher regularization which causes the
Que 2.25. Explain the properties of SVM. model to be more tolerant of errors (may lead to lower accuracy).
Answer
Following are the properties of SVM :
1. Flexibility in choosing a similarity function : sparseness of the solution when dealing with large data sets; only support vectors are used to specify the separating hyperplane.
2. Ability to handle large feature spaces : the complexity does not depend on the dimensionality of the feature space.
3. Overfitting can be controlled by the soft margin approach : training is a simple convex optimization problem which is guaranteed to converge to a single global solution.
Answer
Parameters used in support vector classifier are :
1. Kernel :
a. The kernel is selected based on the type of data and also the type of transformation.
b. By default, the kernel is Radial Basis Function Kernel (RBF).
2. Gamma :
a. This parameter decides how far the influence of a single training
example reaches during transformation, which in turn affects how
tightly the decision boundaries end up surrounding points in the
input space.
b. If there is a small value of gamma, points farther apart are considered
similar.
c. So, more points are grouped together and have smoother decision
boundaries (may be less accurate).
d. With larger values of gamma, points need to be closer together to be considered similar (may cause overfitting).
3. The 'C' parameter :
a. This parameter controls the amount of regularization applied to the data.
b. Large values of C mean low regularization, which in turn causes the training data to fit very well (may cause overfitting).
c. Lower values of C mean higher regularization, which causes the model to be more tolerant of errors (may lead to lower accuracy).
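As an illustration of how these parameters are set in practice, here is a minimal sketch using scikit-learn's SVC (an assumed library choice; the toy data and the particular values of C and gamma are arbitrary, not prescribed by the text).

from sklearn.svm import SVC

# Toy training data: two features per sample, binary labels (illustrative values)
X_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]]
y_train = [0, 0, 1, 1]

# RBF kernel (the default), with explicit regularization C and kernel width gamma
clf = SVC(kernel='rbf', C=1.0, gamma=0.5)
clf.fit(X_train, y_train)

print(clf.predict([[1.5, 1.5]]))  # predicted class for a new point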
3
PART-1
Decision Tree Learning, Decision Tree Learning Algorithm,
Inductive Bias, Inductive Inference with Decision Trees.
Fig. 3.1.1.
4. Leaf / Terminal node : Nodes that do not split are called leaf or terminal nodes.
5. Pruning : When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of the splitting process.
6. Branch / Sub-tree : A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and child node : A node which is divided into sub-nodes is called the parent node of the sub-nodes, whereas the sub-nodes are the children of the parent node.
Answer
1. Decision trees can be visualized, and are simple to understand and interpret.
2. They require little data preparation, whereas other techniques often require data normalization, the creation of dummy variables and the removal of blank values.
3. The cost of using the tree (for predicting data) is logarithmic in the number of data points used to train the tree.
4. Decision trees can handle both categorical and numerical data, whereas other techniques are specialized for only one type of variable.
5. Decision trees can handle multi-output problems.
6. A decision tree is a white box model, i.e., the explanation for a condition can be given easily by Boolean logic because there are two outputs, for example yes or no.
7. Decision trees can be used even if assumptions are violated by the dataset from which the data is taken.

Que 3.3. How can we express decision trees ?

Answer
1. Decision trees classify instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance.
2. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute, as shown in Fig. 3.3.1.
3. This process is then repeated for the subtree rooted at the new node.
4. The decision tree in Fig. 3.3.1 classifies a particular morning according to whether it is suitable for playing tennis, returning the classification associated with the particular leaf.

Fig. 3.3.1. Decision tree for playing tennis : Outlook branches to Sunny, Overcast and Rain; under Sunny, Humidity = High gives No and Humidity = Normal gives Yes; Overcast gives Yes; under Rain, Wind = Strong gives No and Wind = Weak gives Yes.

5. For example, the instance
(Outlook = Rain, Temperature = Hot, Humidity = High, Wind = Strong)
would be sorted down the rightmost branch of this decision tree and would therefore be classified as a negative instance.
6. In other words, a decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances; for the tree of Fig. 3.3.1 :
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Que 3.4. Explain various decision tree learning algorithms.

Answer
Various decision tree learning algorithms are :
1. ID3 (Iterative Dichotomiser 3) :
i. ID3 is an algorithm used to generate a decision tree from a dataset.
ii. To construct a decision tree, ID3 uses a top-down, greedy search through the given sets, where each attribute at every tree node is tested to select the attribute that is best for classification of a given set.
iii. Therefore, the attribute with the highest information gain is selected as the test attribute of the current node.
iv. In this algorithm, small decision trees are preferred over larger ones. It is a heuristic algorithm because it does not construct the smallest tree.
v. For building a decision tree model, ID3 only accepts categorical attributes. Accurate results are not given by ID3 when there is noise or when it is serially implemented.
vi. Therefore, data is preprocessed before constructing a decision tree.
vii. For constructing a decision tree, information gain is calculated for each and every attribute, and the attribute with the highest information gain becomes the root node. The remaining possible values are denoted by arcs.
viii. All the possible outcome instances are examined to check whether they belong to the same class or not. For instances of the same class, a single name is used to denote the class; otherwise the instances are classified on the basis of the splitting attribute.
2. C4.5 :
i. C4.5 is an algorithm used to generate a decision tree. It is an extension of the ID3 algorithm.
ii. C4.5 generates decision trees which can be used for classification, and therefore C4.5 is referred to as a statistical classifier.
iii. It is better than the ID3 algorithm because it deals with both continuous and discrete attributes, and also with missing values and with pruning trees after construction.
iv. C5.0 is the commercial successor of C4.5; it is faster, memory efficient and used for building smaller decision trees.
v. C4.5 performs a tree pruning process by default. This leads to the formation of smaller trees and simpler rules, and produces more intuitive interpretations.
3. CART :
i. The CART algorithm builds both classification and regression trees.
ii. The classification tree is constructed by CART through binary splitting of the attribute.
iii. The Gini index is used for selecting the splitting attribute.
iv. CART is also used for regression analysis with the help of the regression tree.
v. The regression feature of CART can be used in forecasting a dependent variable given a set of predictor variables over a given period of time.
vi. CART has an average speed of processing and supports both continuous and nominal attribute data.

Que 3.5. What are the advantages and disadvantages of different decision tree learning algorithms ?

Disadvantages of ID3 algorithm :
2. For making a decision, only one attribute is tested at an instant, thus consuming a lot of time.
3. Classifying continuous data may prove to be expensive in terms of computation, as many trees have to be generated to see where to break the continuous sequence.
4. It is overly sensitive to features when given a large number of input values.
Advantages of C4.5 algorithm :
1. C4.5 is easy to implement.
2. C4.5 builds models that can be easily interpreted.
3. It can handle both categorical and continuous values.
4. It can deal with noise and missing value attributes.
Disadvantages of C4.5 algorithm :
1. A small variation in data can lead to different decision trees when using C4.5.
2. For a small training set, C4.5 does not work very well.
Advantages of CART algorithm :
1. CART can handle missing values automatically using proxy splits.
2. It uses a combination of continuous/discrete variables.
3. CART automatically performs variable selection.
4. CART can establish interactions among variables.
5. CART does not vary according to monotonic transformations of the predictive variables.
Disadvantages of CART algorithm :
1. CART has unstable decision trees.
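For reference, a minimal scikit-learn sketch is given below (an assumed library; the toy dataset is illustrative). Note that sklearn's DecisionTreeClassifier is a CART-style learner; setting criterion='entropy' makes it split by information gain as in ID3/C4.5, while criterion='gini' uses the Gini index as in CART.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: each row is [outlook, humidity, wind] encoded as integers (illustrative)
X = [[0, 1, 0], [0, 0, 0], [1, 1, 0], [2, 1, 1], [2, 0, 0], [2, 0, 1]]
y = [0, 1, 1, 0, 1, 0]  # 1 = play tennis, 0 = do not play

# criterion='entropy' selects splits by information gain; 'gini' would use the Gini index
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree.fit(X, y)

print(export_text(tree, feature_names=['outlook', 'humidity', 'wind']))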
3. If all Examples are negative, return the single-node tree Root, with label = –.
4. If Attributes is empty, return the single-node tree Root, with label = most common value of the target attribute in Examples.
5. Otherwise begin :
a. A ← the attribute from Attributes that best classifies Examples.
b. The decision attribute for Root ← A.
c. For each possible value Vi of A :
i. Add a new tree branch below Root, corresponding to the test A = Vi.
ii. Let Examples_Vi be the subset of Examples that have value Vi for A.
iii. If Examples_Vi is empty :
a. Then below this new branch add a leaf node with label = most common value of TargetAttribute in Examples.
b. Else below this new branch add the sub-tree ID3(Examples_Vi, TargetAttribute, Attributes – {A}).
6. End.
7. Return Root.
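The attribute selection in step 5(a) above is based on information gain. A minimal sketch of the underlying calculation is given below (illustrative helper functions, not part of the original text); it computes entropy and the information gain of splitting a set of labelled examples on one attribute.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute_index):
    """Gain = entropy(S) - sum(|Sv|/|S| * entropy(Sv)) over the values v of the attribute."""
    total = len(labels)
    gain = entropy(labels)
    by_value = {}
    for row, label in zip(examples, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    for subset in by_value.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Example: gain of splitting on attribute 0 (values are illustrative)
X = [['sunny', 'high'], ['sunny', 'normal'], ['rain', 'high'], ['overcast', 'high']]
y = ['no', 'yes', 'no', 'yes']
print(information_gain(X, y, 0))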
Que 3.9. Explain inductive bias with inductive system.

Answer
Inductive bias :
1. Inductive bias refers to the restrictions that are imposed by the assumptions made in the learning method.
2. For example, assume that the solution to the problem of road safety can be expressed as a conjunction of a set of eight concepts.
3. This does not allow for more complex expressions that cannot be expressed as a conjunction.
4. This inductive bias means that there are some potential solutions that we cannot explore, because they are not contained within the version space we examine.
5. In order to have an unbiased learner, the version space would have to contain every possible hypothesis that could possibly be expressed.
6. The solution that the learner produced could then never be more general than the complete set of training data.
7. In other words, it would be able to classify data that it had previously seen (as the rote learner could) but would be unable to generalize in order to classify new, unseen data.
8. The inductive bias of the candidate elimination algorithm is that it is only able to classify a new piece of data if all the hypotheses contained within its version space give the data the same classification.
9. Hence, the inductive bias does impose a limitation on the learning method.
Inductive system :

Fig. 3.9.1. Inductive system : the training examples and a new instance are given to the candidate elimination algorithm, which uses hypothesis space H to output a classification of the new instance (or "do not know").

Que 3.10. Explain inductive learning algorithm.

Answer
Inductive learning algorithm :
Step 1 : Divide the table 'T' containing m examples into n sub-tables (t1, t2, ..., tn), one table for each possible value of the class attribute (repeat steps 2 to 8 for each sub-table).
Step 2 : Initialize the attribute combination count j = 1.
Step 3 : For the sub-table on which work is going on, divide the attribute list into distinct combinations, each combination with j distinct attributes.
Step 4 : For each combination of attributes, count the number of occurrences of attribute values that appear under the same combination of attributes in unmarked rows of the sub-table under consideration and, at the same time, do not appear under the same combination of attributes of other sub-tables. Call the first combination with the maximum number of occurrences the max-combination MAX.
Step 5 : If MAX == null, increase j by 1 and go to Step 3.
Step 6 : Mark all rows of the sub-table being worked on, in which the values of MAX appear, as classified.
Step 7 : Add a rule (IF attribute = "XYZ" THEN decision is YES / NO) to R (the rule set), whose left-hand side has the attribute names of MAX with their values separated by AND, and whose right-hand side contains the decision attribute value associated with the sub-table.
Step 8 : If all rows are marked as classified, then move on to process another sub-table and go to Step 2; else, go to Step 4. If no sub-tables are available, exit with the set of rules obtained till then.
Que 3.11. Which learning algorithms are used in inductive bias ?

Answer
Learning algorithms used in inductive bias are :
1. Rote-learner :
a. Learning corresponds to storing each observed training example in memory.
b. Subsequent instances are classified by looking them up in memory.
c. If the instance is found in memory, the stored classification is returned.
d. Otherwise, the system refuses to classify the new instance.
e. Inductive bias : There is no inductive bias.
2. Candidate-elimination :
a. New instances are classified only in the case where all members of the current version space agree on the classification.
b. Otherwise, the system refuses to classify the new instance.
c. Inductive bias : The target concept can be represented in its hypothesis space.
3. FIND-S :
a. This algorithm finds the most specific hypothesis consistent with the training examples.
b. It then uses this hypothesis to classify all subsequent instances.
c. Inductive bias : The target concept can be represented in its hypothesis space, and all instances are negative instances unless the opposite is entailed by its other knowledge.

Que 3.12. Discuss the issues related to the applications of decision trees.

Answer
Issues related to the applications of decision trees are :
1. Missing data :
a. Values may have gone unrecorded, or they might be too expensive to obtain.
b. Two problems then arise :
i. How to classify an object that is missing some of the test attributes.
ii. How to modify the information gain formula when examples have unknown values for the attribute.
2. Multi-valued attributes :
a. When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness.
b. In the extreme case, we could use an attribute that has a different value for every example.
c. Then each subset of examples would be a singleton with a unique classification, so the information gain measure would have its highest value for this attribute, even though the attribute could be irrelevant or useless.
d. One solution is to use the gain ratio.
3. Continuous and integer-valued input attributes :
a. Height and weight have an infinite set of possible values.
b. Rather than generating infinitely many branches, decision tree learning algorithms find the split point that gives the highest information gain.
c. Efficient dynamic programming methods exist for finding good split points, but this is still the most expensive part of real-world decision tree learning applications.
4. Continuous-valued output attributes :
a. If we are trying to predict a numerical value, such as the price of a work of art, rather than discrete classifications, then we need a regression tree.
b. Such a tree has a linear function of some subset of the numerical attributes, rather than a single value, at each leaf.
c. The learning algorithm must decide when to stop splitting and begin applying linear regression using the remaining attributes.

PART-3
Instance-based Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.13. Write short note on instance-based learning.
Answer
Functions of instance-based learning are :
1. Similarity function :
a. This computes the similarity between a training instance i and the instances in the concept description.
b. Similarities are numeric-valued.
2. Classification function :
a. This receives the similarity function's results and the classification performance records of the instances in the concept description.
b. It yields a classification for i.
3. Concept description updater :
a. This maintains records on classification performance and decides which instances to include in the concept description.
b. Inputs include i, the similarity results, the classification results, and a current concept description. It yields the modified concept description.

Que 3.17. What are the advantages and disadvantages of instance-based learning ?

Answer
Advantages of instance-based learning :
1. Learning is trivial.
2. It works efficiently.
3. It is noise resistant.
4. Rich representation, arbitrary decision surfaces.
5. Easy to understand.
Disadvantages of instance-based learning :
1. Needs lots of data.
2. Computational cost is high.
3. Restricted to x ∈ Rn.
4. Implicit weighting of attributes (needs normalization).
5. Needs large space for storage, i.e., requires large memory.
6. Expensive application time.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.18. Describe K-Nearest Neighbour algorithm with steps.

Answer
1. The KNN classification algorithm is used to decide to which class a new instance should belong.
2. When K = 1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
4. KNN classification does not have a training phase; all instances are stored. Training uses indexing to find neighbours quickly.
5. During testing, the KNN classification algorithm has to find the K nearest neighbours of a new instance. This is time consuming if we do exhaustive comparison.
6. K-nearest neighbours use the local neighbourhood to obtain a prediction.
Algorithm : Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr. This means each element of this array represents a tuple (x, y).
2. For i = 0 to m :
Calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
4. Return the majority label among S.
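The steps above translate almost directly into code. Below is a minimal NumPy sketch of the K-nearest neighbour rule (the function name and the toy data are illustrative, not from the text).

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, p, k=3):
    """Classify point p by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    distances = np.sqrt(((X_train - np.asarray(p, dtype=float)) ** 2).sum(axis=1))  # Euclidean
    nearest = np.argsort(distances)[:k]            # indices of the k smallest distances
    labels = [y_train[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]    # majority label

# Illustrative data: two features per sample
X = [[1, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
y = ['A', 'A', 'B', 'B', 'B']
print(knn_predict(X, y, p=[5, 4], k=3))  # expected: 'B'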
Que 3.19. What are the advantages and disadvantages of K-nearest neighbour algorithm ?

Answer
Advantages of KNN algorithm :
1. No training period :
a. KNN is called a lazy learner (instance-based learning).
b. It does not learn anything in the training period. It does not derive any discriminative function from the training data.
c. In other words, there is no training period for it. It stores the training dataset and learns from it only at the time of making real-time predictions.
d. This makes the KNN algorithm much faster than other algorithms that require training, for example SVM, linear regression, etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly, which will not impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two parameters required to implement KNN, i.e., the value of K and the distance function (for example, Euclidean).
Disadvantages of KNN :
1. Does not work well with large datasets : In large datasets, the cost of calculating the distance between the new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions : The KNN algorithm does not work well with high dimensional data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate the distance in each dimension.
3. Needs feature scaling : We need to do feature scaling (standardization and normalization) before applying the KNN algorithm to any dataset. If we do not do so, KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers : KNN is sensitive to noise in the dataset. We need to manually handle missing values and remove outliers.

Que 3.20. Explain locally weighted regression.

Answer
1. Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a parameterized model.
2. After training, the model is used for predictions and the data are generally discarded.
3. In contrast, memory-based methods are non-parametric approaches that explicitly retain the training data and use it each time a prediction needs to be made.
4. Locally Weighted Regression (LWR) is a memory-based method that performs a regression around a point using only training data that are local to that point.
5. LWR was shown to be suitable for real-time control by constructing an LWR-based system that learned a difficult juggling task.

Fig. 3.20.1. Locally weighted regression : the fit at a query point x is obtained from a regression on nearby data only.

6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs a linear regression on points in the data set, weighted by a kernel centered at x.
7. The kernel shape is a design parameter for which the original LOESS model uses a tricubic kernel :
hi(x) = h(x – xi) = exp(– k (x – xi)²)
where k is a smoothing parameter.
8. For brevity, we drop the argument x in hi(x) and define n = Σi hi. We can then write the estimated means and covariances as :
μx = Σi hi xi / n,   σx² = Σi hi (xi – μx)² / n,   σxy = Σi hi (xi – μx)(yi – μy) / n,
μy = Σi hi yi / n,   σy² = Σi hi (yi – μy)² / n,   σ²y|x = σy² – σxy² / σx²
9. We use the data covariances to express the conditional expectation and its estimated variance :
ŷ = μy + (σxy / σx²)(x – μx),
σŷ² = σ²y|x [ Σi hi² / n² + ((x – μx)² / σx²) · Σi hi² (xi – μx)² / (n² σx²) ]

Fig. 3.20.2. Effect of kernel width : a kernel that is too wide includes the non-linear region, a kernel that is too narrow excludes some of the linear region; in between, the kernel is just right.
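A minimal NumPy sketch of the prediction step described above is given below (illustrative only; the kernel width k and the data values are arbitrary assumptions). It computes the kernel weights, the weighted means and covariances, and the locally weighted prediction ŷ at a query point.

import numpy as np

def lwr_predict(xs, ys, x_query, k=1.0):
    """Locally weighted regression prediction at x_query (1-D inputs)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    h = np.exp(-k * (x_query - xs) ** 2)        # kernel weights centred at the query point
    n = h.sum()
    mu_x, mu_y = (h * xs).sum() / n, (h * ys).sum() / n
    var_x = (h * (xs - mu_x) ** 2).sum() / n    # weighted variance of x
    cov_xy = (h * (xs - mu_x) * (ys - mu_y)).sum() / n
    return mu_y + (cov_xy / var_x) * (x_query - mu_x)

# Illustrative data lying near the line y = 2x, with a bend at larger x
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
ys = [0.0, 1.1, 2.0, 2.9, 4.2, 9.0]
print(lwr_predict(xs, ys, x_query=1.2, k=2.0))  # close to 2.4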
Que 3.21. Explain Radial Basis Function (RBF).

Answer
1. A Radial Basis Function (RBF) is a function that assigns a real value to each input from its domain (it is a real-valued function), and the value produced by the RBF is always an absolute value, i.e., it is a measure of distance and cannot be negative.
2. The Euclidean distance (the straight-line distance) between two points in Euclidean space is used.
3. Radial basis functions are used to approximate functions, in the same way that neural networks act as function approximators.
4. The following sum represents a radial basis function network :
y(x) = Σi = 1 to N  wi φ(||x – xi||)
5. The radial basis functions act as activation functions.
6. The approximant y(x) is differentiable with respect to the weights, which are learned using iterative update methods common among neural networks.

Que 3.22. Explain the architecture of a radial basis function network.

Answer
1. Radial Basis Function (RBF) networks have three layers : an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer.
2. The input can be modeled as a vector of real numbers x ∈ Rn.
3. The output of the network is then a scalar function of the input vector, φ : Rn → R, and is given by
φ(x) = Σi = 1 to N  ai ρ(||x – ci||)
where N is the number of neurons in the hidden layer, ci is the center vector for neuron i and ai is the weight of neuron i in the linear output neuron.

Fig. 3.22.1. Architecture of a radial basis function network : an input vector x is used as input to all radial basis functions, each with different parameters; the output of the network is a linear combination (linear weights) of the outputs from the radial basis functions.

4. Functions that depend only on the distance from a center vector are radially symmetric about that vector.
5. In the basic form, all inputs are connected to each hidden neuron.
6. The radial basis function is taken to be Gaussian :
ρ(||x – ci||) = exp(– β ||x – ci||²)
7. The Gaussian basis functions are local to the center vector in the sense that
lim||x|| → ∞ ρ(||x – ci||) = 0
i.e., changing the parameters of one neuron has only a small effect for input values that are far away from the center of that neuron.
8. Given certain mild conditions on the shape of the activation function, RBF networks are universal approximators on a compact subset of Rn.
9. This means that an RBF network with enough hidden neurons can approximate any continuous function on a closed, bounded set with arbitrary precision.
10. The parameters ai, ci and β are determined in a manner that optimizes the fit between φ and the data.
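A minimal NumPy sketch of the forward pass just described is shown below (the centers, weights and β are illustrative values, not from the text); it evaluates φ(x) = Σ ai · exp(– β ||x – ci||²).

import numpy as np

def rbf_network_output(x, centers, weights, beta=1.0):
    """Forward pass of an RBF network: weighted sum of Gaussian basis functions."""
    x = np.asarray(x, float)
    centers = np.asarray(centers, float)
    dists_sq = ((centers - x) ** 2).sum(axis=1)   # ||x - c_i||^2 for every hidden neuron
    activations = np.exp(-beta * dists_sq)        # Gaussian radial basis functions
    return float(np.dot(weights, activations))    # linear output layer

# Illustrative network with three hidden neurons in a 2-D input space
centers = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
weights = [0.5, -1.0, 2.0]
print(rbf_network_output([1.0, 0.5], centers, weights, beta=2.0))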
PART-5
Case-based Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 3.23. Write short note on case-based learning algorithm.

Answer
1. Case-Based Learning (CBL) algorithms take as input a sequence of training cases and output a concept description, which can be used to generate predictions of goal feature values for subsequently presented cases.
4. There is no simple way in which they can process symbolic-valued feature values.

(Figure : the case-based learning cycle, in which a proposed solution is revised into a confirmed solution and then retained as part of a new case : Reuse, Revise, Retain.)

Que 3.24. What are the functions of case-based learning algorithm ?

Answer
Functions of case-based learning algorithm are :
1. Pre-processor : This prepares the input for processing (for example, normalizing the range of numeric-valued features to ensure that they are treated with equal importance by the similarity function, and formatting the raw input into a set of cases).
4. Retain the new solution as a part of a new case.

Que 3.26. What are the benefits of CBL as a lazy problem solving method ?

Answer
The benefits of CBL as a lazy problem solving method are :
1. Ease of knowledge elicitation :
a. Lazy methods can utilise easily available case or problem instances instead of rules that are difficult to extract.
b. So, classical knowledge engineering is replaced by case acquisition and structuring.
2. Absence of problem-solving bias :
a. Cases can be used for multiple problem-solving purposes, because they are stored in a raw form.
b. This is in contrast to eager methods, which can be used merely for the purpose for which the knowledge has already been compiled.
3. Incremental learning :
a. A CBL system can be put into operation with a minimal set of solved cases furnishing the case base.
b. The case base will then be filled with new cases, increasing the system's problem-solving ability.
c. Besides augmentation of the case base, new indexes and clusters of categories can be created and the existing ones can be changed.
d. Eager methods, in contrast, require a special training period whenever knowledge extraction (knowledge generalisation) is performed.
e. Hence, dynamic on-line adaptation in a non-rigid environment is possible.
4. Suitability for complex and not fully formalised solution spaces :
a. CBL systems can be applied to an incomplete model of the problem domain; implementation involves both identifying the relevant case features and furnishing, possibly, a partial case base with proper cases.
b. Lazy approaches are more appropriate for complex solution spaces than eager approaches, which replace the presented data with abstractions obtained by generalisation.
5. Suitability for sequential problem solving :
a. Sequential tasks, like those encountered in reinforcement learning problems, benefit from the storage of history in the form of a sequence of states or procedures.
6. Ease of explanation :
a. The results of a CBL system can be justified based upon the similarity of the current problem to the retrieved case.
b. Because CBL results are easily traceable to precedent cases, it is also easier to analyse failures of the system.
7. Ease of maintenance : This is particularly due to the fact that CBL systems can adapt to many changes in the problem domain and the relevant environment merely by acquiring new cases.

Que 3.27. What are the limitations of CBL ?

Answer
Limitations of CBL are :
1. Handling large case bases :
a. High memory / storage requirements and time-consuming retrieval accompany CBL systems utilising large case bases.
b. Although the order of both is linear with the number of cases, these problems usually lead to increased construction costs and reduced system performance.
c. These problems become less significant as hardware components become faster and cheaper.
2. Dynamic problem domains :
a. CBL systems may have difficulties in handling dynamic problem domains, where they may be unable to follow a shift in the way problems are solved, since they are strongly biased towards what has already worked.
b. This may result in an outdated case base.
3. Handling noisy data :
a. Parts of the problem situation may be irrelevant to the problem itself.
b. Unsuccessful assessment of such noise present in a problem situation currently imposed on a CBL system may result in the same problem being unnecessarily stored numerous times in the case base because of the differences due to the noise.
c. In turn, this implies inefficient storage and retrieval of cases.
4. Fully automatic operation :
a. In a CBL system, the problem domain is not fully covered.
b. Hence, some problem situations can occur for which the system has no solution.
c. In such situations, CBL systems expect input from the user.

Que 3.28. What are the applications of CBL ?

Answer
Applications of CBL :
1. Interpretation : It is a process of evaluating situations / problems in some context (for example, HYPO for interpretation of patent laws, KICS for interpretation of building regulations, LISSA for interpretation of non-destructive test measurements).
2. Classification : It is a process of explaining a number of encountered symptoms (for example, CASEY for classification of auditory impairments, CASCADE for classification of software failures, PAKAR for causal classification of building defects, ISFER for classification of facial expressions into user-defined interpretation categories).
3. Design : It is a process of satisfying a number of posed constraints (for example, JULIA for meal planning, CLAVIER for design of optimal layouts of composite airplane parts, EADOCS for aircraft panel design).
4. Planning : It is a process of arranging a sequence of actions in time (for example, BOLERO for building diagnostic plans for medical patients, TOTLEC for manufacturing planning).
5. Advising : It is a process of resolving diagnosed problems (for example, DECIDER for advising students, HOMER).

Que 3.29. What are major paradigms of machine learning ?

Answer
Major paradigms of machine learning are :
1. Rote learning :
a. There is one-to-one mapping from inputs to stored representations.
b. Learning by memorization.
c. There is association-based storage and retrieval.
2. Induction : Machine learning uses specific examples to reach general conclusions.
3. Clustering : Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
4. Analogy : Determine a correspondence between two different representations.
5. Discovery : Unsupervised, i.e., a specific goal is not given.
6. Genetic algorithms :
a. Genetic algorithms are stochastic search algorithms which act on a population of possible solutions.
b. They are probabilistic search methods, meaning that the states which they explore are not determined solely by the properties of the problems.
7. Reinforcement :
a. In reinforcement learning, only feedback (a positive or negative reward) is given at the end of a sequence of steps.
b. It requires assigning reward to steps by solving the credit assignment problem : which steps should receive credit or blame for a final result.

Que 3.30. Briefly explain the inductive learning problem.

Answer
Inductive learning problems are :
1. Supervised versus unsupervised learning :
a. We want to learn an unknown function f(x) = y, where x is an input example and y is the desired output.
b. Supervised learning implies we are given a set of (x, y) pairs by a teacher.
c. Unsupervised learning means we are only given the xs.
d. In either case, the goal is to estimate f.
2. Concept learning :
a. Given a set of examples of some concept / class / category, determine if a given example is an instance of the concept or not.
b. If it is an instance, we call it a positive example.
c. If it is not, it is called a negative example.
3. Supervised concept learning :
a. Given a training set of positive and negative examples of a concept, construct a description that will accurately classify whether future examples are positive or negative.
b. That is, learn some good estimate of the function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each yi is either + (positive) or – (negative).

4
Artificial Neural Network and Deep Learning
PART-1
Artificial Neural Network, Perceptron's, Multilayer Perceptron, Gradient Descent and the Delta Rule.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.1. Describe Artificial Neural Network (ANN) with different layers.

Answer
Artificial Neural Network : Refer Q. 1.13, Page 1–14L, Unit-1.
A neural network contains the following three layers :
a. Input layer : The activity of the input units represents the raw information that is fed into the network.
b. Hidden layer :
i. The hidden layer is used to determine the activity of each hidden unit.
ii. These activities depend on the activities of the input units and the weights on the connections between the input and the hidden units.
iii. There may be one or more hidden layers.
c. Output layer : The behaviour of the output units depends on the activity of the hidden units and the weights between the hidden and output units.

Que 4.2. What are the advantages and disadvantages of Artificial Neural Network ?

Answer
Advantages of Artificial Neural Networks (ANN) :
1. Problems in ANN are represented by attribute-value pairs.
2. ANNs are used for problems where the target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
3. ANN learning methods are quite robust to noise in the training data. The training examples may contain errors, which do not affect the final output.
5. ANNs can bear long training times depending on factors such as the number of weights in the network, the number of training examples considered, and the settings of various learning algorithm parameters.
Disadvantages of Artificial Neural Networks (ANN) :
1. Hardware dependence :
a. Artificial neural networks require processors with parallel processing power, by their very structure.
b. For this reason, the realization of the equipment is hardware dependent.
2. Unexplained functioning of the network :
a. This is the most important problem of ANN.
b. When ANN gives a probing solution, it does not give a clue as to why and how.
c. This reduces trust in the network.
3. Assurance of proper network structure :
a. There is no specific rule for determining the structure of artificial neural networks.
b. The appropriate network structure is achieved through experience and trial and error.
4. The difficulty of showing the problem to the network :
a. ANNs can work only with numerical information.
b. Problems have to be translated into numerical values before being introduced to the ANN.
c. The display mechanism to be determined will directly influence the performance of the network.
d. This is dependent on the user's ability.
5. The duration of the network is unknown :
a. When the network's error on the sample is reduced to a certain value, the training is considered complete.
b. This value does not give us optimum results.

Que 4.3. What are the characteristics of Artificial Neural Network ?

Answer
Characteristics of Artificial Neural Network are :
1. It is a neurally implemented mathematical model.
2. It contains a large number of interconnected processing elements, called neurons, which perform all the operations.
3. The information stored in the neurons is basically the weighted linkage of neurons.
4. The input signals arrive at the processing elements through connections and connecting weights.
5. It has the ability to learn, recall and generalize from the given data by suitable assignment and adjustment of weights.
6. The collective behaviour of the neurons describes its computational power, and no single neuron carries specific information.

Que 4.4. Explain the application areas of artificial neural network.

Answer
Application areas of artificial neural network are :
1. Speech recognition :
a. Speech occupies a prominent role in human-human interaction.
b. Therefore, it is natural for people to expect speech interfaces with computers.
c. In the present era, for communication with machines, humans still need sophisticated languages which are difficult to learn and use.
d. To ease this communication barrier, a simple solution could be communication in a spoken language that the machine is able to understand.
e. Hence, ANN is playing a major role in speech recognition.
2. Character recognition :
a. It is a problem which falls under the general area of pattern recognition.
b. Many neural networks have been developed for automatic recognition of handwritten characters, either letters or digits.
3. Signature verification application :
a. Signatures are useful ways to authorize and authenticate a person in legal transactions.
b. Signature verification is a non-vision based technique.
c. For this application, the first approach is to extract the feature set, or rather the geometrical feature set, representing the signature.
d. With these feature sets, we have to train the neural network using an efficient neural network algorithm.
e. This trained neural network will classify the signature as being genuine or forged in the verification stage.
4. Human face recognition :
a. It is one of the biometric methods used to identify a given face.
b. It is a difficult task because of the characterization of "non-face" images.
c. However, if a neural network is well trained, it can divide images into two classes, namely images having faces and images that do not have faces.

Que 4.5. Explain different types of neuron connection with architecture.

Answer
Different types of neuron connection are :
1. Single-layer feed forward network :
a. In this type of network, we have only two layers, i.e., the input layer and the output layer, but the input layer does not count because no computation is performed in this layer.
b. The output layer is formed when different weights are applied on the input nodes and the cumulative effect per node is taken.
c. After this, the neurons collectively give the output layer to compute the output signals.

(Figure : single-layer feed forward network; input nodes x1, x2, ..., xn are connected to output nodes y1, y2, ..., ym through weights w11, w12, ..., wnm.)

2. Multilayer feed forward network :
a. This network has a hidden layer which is internal to the network and has no direct contact with the external layer.
b. The existence of one or more hidden layers enables the network to be computationally stronger.
c. There are no feedback connections in which outputs of the model are fed back into itself.
(Figure : multilayer feed forward network; inputs x1, ..., xn feed a hidden layer y1, ..., yk through weights wij, and the hidden layer feeds outputs z1, ..., zm through weights vjk.)

3. Single node with its own feedback :
a. When outputs can be directed back as inputs to the same layer or to preceding layer nodes, this results in a feedback network.
b. Recurrent networks are feedback networks with a closed loop. Fig. 4.5.1 shows a single recurrent network having a single neuron with feedback to itself.

Fig. 4.5.1. Single node with its own feedback : the node's output is fed back to its own input.

5. Multilayer recurrent network :
a. In this type of network, the output of a processing element can be directed to a processing element in the same layer and in the preceding layer, forming a multilayer recurrent network.
b. They perform the same task for every element of a sequence, with the output depending on the previous computations. Inputs are not needed at each time step.
c. The main feature of a multilayer recurrent neural network is its hidden state, which captures information about a sequence.
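To make the single-layer and multilayer feed forward computations above concrete, here is a minimal NumPy sketch (the weight values and the choice of a sigmoid activation are illustrative assumptions): a single-layer network is one matrix-vector product, and a multilayer network simply chains such products through a hidden layer.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])          # input vector (n = 3 inputs)

# Single-layer feed forward network: inputs connect directly to m = 2 outputs
W = rng.normal(size=(2, 3))             # output weights w_ij
y_single = sigmoid(W @ x)

# Multilayer feed forward network: inputs -> hidden layer (k = 4) -> outputs (m = 2)
W_hidden = rng.normal(size=(4, 3))      # weights w_ij into the hidden layer
V_output = rng.normal(size=(2, 4))      # weights v_jk from hidden layer to outputs
y_multi = sigmoid(V_output @ sigmoid(W_hidden @ x))

print(y_single, y_multi)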
Que 4.11. Explain delta rule. Explain generalized delta learning rule (error backpropagation learning rule).

Answer
Delta rule :
1. The delta rule is a specialized version of backpropagation's learning rule that is used with single-layer neural networks.
2. It calculates the error between the calculated output and the sample output data, and uses this to create a modification to the weights, thus implementing a form of gradient descent.
Generalized delta learning rule (error backpropagation learning) :
In the generalized delta learning rule (error backpropagation learning) we are given the training set :
{(x1, y1), ..., (xK, yK)}
where xk = [xk1, ..., xkn] ∈ Rn and yk ∈ R, k = 1, ..., K.
Step 1 : η > 0 and Emax > 0 are chosen.
Step 2 : The weights w are initialized at small random values, k = 1, and the running error E is set to 0.

PART-2
Multilayer Network, Derivation of Back Propagation Algorithm, Generalization.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.12. Write short note on backpropagation algorithm.

Answer
1. Backpropagation is an algorithm used in the training of feedforward neural networks for supervised learning.
2. Backpropagation efficiently computes the gradient of the loss function with respect to the weights of the network for a single input-output example.
3. This makes it feasible to use gradient methods for training multilayer networks; to update the weights so as to minimize the loss, we use gradient descent or variants such as stochastic gradient descent.
4. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, iterating backwards one layer at a time from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.
5. The term backpropagation strictly refers only to the algorithm for computing the gradient, but it is often used loosely to refer to the entire learning algorithm, including how the gradient is used, such as by stochastic gradient descent.
6. Backpropagation generalizes the gradient computation in the delta rule, which is the single-layer version of backpropagation, and is in turn generalized by automatic differentiation, in which backpropagation is a special case of reverse accumulation (reverse mode).
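A minimal sketch of the delta rule (single-layer gradient descent on a squared error) is given below; the learning rate, the toy data and the number of epochs are illustrative assumptions, not values from the text.

import numpy as np

# Toy training set: inputs x_k and targets y_k (target is roughly 2*x1 - x2)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([2.0, -1.0, 1.0, 3.0])

w = np.zeros(2)          # weights of a single linear unit
eta = 0.05               # learning rate (eta > 0)

for epoch in range(200):
    for x_k, y_k in zip(X, y):
        output = w @ x_k                 # unit output
        error = y_k - output             # difference between target and output
        w += eta * error * x_k           # delta rule: w <- w + eta * (y - o) * x

print(w)  # approaches [2, -1] for this data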
Que 4.13. Explain the perceptron.

Answer
1. The perceptron is the simplest form of a neural network used for the classification of patterns that are said to be linearly separable.
2. It consists of a single neuron with adjustable synaptic weights and bias.
3. The perceptron built around a single neuron is limited to performing pattern classification with only two classes.
4. By expanding the output layer of the perceptron to include more than one neuron, more than two classes can be classified.
5. Suppose a perceptron has synaptic weights denoted by w1, w2, ..., wm.
6. The inputs applied to the perceptron are denoted by x1, x2, ..., xm.
7. The externally applied bias is denoted by b.

Fig. 4.13.1. Signal flow graph of the perceptron : inputs x1, x2, ..., xm are multiplied by weights w1, w2, ..., wm, summed together with the bias b to give V, and passed through a hard limiter to give the output y.

8. From the model, we find that the hard limiter input, or induced local field, of the neuron is
V = Σi = 1 to m  wi xi + b
9. The goal of the perceptron is to correctly classify the set of externally applied inputs x1, x2, ..., xm into one of two classes G1 and G2.
10. The decision rule for classification is that if the output y is +1, then assign the point represented by inputs x1, x2, ..., xm to class G1; else, if y is –1, assign it to class G2.
11. In Fig. 4.13.2, a point (x1, x2) lying below the boundary line is assigned to class G2 and a point lying above the line is assigned to class G1. The decision boundary is calculated as :
w1x1 + w2x2 + b = 0

Fig. 4.13.2. Decision boundary w1x1 + w2x2 + b = 0 separating class G1 from class G2 in the (x1, x2) plane.

12. There are two decision regions separated by a hyperplane defined as :
Σi = 1 to m  wi xi + b = 0
The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an iteration-by-iteration basis.
13. For the adaptation, an error-correction rule known as the perceptron convergence algorithm is used.
14. For a perceptron to function properly, the two classes G1 and G2 must be linearly separable.
15. Linearly separable means that the patterns or sets of inputs to be classified must be separable by a straight line.
16. Generalizing, a set of points in n-dimensional space is linearly separable if there is a hyperplane of (n – 1) dimensions that separates the sets.

Fig. 4.13.3. (a) A pair of linearly separable patterns. (b) A pair of non-linearly separable patterns.
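A minimal NumPy sketch of the perceptron error-correction rule is shown below; the learning rate and the linearly separable toy data are illustrative assumptions.

import numpy as np

def train_perceptron(X, labels, eta=0.1, epochs=20):
    """Perceptron learning: w <- w + eta * (d - y) * x, with the bias folded into the weights."""
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, float)])  # prepend +1 for the bias b
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, d in zip(X, labels):                  # d is the desired output (+1 or -1)
            y = 1 if w @ x >= 0 else -1              # hard limiter
            w += eta * (d - y) * x                   # update only when misclassified
    return w

# Linearly separable toy data: class G1 (+1) above the boundary, class G2 (-1) below it
X = [[2, 3], [3, 3], [1, 1], [2, 0]]
d = [1, 1, -1, -1]
print(train_perceptron(X, d))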
Que 4.14. State and prove perceptron convergence theorem.

Answer
Statement : The perceptron convergence theorem states that for any data set which is linearly separable, the perceptron learning rule is guaranteed to find a solution in a finite number of steps.
Proof :
1. We derive the error-correction learning algorithm for the perceptron.
2. In the perceptron convergence theorem, the synaptic weights w1, w2, ..., wm of the perceptron are adapted on an iteration-by-iteration basis.
3. The bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
x(n) = [+1, x1(n), x2(n), ..., xm(n)]T
where n denotes the iteration step in applying the algorithm.
4. Correspondingly, we define the weight vector as
w(n) = [b(n), w1(n), w2(n), ..., wm(n)]T
Accordingly, the linear combiner output is written in the compact form :
v(n) = Σi = 0 to m  wi(n) xi(n) = wT(n) x(n)

Que 4.15. Explain multilayer perceptron with its architecture and characteristics.

Answer
Multilayer perceptron :
1. Perceptrons which are arranged in layers are called a multilayer perceptron. This model has three layers : an input layer, an output layer and a hidden layer.
2. For the perceptrons in the input layer, a linear transfer function is used, and for the perceptrons in the hidden layer and output layer, the sigmoidal or squashed-S function is used.
3. The input signal propagates through the network in a forward direction, on a layer-by-layer basis.
4. In the multilayer perceptron, the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1 :
x(n) = [+1, x1(n), x2(n), ..., xm(n)]T
where n denotes the iteration step in applying the algorithm. Correspondingly, we define the weight vector as :
w(n) = [b(n), w1(n), w2(n), ..., wm(n)]T
5. Accordingly, the linear combiner output is written in the compact form :
V(n) = Σi = 0 to m  wi(n) xi(n) = wT(n) x(n)
b. Error signals : An error signal originates at an output neuron and propagates backward through the network.

(Figure : function signals propagate forward from the input towards the output layer, while error signals propagate backward.)

Que 4.17. Discuss selection of various parameters in Backpropagation Neural Network (BPN).

Answer
Selection of various parameters in BPN :
1. Number of hidden nodes :
a. The guiding criterion is to select the minimum number of nodes in the first and third layers, so that the memory demand for storing the weights can be kept minimum.
b. The number of separable regions in the input space, M, is a function of the number of hidden nodes H in BPN, and H = M – 1.
c. When the number of hidden nodes is equal to the number of training patterns, the learning can be fastest.
d. In such cases, BPN simply remembers the training patterns, losing all generalization capability.
e. Hence, as far as generalization is concerned, the number of hidden nodes should be small compared to the number of training patterns, with the help of the Vapnik-Chervonenkis dimension (VCdim) of probability theory.
f. We can estimate the selection of the number of hidden nodes for a given number of training patterns from the number of weights, which is equal to I1 * I2 + I2 * I3, where I1 and I3 denote the input and output nodes and I2 denotes the hidden nodes.
g. Assume the number of training samples T to be greater than VCdim. Now, if we accept the ratio 10 : 1,
10 T = I2 (I1 + I3)
I2 = 10 T / (I1 + I3)
which yields the value for I2.
2. Momentum coefficient :
a. To reduce the training time we use the momentum factor, because it enhances the training process.
b. The influence of momentum on weight change is shown in Fig. 4.17.1.
c. If the momentum factor is zero, the smoothening is minimum and the entire weight adjustment comes from the newly calculated change.
d. If the momentum factor is one, the new adjustment is ignored and the previous one is repeated.
e. Between 0 and 1 is a region where the weight adjustment is smoothened by an amount proportional to the momentum factor.
f. The momentum factor effectively increases the speed of learning without leading to oscillations, and filters out high-frequency variations of the error surface in the weight space.
3. Learning coefficient :
a. A formula relating the learning coefficient to the weight change is :
[ΔW]n+1 = – η (∂E / ∂W) + α [ΔW]n
where – η (∂E / ∂W) is the weight change without momentum and α [ΔW]n is the momentum term.

Fig. 4.17.1. Influence of momentum term on weight change.

4. Sigmoidal gain :
a. When the weights become large and force the neuron to operate in a region where the sigmoidal function is very flat, a better method of coping with network paralysis is to adjust the sigmoidal gain.
b. By decreasing this scaling factor, we effectively spread out the sigmoidal function over a wide range so that training proceeds faster.
5. Local minima :
a. One of the most practical solutions involves the introduction of a shock which changes all weights by specific or random amounts.
b. If this fails, then the most practical solution is to re-randomize the weights and start the training all over.

PART-3
Unsupervised Learning, SOM Algorithm and its Variants.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.18. Write short note on unsupervised learning.

Answer
1. Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.
2. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.
3. Unlike supervised learning, no teacher is provided, which means no training will be given to the machine.
4. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.

Que 4.19. Classify unsupervised learning into two categories of algorithm.

Answer
Classification of unsupervised learning algorithms into two categories :
1. Clustering : A clustering problem is where we want to discover the inherent groupings in the data, such as grouping customers by purchasing behaviour.
2. Association : An association rule learning problem is where we want to discover rules that describe large portions of our data, such as people that buy X also tend to buy Y.

Que 4.20. What are the applications of unsupervised learning ?

Answer
Following are the applications of unsupervised learning :
1. Unsupervised learning automatically splits the dataset into groups based on their similarities.
2. Anomaly detection can discover unusual data points in our dataset. It is useful for finding fraudulent transactions.
3. Association mining identifies sets of items which often occur together in our dataset.
4. Latent variable models are widely used for data preprocessing, such as reducing the number of features in a dataset or decomposing the dataset into multiple components.

Que 4.21. What is Self-Organizing Map (SOM) ?

Answer
1. Self-Organizing Map (SOM) provides a data visualization technique which helps to understand high dimensional data by reducing the dimensions of the data to a map.
2. SOM also represents the clustering concept by grouping similar data together.
3. A Self-Organizing Map (SOM) or Self-Organizing Feature Map (SOFM) is a type of Artificial Neural Network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction.
4. Self-organizing maps differ from other artificial neural networks in that they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in the sense that they use a neighbourhood function to preserve the topological properties of the input space.

Que 4.22. Write the steps used in SOM algorithm.

Answer
Following are the steps used in the SOM algorithm :
1. Each node's weights are initialized.
2. A vector is chosen at random from the set of training data.
3. Every node is examined to calculate whose weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).
4. Then the neighbourhood of the BMU is calculated. The number of neighbours decreases over time.
5. The winning node is rewarded by becoming more like the sample vector. The neighbours also become more like the sample vector. The closer a node is to the BMU, the more its weights get altered, and the farther away the neighbour is from the BMU, the less it learns.
6. Repeat step 2 for N iterations.
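A minimal NumPy sketch of one SOM training iteration following these steps is given below (the map size, learning rate and neighbourhood radius are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 5, 5, 3
weights = rng.random((grid_h, grid_w, dim))      # step 1: initialize each node's weights
data = rng.random((100, dim))                    # illustrative training vectors

def train_step(weights, x, lr=0.5, radius=1.5):
    # step 3: find the Best Matching Unit (node whose weights are closest to x)
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # steps 4-5: move the BMU and its neighbours towards x, weighted by grid distance
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist_sq = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist_sq / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)
    return weights

for _ in range(200):                             # step 6: repeat for N iterations
    x = data[rng.integers(len(data))]            # step 2: pick a random training vector
    weights = train_step(weights, x)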
Que 4.23. What are the basic processes used in SOM ? Also explain stages of SOM algorithm.

Answer
Basic processes used in the SOM algorithm are :
1. Initialization : All the connection weights are initialized with small random values.
2. Competition : For each input pattern, the neurons compute their respective values of a discriminant function which provides the basis for competition. The particular neuron with the smallest value of the discriminant function is declared the winner.
3. Cooperation : The winning neuron determines the spatial location of a topological neighbourhood of excited neurons, thereby providing the basis for cooperation among neighbouring neurons.
4. Adaptation : The excited neurons decrease their individual values of the discriminant function in relation to the input pattern through suitable adjustment of the associated connection weights, such that the response of the winning neuron to the subsequent application of a similar input pattern is enhanced.
Stages of the SOM algorithm are :
1. Initialization : Choose random values for the initial weight vectors wj.
2. Sampling : Draw a sample training input vector x from the input space.
3. Matching : Find the winning neuron I(x) whose weight vector is closest to the input vector, i.e., the one with the minimum value of
dj(x) = Σi = 1 to D  (xi – wji)²

2. Deep learning is used where the data is complex and has large datasets.
3. Facebook uses deep learning to analyze text in online conversations. Google and Microsoft use deep learning for image search and machine translation.
4. All modern smart phones have deep learning systems running on them. For example, deep learning is the standard technology for speech recognition, and also for face detection on digital cameras.
5. In the healthcare sector, deep learning is used to process medical images (X-rays, CT and MRI scans) and diagnose health conditions.
6. Deep learning is also at the core of self-driving cars, where it is used for localization and mapping, motion planning and steering, and environment perception, as well as tracking driver state.

Que 4.25. Describe different architecture of deep learning.

Answer
Advantages of deep learning :
1. Best-in-class performance on problems.
2. Reduces the need for feature engineering.
3. Eliminates unnecessary costs.
4. Identifies defects easily that are difficult to detect.
Disadvantages of deep learning :
1. A large amount of data is required.
2. Computationally expensive to train.
3. No strong theoretical foundation.
Limitations of deep learning :
1. Learning through observations only.
2. The issue of biases.

Que 4.27. What are the various applications of deep learning ?

Answer
Following are the applications of deep learning :
1. Automatic text generation : A corpus of text is learned, and from this model new text is generated, word-by-word or character-by-character. The model is capable of learning how to spell, punctuate and form sentences, and it may even capture the style.
2. Healthcare : Helps in diagnosing various diseases and treating them.
3. Automatic machine translation : Certain words, sentences or phrases in one language are transformed into another language (deep learning is achieving top results in the areas of text and images).
4. Image recognition : Recognizes and identifies people and objects in images, as well as understanding content and context. This area is already being used in gaming, retail, tourism, etc.
5. Predicting earthquakes : Teaches a computer to perform viscoelastic computations, which are used in predicting earthquakes.

Que 4.28. Define convolutional networks.

Answer
1. Convolutional networks, also known as Convolutional Neural Networks (CNNs), are a specialized kind of neural network for processing data that has a known, grid-like topology.
2. The name convolutional neural network indicates that the network employs a mathematical operation called convolution.
3. Convolution is a specialized kind of linear operation.
4. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
5. CNNs (ConvNets) are quite similar to regular neural networks.
6. They are still made up of neurons with weights that can be learned from data. Each neuron receives some inputs and performs a dot product.
7. They still have a loss function on the last fully connected layer.
8. They can still use a non-linearity function. A regular neural network receives input data as a single vector and passes it through a series of hidden layers.

Fig. 4.28.1. A regular three-layer neural network : an input layer, two hidden layers and an output layer.

9. Every hidden layer consists of neurons, wherein every neuron is fully connected to all the neurons in the previous layer.
10. Within a single layer, each neuron is completely independent and they do not share any connections.
11. The fully connected layer (the output layer) contains class scores in the case of an image classification problem. There are three main layers in a simple ConvNet :
c. Non-linearity
d. Pooling layer
Que 4.29. Write short note on convolutional layer.

Answer
1. Convolutional layers are the major building blocks used in convolutional neural networks.
2. A convolution is the simple application of a filter to an input that results in an activation.
3. Repeated application of the same filter to an input results in a map of activations called a feature map, indicating the locations and strength of a detected feature in an input, such as an image.
4. The innovation of convolutional neural networks is the ability to automatically learn a large number of filters in parallel specific to a training dataset under the constraints of a specific predictive modeling problem, such as image classification.
5. The result is highly specific features that can be detected anywhere on input images.
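As an illustration of points 2 and 3, the small sketch below slides one hypothetical 3 × 3 edge filter over a toy grayscale image and collects the resulting activations into a feature map (the image, the filter and all sizes are invented for this example) :

import numpy as np

def feature_map(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # one filter application = one activation in the feature map
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                        # a vertical edge in the middle of the image
kernel = np.array([[1, 0, -1]] * 3)       # responds strongly to vertical edges
print(feature_map(image, kernel))         # large magnitudes where the edge is located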
Que 4.30. Describe briefly activation function, pooling and fully connected layer.

Answer
Activation function :
1. An activation function is a function that is added into an artificial neural network in order to help the network learn complex patterns in the data.
2. When comparing with a neuron-based model that is in our brains, the activation function is what finally decides what is to be fired to the next neuron.
3. That is exactly what an activation function does in an ANN as well.
4. It takes in the output signal from the previous cell and converts it into some form that can be taken as input to the next cell.
Pooling layer :
1. A pooling layer is a new layer added after the convolutional layer. Specifically, after a non-linearity (for example ReLU) has been applied to the feature maps output by a convolutional layer, the layers in a model may look as follows :
a. Input image
b. Convolutional layer
c. Non-linearity
d. Pooling layer
2. The addition of a pooling layer after the convolutional layer is a common pattern used for ordering layers within a convolutional neural network that may be repeated one or more times in a given model.
3. The pooling layer operates upon each feature map separately to create a new set of the same number of pooled feature maps.
Fully connected layer :
1. Fully connected layers are an essential component of Convolutional Neural Networks (CNNs), which have been proven very successful in recognizing and classifying images for computer vision.
2. The CNN process begins with convolution and pooling, breaking down the image into features and analyzing them independently.
3. The result of this process feeds into a fully connected neural network structure that drives the final classification decision.
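A minimal sketch of the non-linearity and pooling operations described above — ReLU applied element-wise to a feature map, followed by 2 × 2 max pooling. The feature-map values and the pool size are assumptions, not taken from the text :

import numpy as np

def relu(feature_map):
    return np.maximum(feature_map, 0.0)          # the non-linearity (activation)

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2                  # drop odd edges for simplicity
    fm = feature_map[:h, :w]
    # each pooled value is the maximum of a non-overlapping 2 x 2 block
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[ 1.0, -2.0,  3.0,  0.5],
               [-1.0,  4.0, -0.5,  2.0],
               [ 0.0,  1.0,  2.0, -3.0],
               [ 5.0, -1.0,  0.0,  1.0]])
pooled = max_pool_2x2(relu(fm))                  # shape (2, 2), one value per block
print(pooled)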
PART-5
Concept of Convolution (1D and 2D) Layers, Training of Network, Case Study of CNN for eg on Diabetic Retinopathy, Building a Smart Speaker, Self-Driving Car etc.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 4.31. Explain 1D and 2D convolutional neural network.

Answer
1D convolutional neural network :
1. Convolutional Neural Network (CNN) models were developed for image classification, in which the model accepts a two-dimensional input representing an image's pixels and color channels, in a process called feature learning.
2. This same process can be applied to one-dimensional sequences of data.
3. The model extracts features from the sequence data and maps the internal features of the sequence.
4. A 1D CNN is very effective for deriving features from a fixed-length segment of the overall dataset, where it is not so important where the feature is located in the segment.
5. 1D convolutional neural networks work well for :
a. Analysis of a time series of sensor data.
b. Analysis of signal data over a fixed-length period, for example, an audio recording.
c. Natural Language Processing (NLP), although Recurrent Neural Networks which leverage Long Short Term Memory (LSTM) cells are more promising than CNNs, as they take into account the proximity of words to create trainable patterns.
2D convolutional neural network :
1. In a 2D convolutional network, each pixel within the image is represented by its x and y position as well as the depth, representing the image channels (red, green, and blue).
2. The filter moves over the image both horizontally and vertically.
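The 1D case can be sketched in the same spirit as the 2D filter example given earlier : a short filter slides along a fixed-length sequence and produces one output per window. The sensor signal and the averaging filter below are made-up values, used only for illustration :

import numpy as np

def conv1d(signal, kernel):
    k = len(kernel)
    return np.array([
        float(np.dot(signal[i:i + k], kernel))    # one window -> one output value
        for i in range(len(signal) - k + 1)
    ])

signal = np.array([0.0, 0.1, 0.0, 1.0, 1.1, 0.9, 0.0, 0.1, 0.0])
kernel = np.array([1 / 3, 1 / 3, 1 / 3])          # responds to a sustained high segment
print(conv1d(signal, kernel))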
Que 4.32. How do we train a network ? Explain.

Answer
1. Once a network has been structured for a particular application, that network is ready to be trained.
2. To start this process the initial weights are chosen randomly. Then the training, or learning, begins.
3. There are two approaches to training :
a. In supervised training, both the inputs and the outputs are provided. The network then processes the inputs and compares its resulting outputs against the desired outputs.
b. Errors are then propagated back through the system, causing the system to adjust the weights which control the network. This process occurs over and over as the weights are continually tweaked.
c. The set of data which enables the training is called the “training set.” During the training of a network the same set of data is processed many times as the connection weights are continually refined.
d. The other type of training is called unsupervised training. In unsupervised training, the network is provided with inputs but not with desired outputs.
e. The system itself must then decide what features it will use to group the input data. This is often referred to as self-organization or adaptation.
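Steps (a)–(c) of supervised training can be illustrated with a toy sketch : one linear neuron adjusted with the delta rule. The training set, learning rate and number of epochs are assumptions made for this example, not part of the text :

import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])   # training-set inputs
t = np.array([1.0, 1.0, 2.0, 0.0])                               # desired outputs
w = np.zeros(2)                                                   # initial weights (random in practice)
lr = 0.1

for epoch in range(200):              # the same set of data is processed many times
    for x, target in zip(X, t):
        y = w @ x                     # network output for this input
        error = target - y            # compare with the desired output
        w += lr * error * x           # adjust the weights that control the network

print(w)                              # approaches [1, 1] for this toy target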
Que 4.33. Describe diabetic retinopathy on the basis of deep learning.

Answer
1. Diabetic Retinopathy (DR) is one of the major causes of blindness in the western world. Increasing life expectancy, indulgent lifestyles and other contributing factors mean the number of people with diabetes is projected to continue rising.
2. Regular screening of diabetic patients for DR has been shown to be a cost-effective and important aspect of their care.
3. The accuracy and timing of this care is of significant importance to both the cost and effectiveness of treatment.
4. If detected early enough, effective treatment of DR is available, making this a vital process.
5. Classification of DR involves the weighting of numerous features and the location of such features. This is highly time consuming for clinicians.
6. Computers are able to obtain much quicker classifications once trained, giving them the ability to aid clinicians in real-time classification.
7. The efficacy of automated grading for DR has been an active area of research in computer imaging, with encouraging conclusions.
8. Significant work has been done on detecting the features of DR using automated methods such as support vector machines and k-NN classifiers.
9. The majority of these classification techniques are based on two-class classification : DR or no DR.

Que 4.34. How do we recognize a speaker using an artificial neural network ?

Answer
1. With the technology advancements in the smart home sector, voice control and automation are key components that can make a real difference in people's lives.
2. The voice recognition technology market continues to evolve rapidly, as almost all smart home devices provide speaker recognition capability today.
3. However, most of them provide cloud-based solutions or use very deep neural networks for the speaker recognition task, which are not suitable models to run on smart home devices.
4. Here, we compare relatively small Convolutional Neural Networks (CNNs) and evaluate the effectiveness of speaker recognition using these models on edge devices. In addition, we also apply a transfer learning technique to deal with the problem of limited training data.
5. By developing a solution suitable for running inference locally on edge devices, we eliminate the well-known cloud computing issues, such as data privacy and network latency.
6. The preliminary results showed that the chosen model adopts the benefits of a computer vision task by using CNNs and spectrograms to perform speaker classification with precision and recall of about 84 %, in less than 60 ms, on a mobile device with an Atom Cherry Trail processor.
Answer
1. The rapid development of the Internet economy and Artificial Intelligence (AI) has promoted the progress of self-driving cars.
2. The market demand and economic value of self-driving cars are increasingly prominent. At present, more and more enterprises and scientific research institutions have invested in this field. Google, Tesla, Apple, Nissan, Audi, General Motors, BMW, Ford, Honda, Toyota, Mercedes, and Volkswagen have participated in the research and development of self-driving cars.
3. Google is an Internet company which is one of the leaders in self-driving cars, based on its solid foundation in artificial intelligence.
4. In June 2015, two Google self-driving cars were tested on the road. So far, Google vehicles have accumulated more than 3.2 million km of tests, becoming the closest to actual use.
5. Another company that has made great progress in the field of self-driving cars is Tesla. Tesla was the first company to devote self-driving technology to production.
6. Following the Tesla Model series, its “Autopilot” technology has made major breakthroughs in recent years.
7. Although Tesla's Autopilot technology is only regarded as a Level 2 stage by the National Highway Traffic Safety Administration (NHTSA), Tesla shows us that the car has basically realized automatic driving under certain conditions.
5
Reinforcement Learning and Genetic Algorithm

CONTENTS
Part-1 : Introduction to Reinforcement Learning .......... 5–2L to 5–6L
Part-2 : Learning Task, Example of Reinforcement Learning in Practice .......... 5–6L to 5–9L
Part-3 : Learning Models for Reinforcement (Markov Decision Process, Q Learning, Q Learning Function, Q Learning Algorithm), Application of Reinforcement Learning .......... 5–9L to 5–13L
Part-4 : Introduction to Deep Q Learning .......... 5–13L to 5–15L
Part-5 : Genetic Algorithm, Introduction, Components, GA Cycle of Reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Application .......... 5–15L to 5–30L

PART-1
Introduction to Reinforcement Learning.

Questions-Answers

Answer
1. Reinforcement learning is the study of how animals and artificial systems can learn to optimize their behaviour in the face of rewards and punishments.
2. Reinforcement learning algorithms are related to methods of dynamic programming, which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour, and in neurobiological investigations of neuromodulation and addiction.
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the environment. An optimal policy is a policy that maximizes the expected total reward.
5. Without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make.
6. The agent needs to know that something good has happened when it wins and that something bad has happened when it loses.
7. This kind of feedback is called a reward or reinforcement.
8. Reinforcement learning is valuable in the field of robotics, where the tasks to be performed are frequently complex enough to defy encoding as programs and no training data is available.
9. In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels.
Que 5.3. What is reinforcement learning ? Explain passive reinforcement learning and active reinforcement learning.

Answer
Reinforcement learning : Refer Q. 5.1, Page 5–2L, Unit-5.
9. Each state percept is subscripted with the reward received. The object is to use the information about rewards to learn the expected utility U(s) associated with each non-terminal state s.
10. The utility is defined to be the expected sum of (discounted) rewards obtained if the policy π is followed :
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ], with S_0 = s.
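The quantity in point 10 can be computed directly for one observed run of states. The helper below is only illustrative; the reward sequence and the discount factor are made-up values :

def discounted_return(rewards, gamma=0.9):
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r      # gamma**t * R(s_t)
        weight *= gamma
    return total

print(discounted_return([-0.04, -0.04, -0.04, 1.0]))   # a short run ending in reward +1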
Que 5.7. Explain different machine learning tasks.

Answer
Following are the most common machine learning tasks :
1. Data preprocessing : Before starting to train the models, it is important to prepare the data appropriately. As part of data preprocessing the following is done :
a. Data cleaning
b. Handling missing data
2. Exploratory data analysis : Once data is preprocessed, the next step is to perform exploratory data analysis to understand the data distribution and the relationships between / within the data.

Que 5.8. Explain reinforcement learning with the help of an example.

Answer
1. Reinforcement learning (RL) is learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
2. The software agent is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
For example, consider the scenario of teaching new tricks to a cat :
1. As the cat does not understand English or any other human language, we cannot tell her directly what to do. Instead, we follow a different strategy.
2. We emulate a situation, and the cat tries to respond in many different ways. If the cat's response is the desired way, we will give her fish.
3. Now whenever the cat is exposed to the same situation, the cat executes a similar action even more enthusiastically in expectation of getting more reward (food).
4. That is the kind of learning about "what to do" that the cat gets from positive experiences.
5. At the same time, the cat also learns what not to do when faced with negative experiences.
Working of reinforcement learning :
1. In this case, the cat is an agent that is exposed to the environment (in this case, your house). An example of a state could be the cat sitting, while we use a specific cue word for the cat to walk.
2. Our agent reacts by performing an action, making a transition from one “state” to another “state”.
3. For example, the cat goes from sitting to walking.
4. The reaction of an agent is an action, and the policy is a method of selecting an action given a state in expectation of better outcomes.
5. After the transition, the agent may get a reward or a penalty in return.
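The agent–environment loop described above can be written out as a bare-bones sketch. The two-state “sitting/walking” environment and the random policy below are invented purely for illustration :

import random

actions = ["stay", "walk"]

def environment(state, action):
    next_state = "walking" if action == "walk" else "sitting"
    reward = 1.0 if next_state == "walking" else 0.0    # fish for the desired behaviour
    return next_state, reward

def policy(state):
    return random.choice(actions)        # a real agent improves this from the rewards

state, total_reward = "sitting", 0.0
for step in range(5):
    action = policy(state)                    # agent acts ...
    state, reward = environment(state, action)  # ... environment returns new state and reward
    total_reward += reward
print(total_reward)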
PART-3
Learning Models for Reinforcement (Markov Decision Process, Q Learning, Q Learning Function, Q Learning Algorithm), Application of Reinforcement Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.9. Describe the important terms used in the reinforcement learning method.

Answer
Following are the terms used in reinforcement learning :
Agent : It is an assumed entity which performs actions in an environment to gain some reward.
i. Environment (e) : A scenario that an agent has to face.
ii. Reward (R) : An immediate return given to an agent when he or she performs a specific action or task.
iii. State (s) : State refers to the current situation returned by the environment.
iv. Policy (π) : It is a strategy applied by the agent to decide the next action based on the current state.
v. Value (V) : It is the expected long-term return with discount, as compared to the short-term reward.
vi. Value function : It specifies the value of a state, that is, the total amount of reward an agent should expect beginning from that state.
vii. Model of the environment : This mimics the behaviour of the environment. It helps to make inferences and also to determine how the environment will behave.
viii. Model-based methods : These are methods for solving reinforcement learning problems that make use of such a model of the environment.
ix. Q value or action value (Q) : Q value is quite similar to value. The only difference between the two is that it takes an additional parameter, the current action.
Que 5.10. Explain the approaches used to implement a reinforcement learning algorithm.

Answer
There are three approaches used to implement a reinforcement learning algorithm :
1. Value-based :
a. In a value-based reinforcement learning method, we try to maximize a value function V(s). In this method, the agent expects a long-term return of the current states under policy π.
2. Policy-based :
a. In a policy-based RL method, we try to come up with a policy such that the action performed in every state helps to gain maximum reward in the future.
b. Two types of policy-based methods are :
i. Deterministic : For any state, the same action is produced by the policy π.
ii. Stochastic : Every action has a certain probability, determined by the stochastic policy
π(a | s) = P[A_t = a | S_t = s].
3. Model-based :
a. In this reinforcement learning method, we need to create a virtual model for each environment.
b. The agent learns to perform in that specific environment.
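The difference between the two policy types in point 2 can be shown with a small sketch; the probability table used for π(a | s) is an assumption invented for this example :

import numpy as np

pi = {"s0": {"left": 0.2, "right": 0.8},     # stochastic policy for state s0
      "s1": {"left": 0.9, "right": 0.1}}

def deterministic_action(state):
    return max(pi[state], key=pi[state].get)            # always the same action for a state

def stochastic_action(state, rng=np.random.default_rng()):
    actions, probs = zip(*pi[state].items())
    return rng.choice(actions, p=probs)                  # sampled according to pi(a | s)

print(deterministic_action("s0"), stochastic_action("s0"))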
Correct path, Wrong path and End. First, let us initialize the values at 0.
Step 2 : Choose an action.
Step 3 : Perform an action : The combination of steps 2 and 3 is performed for an undefined amount of time. These steps run until the training is stopped, or until the training loop is stopped as defined in the code.
a. First, an action (a) in the state (s) is chosen based on the Q-table. Note that, when the episode initially starts, every Q-value should be 0.
b. Then, update the Q-values for being at the start and moving right using the Bellman equation.
Step 4 : Measure reward : Now we have taken an action and observed an outcome and a reward.
Step 5 : Evaluate : We need to update the function Q(s, a).
This process is repeated again and again until the learning is stopped. In this way the Q-table is updated and the value function Q is maximized. Here Q(s, a) returns the expected future reward of that action at that state.
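The steps above correspond to the following compact tabular Q-learning sketch. The five-cell corridor environment, the learning rate, the discount factor and the purely random exploration are illustrative assumptions, not the text's example :

import numpy as np

n_states = 5                              # a tiny corridor; state 4 is the goal
n_actions = 2                             # 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))       # Step 1 : every Q-value starts at 0
alpha, gamma = 0.5, 0.9
rng = np.random.default_rng(0)

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(200):
    state = 0
    for _ in range(50):                                   # Steps 2-3 : choose and perform actions
        action = int(rng.integers(n_actions))             # explore randomly; Q-learning is off-policy
        next_state, reward = step(state, action)          # Step 4 : measure the reward
        # Step 5 : Bellman update  Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if state == n_states - 1:
            break

print(Q)                                  # "move right" ends up with the higher value in every state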
PART-4
Introduction to Deep Q Learning.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.16. Describe deep Q-learning.

Answer
1. In deep Q-learning, we use a neural network to approximate the Q-value function.
2. The state is given as the input and the Q-value of all possible actions is generated as the output.
3. The comparison between Q-learning and deep Q-learning is illustrated below :
[Fig. 5.16.1 : In Q-learning, a Q-table maps each state to the Q-values of action 1, action 2, …, action N; in deep Q-learning, a neural network takes the state as input and outputs these Q-values.]
4. On a higher level, deep Q-learning works as follows :
i. Gather and store samples in a replay buffer with the current policy.
ii. Randomly sample batches of experiences from the replay buffer.
iii. Use the sampled experiences to update the Q-network.
iv. Repeat steps 1–3.

Que 5.17. What are the steps involved in a deep Q-learning network ?

Answer
Steps involved in reinforcement learning using deep Q-learning networks :
1. All the past experience is stored by the user in memory.
2. The next action is determined by the maximum output of the Q-network.
3. The loss function here is the mean squared error of the predicted Q-value and the target Q-value Q*. This is basically a regression problem.
4. However, we do not know the target or actual value here, as we are dealing with a reinforcement learning problem. Going back to the Q-value update equation derived from the Bellman equation, we have :
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) – Q(S_t, A_t) ]

Que 5.18. Write pseudocode for deep Q-learning.

Answer
Start with Q_0(s, a) for all s, a.
Get initial state s.
For k = 1, 2, … till convergence
    Sample action a, get next state s′
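A minimal, hedged sketch of the loop that this pseudocode begins to describe is given below. To keep every step visible it uses a simple table-like linear approximator in place of a deep Q-network and omits the separate target network used in practice; the corridor environment and all constants are assumptions made for the example :

import random
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
W = np.zeros((n_states, n_actions))       # stand-in "network" : Q(s, a) = W[s, a]
replay = []                               # 1. all past experience is stored in memory
rng = random.Random(0)

def q_values(state):
    return W[state]                       # network output : the Q-value of every action

def env_step(state, action):
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

for episode in range(300):
    state = 0
    for _ in range(30):
        action = rng.randrange(n_actions)                 # exploratory behaviour policy
        nxt, reward = env_step(state, action)
        replay.append((state, action, reward, nxt))       # i. gather samples in the replay buffer
        state = nxt
        if state == n_states - 1:
            break
    batch = rng.sample(replay, min(16, len(replay)))      # ii. randomly sample a batch
    for s, a, r, s2 in batch:                             # iii. update the Q "network"
        target = r if s2 == n_states - 1 else r + gamma * q_values(s2).max()
        td_error = target - q_values(s)[a]                # squared-error (MSE) loss on this term
        W[s, a] += 0.1 * td_error                         # one gradient step towards the target

print(np.round(W, 2))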
PART-5
Genetic Algorithm, Introduction, Components, GA Cycle of Reproduction, Crossover, Mutation, Genetic Programming, Models of Evolution and Learning, Application.

Questions-Answers
Long Answer Type and Medium Answer Type Questions

Que 5.19. Write short note on Genetic algorithm.

Answer
1. Genetic algorithms are computerized search and optimization algorithms based on the mechanics of natural genetics and natural selection.
2. These algorithms mimic the principle of natural genetics and natural selection to construct search and optimization procedures.
3. Genetic algorithms convert the design space into genetic space. Design space is a set of feasible solutions.
4. Genetic algorithms work with a coding of variables.
5. The advantage of working with a coding of the variable space is that coding discretizes the search space even though the function may be continuous.
6. Search space is the space of all possible feasible solutions of a particular problem.
7. Following are the benefits of genetic algorithms :
a. They are robust.
b. They provide optimization over a large space state.
c. They do not break on a slight change in input or in the presence of noise.
8. Following are the applications of genetic algorithms :
a. Recurrent neural network

Answer
Procedure of Genetic algorithm :
1. Generate a set of individuals as the initial population.
2. Use genetic operators such as selection or crossover.
3. Apply mutation or digital reverse if necessary.
4. Evaluate the fitness function of the new population.
5. Use the fitness function for determining the best individuals and replace predefined members from the original population.
6. Iterate steps 2–5 and terminate when some predefined population threshold is met.
Advantages of genetic algorithm :
1. Genetic algorithms can be executed in parallel. Hence, genetic algorithms are faster.
2. They are useful for solving optimization problems.
Disadvantages of genetic algorithm :
1. Identification of the fitness function is difficult, as it depends on the problem.
2. The selection of suitable genetic operators is difficult.

Que 5.21. Explain different phases of genetic algorithm.

Answer
Different phases of genetic algorithm are :
1. Initial population :
a. The process begins with a set of individuals which is called a population.
b. Each individual is a solution to the problem we want to solve.
c. An individual is characterized by a set of parameters (variables) known as genes.
d. Genes are joined into a string to form a chromosome (solution).
e. In a genetic algorithm, the set of genes of an individual is represented using a string.
[Figure : a population of bit-string chromosomes, e.g., A2 = 1 1 1 1 1 1, A3 = 1 0 1 0 1 1, A4 = 1 1 0 1 1 0; each bit is a gene, a full string is a chromosome, and the set of strings forms the population.]
2. FA (Factor Analysis) fitness function :
a. The fitness function determines how fit an individual is (the ability of an individual to compete with other individuals).
b. It gives a fitness score to each individual.
c. The probability that an individual will be selected for reproduction is based on its fitness score.
3. Selection :
a. The idea of the selection phase is to select the fittest individuals and let them pass their genes to the next generation.
b. Two pairs of individuals (parents) are selected based on their fitness scores.
c. Individuals with high fitness have more chance to be selected for reproduction.
4. Crossover :
a. Crossover is the most significant phase in a genetic algorithm.
b. For each pair of parents to be mated, a crossover point is chosen at random from within the genes.
c. For example, consider the crossover point to be 3, as shown :
A1 0 0 0 0 0 0
A2 1 1 1 1 1 1
Crossover point (after the third gene)
d. Offspring are created by exchanging the genes of the parents among themselves until the crossover point is reached.
e. The new offspring are added to the population :
A5 1 1 1 0 0 0
A6 0 0 0 1 1 1
5. Mutation :
a. When new offspring are formed, some of their genes can be subjected to a mutation with a low random probability.
b. This implies that some of the bits in the bit string can be flipped.
Before mutation : A5 1 1 1 0 0 0
After mutation : A5 1 1 0 1 1 0
c. Mutation occurs to maintain diversity within the population and prevent premature convergence.
6. Termination :
a. The algorithm terminates if the population has converged (does not produce offspring which are significantly different from the previous generation).
b. Then it is said that the genetic algorithm has provided a set of solutions to our problem.

Que 5.22. Draw a flowchart of GA and explain the working principle.

Answer
Genetic algorithm : Refer Q. 1.24, Page 1–23L, Unit-1.
Working principle :
1. To illustrate the working principle of GA, we consider an unconstrained optimization problem.
2. Let us consider the following maximization problem :
maximize f(X),  X_i^(L) ≤ X_i ≤ X_i^(U)  for i = 1, 2, ..., N,
3. If we want to minimize f(X), for f(X) > 0, then we can write the objective function as :
maximize 1 / (1 + f(X))
4. If f(X) < 0, instead of minimizing f(X), maximize {–f(X)}. Hence, both maximization and minimization problems can be handled by GA.

Que 5.23. Write short notes on procedures of GA.

Answer
1. Start : Generate a random population of n chromosomes.
2. Fitness : Evaluate the fitness f(x) of each chromosome x in the population.
3. New population : Create a new population by repeating the following steps until the new population is complete.
a. Selection : Select two parent chromosomes from a population according to their fitness.
b. Crossover : With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
c. Mutation : With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
d. Accepting : Place the new offspring in the new population.
4. Replace : Use the newly generated population for a further run of the algorithm.
5. Test : If the end condition is satisfied, stop, and return the best solution in the current population.
6. Otherwise, go to step 2.
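Steps 1–6 can be put together into one compact sketch for a made-up problem : maximise the number of 1s in an 8-bit chromosome ("one-max"). The population size, the crossover and mutation probabilities and the two-candidate parent selection are assumptions chosen for this illustration :

import random

rng = random.Random(0)
N_BITS, POP_SIZE, P_CROSS, P_MUT = 8, 20, 0.9, 0.05

def fitness(chrom):                      # 2. Fitness
    return sum(chrom)

def select(pop):                         # 3a. Selection (better of two random picks)
    a, b = rng.choice(pop), rng.choice(pop)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                   # 3b. Crossover at a random point
    if rng.random() < P_CROSS:
        point = rng.randrange(1, N_BITS)
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1[:], p2[:]                  # otherwise children are copies of the parents

def mutate(chrom):                       # 3c. Mutation at each locus
    return [bit ^ 1 if rng.random() < P_MUT else bit for bit in chrom]

population = [[rng.randrange(2) for _ in range(N_BITS)] for _ in range(POP_SIZE)]  # 1. Start

for generation in range(50):             # 4. / 6. run the cycle again and again
    new_population = []
    while len(new_population) < POP_SIZE:
        c1, c2 = crossover(select(population), select(population))
        new_population += [mutate(c1), mutate(c2)]       # 3d. Accepting
    population = new_population
    best = max(population, key=fitness)
    if fitness(best) == N_BITS:           # 5. Test : end condition satisfied
        break

print(best, fitness(best))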
Que 5.24. What are the benefits of using GA ? What are its limitations ?

Answer
Benefits of using GA :
1. It is easy to understand.
2. It is modular and separate from the application.
3. It supports multi-objective optimization.
4. It is good for a noisy environment.
Limitations of genetic algorithms are :
1. The problem of identifying the fitness function.
2. Definition of a representation for the problem.
3. Premature convergence occurs.
4. The problem of choosing the various parameters like the size of the population, mutation rate, crossover rate, the selection method and its strength.
5. Cannot use gradients.
6. Cannot easily incorporate problem-specific information.
7. Not good at identifying local optima.
8. No effective terminator.
9. Not effective for smooth unimodal functions.
10. Needs to be coupled with a local search technique.

Que 5.25. Write short notes on genetic representations.

Answer
1. Genetic representation is a way of representing solutions/individuals in evolutionary computation methods.
2. Genetic representation can encode appearance, behaviour and physical qualities of individuals.
3. All the individuals of a population are represented by using binary encoding, permutational encoding, or encoding by tree.
4. Genetic algorithms use linear binary representations. The most standard method of representation is an array of bits.
5. These genetic representations are convenient because parts of an individual are easily aligned due to their fixed size, which makes the crossover operation simple.

Que 5.26. Give the detail of genetic representation (Encoding).
OR
Explain different types of encoding in genetic algorithm.

Answer
Genetic representations :
1. Encoding :
a. Encoding is a process of representing individual genes.
b. The process can be performed using bits, numbers, trees, arrays, lists or any other objects.
c. The encoding depends mainly on the problem being solved.
2. Binary encoding :
a. Binary encoding is the most commonly used method of genetic representation because GAs use this type of encoding.
b. In permutation encoding, every chromosome is a string of numbers which represents a position in a sequence.
Chromosome A : 1 5 3 2 6 4 7 9 8
Chromosome B : 8 5 6 7 2 3 1 4 9
5. Value encoding :
a. Direct value encoding can be used in problems where some complicated values, such as real numbers, are used.
b. In value encoding, every chromosome is a string of some values.
c. Values can be anything connected to the problem : real numbers, characters, or more complicated objects.
Chromosome A : 1.2324 5.3243 0.4556 2.3293 2.4545
Chromosome B : ABDJEIFJDHDIERJFDLDFLFEGT
i. Roulette-wheel selection is the proportionate reproductive method where a string is selected from the mating pool with a probability proportional to its fitness.
ii. Thus, the i-th string in the population is selected with a probability proportional to F_i, where F_i is the fitness value for that string.
iii. Since the population size is usually kept fixed in a genetic algorithm, the sum of the probabilities of each string being selected for the mating pool must be one.
iv. The probability of the i-th string being selected is
p_i = F_i / Σ_{j=1}^{n} F_j
where n is the population size.
v. The average fitness is
F_avg = ( Σ_{j=1}^{n} F_j ) / n
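Roulette-wheel selection as defined in points i–iv can be sketched as follows; the population and the fitness values are made up for the example :

import random

def roulette_select(population, fitnesses, rng=random):
    total = sum(fitnesses)
    r = rng.uniform(0, total)            # spin the wheel
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if running >= r:
            return individual            # slice sizes are proportional to fitness F_i
    return population[-1]

pop = ["A1", "A2", "A3", "A4"]
fit = [1.0, 4.0, 3.0, 2.0]               # A2 is selected about 40 % of the time
print([roulette_select(pop, fit) for _ in range(10)])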
Que 5.31. What is the main function of crossover operation in genetic algorithm ?

Answer
1. Crossover is the basic operator of a genetic algorithm. The performance of a genetic algorithm depends on the crossover operator.
2. The type of crossover operator used for a problem depends on the type of encoding used.
3. The basic principle of the crossover process is to exchange the genetic material of two parents beyond the crossover points.
Function of the crossover operation/operator in genetic algorithm :
1. The main function of the crossover operator is to introduce diversity in the population.
2. A specific crossover made for a specific problem can improve the performance of the genetic algorithm.
3. Crossover combines parental solutions to form offspring with the hope of producing better solutions.
4. Crossover operators are critical in ensuring good mixing of building blocks.
5. Crossover is used to maintain a balance between exploitation and exploration. The exploitation and exploration techniques are responsible for the performance of genetic algorithms. Exploitation means using the already existing information to find a better solution, and exploration means investigating new and unknown solutions in the exploration space.
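The exchange of genetic material beyond the crossover point can be sketched directly on the bit-string example used in Que 5.21 (parents 0 0 0 0 0 0 and 1 1 1 1 1 1, crossover point 3) :

def single_point_crossover(parent1, parent2, point):
    # genes beyond the crossover point are exchanged between the parents
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

A1 = [0, 0, 0, 0, 0, 0]
A2 = [1, 1, 1, 1, 1, 1]
A5, A6 = single_point_crossover(A2, A1, 3)
print(A5, A6)    # [1, 1, 1, 0, 0, 0] and [0, 0, 0, 1, 1, 1], the offspring of Que 5.21

A bit-flip mutation with a low probability (as in phase 5 of Que 5.21) would then be applied to each offspring before it is accepted into the new population.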
Que 5.32. Discuss the different applications of genetic algorithms.

Answer
Applications of GA :
1. Optimization : Genetic algorithms are most commonly used in optimization problems wherein we have to maximize or minimize a given objective function value under a given set of constraints.
2. Economics : GAs are also used to characterize various economic models like the cobweb model, game theory equilibrium resolution, asset pricing, etc.
3. Neural networks : GAs are also used to train neural networks, particularly recurrent neural networks.
4. Parallelization : GAs also have very good parallel capabilities, prove to be a very effective means of solving certain problems, and also provide a good area for research.
5. Image processing : GAs are used for various digital image processing (DIP) tasks as well, like dense pixel matching.
6. Machine learning : Genetics-based machine learning (GBML) is a niche area in machine learning.
7. Robot trajectory generation : GAs have been used to plan the path which a robot arm takes by moving from one point to another.

Que 5.33. Explain optimization of travelling salesman problem using genetic algorithm and give a suitable example too.

Answer
1. The TSP consists of a number of cities, where each pair of cities has a corresponding distance.
[Fig. 5.33.1 : Genetic algorithm procedure for TSP — Start → Set GA parameters → Generate initial random population → Evaluate fitness of each chromosome in the population → Are optimization termination criteria met ? If yes, output the best chromosome and End; if no, form a new population by parent selection for the next generation, crossover of the parents' chromosomes and mutation of the chromosomes, then re-evaluate fitness.]
2. The aim is to visit all the cities such that the total distance travelled is minimized.
3. A solution, and therefore a chromosome which represents that solution to the TSP, can be given as an order, that is, a path, of the cities.
4. The procedure for solving TSP can be viewed as the process flow given in Fig. 5.33.1.
5. The GA process starts by supplying important information such as the locations of the cities, the maximum number of generations, the population size, the probability of crossover and the probability of mutation.
6. An initial random population of chromosomes is generated and the fitness of each chromosome is evaluated.
7. The population is then transformed into a new population (the next generation) using three genetic operators : selection, crossover and mutation.
8. The selection operator is used to choose two parents from the current generation in order to create a new child by crossover and/or mutation.
9. The new generation contains a higher proportion of the characteristics possessed by the good members of the previous generation; in this way good characteristics are spread over the population and mixed with other good characteristics.
10. After each generation, a new set of chromosomes whose size is equal to the initial population size is evolved.
11. This transformation process from one generation to the next continues until the population converges to the optimal solution, which usually occurs when a certain percentage of the population (for example 90 %) has the same optimal chromosome, in which case the best individual is taken as the optimal solution.
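The chromosome described in point 3 — a tour as an ordering of the cities — and its fitness (the total distance travelled) can be sketched as follows; the city coordinates are made up for this example :

import math

cities = [(0, 0), (2, 0), (2, 2), (0, 2)]          # hypothetical city locations

def tour_length(tour, cities):
    total = 0.0
    for i, city in enumerate(tour):
        nxt = tour[(i + 1) % len(tour)]            # return to the start city at the end
        total += math.dist(cities[city], cities[nxt])
    return total                                    # the GA tries to minimise this value

print(tour_length([0, 1, 2, 3], cities))           # 8.0 for the square visited in order
print(tour_length([0, 2, 1, 3], cities))           # a worse (longer) tour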
Answer
1. A genetic algorithm is usually said to converge when there is no significant improvement in the values of fitness of the population from one generation to the next.
2. One criterion for convergence may be that when a fixed percentage of columns and rows in the population matrix becomes the same, it can be assumed that convergence is attained. The fixed percentage may be 80 % or 85 %.
3. In genetic algorithms, as we proceed with more generations, there may not be much improvement in the population fitness and the best individual may not change for subsequent populations.
4. As the generations progress, the population gets filled with more fit individuals with only slight deviation from the fitness of the best individuals so far found, and the average fitness comes very close to the fitness of the best individuals.
5. The convergence criteria can be explained from the schema point of view.
6. A schema is a similarity template describing a subset of strings with similarities at certain positions. A schema represents a subset of all possible strings that have the same bits at certain string positions.
7. Since a schema represents a set of strings, we can associate a fitness value with a schema, i.e., the average fitness of the schema.
8. One can visualize the GA's search for the optimal strings as a simultaneous competition among schemata to increase the number of their instances in the population.
1
Introduction
(2 Marks Questions)

1.1. Define machine learning.
Ans. Machine learning is an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

1.2. What are the different types of machine learning algorithm ?
Ans. Different types of machine learning algorithm are :
1. Supervised machine learning algorithm
2. Unsupervised machine learning algorithm
3. Semi-supervised machine learning algorithm
4. Reinforcement machine learning algorithm

1.3. What are the applications of machine learning ?
Ans. Applications of machine learning are :
1. Image recognition
2. Speech recognition
3. Medical diagnosis
4. Statistical arbitrage
5. Learning association

1.4. What are the advantages of machine learning ?
Ans. Advantages of machine learning :
1. Easily identifies trends and patterns.
2. No human intervention is needed.
3. Continuous improvement.
4. Handling multi-dimensional and multi-variety data.

1.5. What are the disadvantages of machine learning ?
Ans. Disadvantages of machine learning :
1. Data acquisition
2. Time and resources
3. Interpretation of results
4. High error-susceptibility

Ans. Role of machine learning in human life :
1. Learning
2. Reasoning
3. Problem solving
4. Language understanding

1.7. What are the components of machine learning system ?
Ans. Components of machine learning system are :
1. Sensing
2. Segmentation
3. Feature extraction
4. Classification
5. Post processing

1.8. What are the classes of problem in machine learning ?
Ans. Classes of problem in machine learning are :
1. Classification
2. Regression
3. Clustering
4. Rule extraction

1.9. What are the issues related with machine learning ?
Ans. Issues related with machine learning are :
1. Data quality
2. Transparency
3. Traceability
4. Reproduction of results

1.10. Define supervised learning.
Ans. Supervised learning is also known as associative learning, in which the network is trained by providing it with input and matching output patterns.

1.11. Define unsupervised learning.
Ans. Unsupervised learning is also known as self-organization, in which an output unit is trained to respond to clusters of patterns within the input.

1.12. Define well defined learning problem.
Ans. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
1.13. What are the features of learning problems ?
Ans. Features of learning problems are :
1. The class of tasks (T).
2. The measure of performance to be improved (P).
3. The source of experience (E).

1.14. Define decision tree learning.
Ans. Decision tree learning is one of the predictive modeling approaches used in statistics, data mining and machine learning. It uses a decision tree to go from observations about an item to conclusions about the item's target values.

1.15. What is decision tree ?
Ans. A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs and utility.

1.19. What are the issues related with the decision tree ?
Ans. Issues related with decision tree are :
1. Missing data
2. Multi-valued attribute
3. Continuous and integer valued input attributes
4. Continuous-valued output attributes

1.20. What are the attribute selection measures used in decision tree ?
Ans. Attribute selection measures used in decision tree are :
1. Entropy
2. Information gain
3. Gain ratio
2

2.1. Define the term regression.
Ans. Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable and a series of other variables (known as independent variables).

2.2. What are the types of regression ?
Ans. Following are the types of regression :
1. Linear regression
2. Logistic regression

2.3. Define logistic regression.
Ans. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, which means there would be only two possible classes.

2.4. What are the types of logistic regression ?
Ans. Following are the types of logistic regression :
1. Binary or Binomial logistic regression
2. Multinomial logistic regression
3. Ordinal logistic regression

2.5. Define Bayesian decision theory.
Ans. Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions.

Ans. Bayesian belief networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.

2.8. What are the usages of EM algorithm ?
Ans. Usages of EM algorithm are :
1. It can be used to fill the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
4. It can be used for discovering the values of latent variables.

2.9. What are the advantages of EM algorithm ?
Ans. Advantages of EM algorithm are :
1. It is always guaranteed that likelihood will increase with each iteration.
2. The E-step and M-step are easy to implement.
3. Solutions to the M-steps exist in the closed form.

2.10. What are the disadvantages of EM algorithm ?
Ans. Disadvantages of EM algorithm are :
1. It has slow convergence.
2. It makes convergence to the local optima only.
3. It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).

2.11. Define support vector machine.
Ans. A support vector machine is a supervised machine learning algorithm that looks at data and sorts and analyzes it for classification and regression analysis.
Ans. Types of support vector machine are :
1. Linear support vector machine
2. Non-linear support vector machine

2.13. What are the applications of SVM ?
Ans. Applications of SVM :
1. Text and hypertext classification
2. Image classification
3. Recognizing handwritten characters
4. Biological sciences, including protein classification

3
Decision Tree Learning
(2 Marks Questions)

3.1. What is instance-based learning ?
Ans. Instance-Based Learning (IBL) is an extension of nearest neighbour or KNN classification algorithms that does not maintain a set of abstractions of models created from the instances.

used to generate predictions of goal feature values for subsequently presented cases.

3.7. What are the disadvantages of CBL (Case-Based Learning) ?
Ans. Disadvantages of case-based learning algorithms :
1. They are computationally expensive because they save and compute similarities to all training cases.
2. They are intolerant of noise and irrelevant features.
3. They are sensitive to the choice of the algorithm's similarity function.
4. There is no simple way they can process symbolic valued feature values.

3.8. What are the functions of CBL ?
Ans. Functions of case-based learning algorithm are :
1. Pre-processor
2. Similarity
3. Prediction
4. Memory updating

3.12. What are the advantages of instance-based learning ?
Ans. Advantages of instance-based learning :
1. Learning is trivial
2. Works efficiently
3. Noise resistant
4. Rich representation, arbitrary decision surfaces
5. Easy to understand

3.13. What are the disadvantages of instance-based learning ?
Ans. Disadvantages of instance-based learning :
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ Rn.
4. Implicit weights of attributes (need normalization).
5. Need large space for storage, i.e., require large memory.
6. Expensive application time.
4
Artificial Neural Network
(2 Marks Questions)

4.1. What are neurons ?
Ans. A neuron is a small cell that receives electro-chemical signals from its various sources and in return responds by transmitting electrical impulses to other neurons.

4.2. What is artificial neural network ?
Ans. Artificial neural networks are computational algorithms intended to simulate the behaviour of biological systems composed of neurons.

4.3. Give the difference between supervised and unsupervised learning in artificial neural network.
Ans.
S. No. | Supervised learning | Unsupervised learning
1. | It uses known and labeled data as input. | It uses unknown data as input.
2. | It uses offline analysis. | It uses real time analysis of data.
3. | Number of classes is known. | Number of classes is not known.
4. | Accurate and reliable results. | Moderately accurate and reliable results.

4.4. Define activation function.
Ans. An activation function is the basic element in a neural model. It is used for limiting the amplitude of the output of a neuron. It is also called a squashing function.

4.5. Give types of activation function.
Ans. Types of activation function :
1. Signum function
2. Sigmoidal function
3. Identity function
4. Binary step function
5. Bipolar step function

4.6. Give advantages of neural network.
Ans. Advantages of neural network :
1. A neural network can perform tasks that a linear program cannot.
2. It can be implemented in any application.
3. A neural network learns and does not need to be reprogrammed.

4.7. What are disadvantages of neural network (NN) ?
Ans. Disadvantages of neural network :
1. The neural network needs training to operate.
2. It requires high processing time for large NNs.

4.8. List the various types of soft computing techniques and mention some application areas for neural network.
Ans. Types of soft computing techniques :
1. Fuzzy logic control
2. Neural network
3. Genetic algorithms
4. Support vector machine
Application areas for neural network :
1. Speech recognition
2. Character recognition
3. Signature verification application
4. Human face recognition

4.9. Draw a biological NN and explain the parts.
Ans.
1. Biological neural networks are made up of real biological neurons that are connected in the peripheral nervous system.
2. In general a biological neural network is composed of a group of chemically connected or functionally associated neurons.
[Fig. 4.9.1 : A biological neuron with dendrites, cell body (soma) and axon, mapped onto the input, hidden and output parts of a neural network.]
Ans.
1. Self Organizing Map (SOM) provides a data visualization technique which helps to understand high dimensional data by reducing the dimensions of data to a map.
2. SOM also represents the clustering concept by grouping similar data together.

5
Reinforcement Learning
(2 Marks Questions)

Ans. Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems that humans do not know how to solve.

5.6. What are the advantages of genetic programming ?
Ans. Advantages of genetic programming are :
1. In GP, the number of possible programs that can be constructed by the algorithm is immense.
2. GP uses machine code, which helps in providing results very fast; if a high-level language is used instead, it needs to be compiled, which can generate errors and slow the program down.
3. There is a high probability that even a very small variation has a disastrous effect on the fitness of the solution generated.

5.7. What are the disadvantages of genetic programming ?
Ans. Disadvantages of genetic programming are :
1. It does not impose any fixed length of solution, so the maximum length can be extended up to hardware limits.
2. In genetic programming it is not necessary for an individual to have maximum knowledge of the problem and their solutions.

5.9. What are the functions of learning in evolution ?
Ans. Functions of learning in evolution :
1. It allows individuals to adapt to changes in the environment that occur in the life span of an individual or across a few generations.
2. It allows evolution to use information extracted from the environment, thereby channeling evolutionary search.
3. It can help and guide evolution.

5.10. What are the disadvantages of learning in evolution ?
Ans. Disadvantages of learning in evolution are :
1. A delay in the ability to acquire fitness.
2. Increased unreliability.

5.11. Define learnable evolution model.
Ans. Learnable Evolution Model (LEM) is a non-Darwinian methodology for evolutionary computation that employs machine learning to guide the generation of new individuals (candidate problem solutions).

5.12. What are different phases of genetic algorithm ?
Ans. Different phases of genetic algorithm are :
1. Initial population
2. FA (Factor Analysis) fitness function
3. Selection
4. Crossover
5. Mutation
6. Termination

Ans. Properties of heuristic search are :
1. Admissibility condition
2. Completeness condition
3. Dominance properties
4. Optimality property

5.16. What are different types of reinforcement learning ?