Machine Learning Complete Notes
Srikanth Mandela, M.Tech., (Ph.D), Asst. Prof., Dept. of CSE, SIET
OBJECTIVES:
UNIT -I
The ingredients of machine learning, Tasks: the problems that can be solved with
machine learning, Models: the output of machine learning, Features, the workhorses of
machine learning. Binary classification and related tasks: Classification, Scoring and
ranking, Class probability estimation
UNIT- II
Concept learning: The hypothesis space, Paths through the hypothesis space, Beyond
conjunctive concepts
UNIT- III
Tree models: Decision trees, Ranking and probability estimation trees, Tree learning as
variance reduction.
Rule models: Learning ordered rule lists, Learning unordered rule sets, Descriptive rule
learning, First-order rule learning
UNIT –IV
Linear models: The least-squares method, The perceptron: a heuristic learning algorithm
for linear classifiers, Support vector machines, Obtaining probabilities from linear
classifiers, Going beyond linearity with kernel methods.
Distance Based Models: Introduction, Neighbours and exemplars, Nearest Neighbours
classification, Distance Based Clustering, Hierarchical Clustering.
UNIT- V
Probabilistic models: The normal distribution and its geometric interpretations, Probabilistic
models for categorical data, Discriminative learning by optimizing conditional likelihood,
Probabilistic models with hidden variables.
Features: Kinds of features, Feature transformations, Feature construction and selection. Model
ensembles: Bagging and random forests, Boosting
UNIT- VI
TEXT BOOKS:
1. Machine Learning: The art and science of algorithms that make sense of data, Peter Flach,
Cambridge.
2. Machine Learning, Tom M. Mitchell, MGH.
REFERENCE BOOKS:
Machine Learning
Machine learning refers to algorithms that have the ability to learn from past experience.
Machine learning combines data with statistical tools to predict an output. This output is
then used by businesses to derive actionable insights.
Machine learning is closely related to data mining and Bayesian predictive modeling. The
machine receives data as input and uses an algorithm to formulate answers.
A typical machine learning task is to provide a recommendation. For those who have a
Netflix account, all recommendations of movies or series are based on the user's
historical data.
Machine learning is also used for a variety of tasks like fraud detection, predictive
maintenance, portfolio optimization and so on.
Machine learning is only one component of a larger application; the learned model is typically embedded in other programs.
Machine Learning: Machine learning is meant to overcome the limitation of hand-written rules in traditional programming.
The machine learns how the input and output data are correlated and writes a rule accordingly. The programmers do not
need to write new rules each time there is new data. The algorithms adapt in response to new data and experiences to
improve efficacy over time.
Supervised learning
Supervised learning is a type of machine learning in which the machine is trained using well-labeled training data and
then predicts the output. Labeled data means the input data is already tagged with the
correct output.
Classification
Classification is a supervised learning task in which the target variable is categorical.
It helps you divide your data into different classes, and the algorithm which implements the
classification on a dataset is known as a classifier.
There are two types of classification:
1) Binary classification: if the classification problem has only two possible classes, it is called
binary classification (T/F, Y/N, 0/1).
2) Multi-class classification: if the classification problem has more than two classes, it is
called multi-class classification (e.g., categorizing items as movies or music).
Unsupervised learning is a type of algorithm that learns patterns from untagged data. It
mainly deals with unlabelled data. Unsupervised learning algorithms allow users to
perform more complex processing tasks compared to supervised learning.
Clustering
Clustering is an unsupervised learning task; there is no label for any instance of the data.
Clustering is alternatively called grouping. Clustering is the task of grouping a set of
objects in such a way that objects in the same group are more similar to each other than to
those in other groups. Types of clustering:
Exclusive clustering
Overlapping clustering
Hierarchical clustering
Reinforcement Learning
Reinforcement learning is a type of machine learning in which an agent learns by interacting with an environment, receiving rewards for good actions and penalties for bad ones.
Applications of Machine Learning
1. Image Recognition
Image recognition is one of the most common applications of machine learning. It is used to identify objects,
persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic
friend tagging suggestion: Facebook provides us a feature of auto friend tagging suggestion. Whenever we
upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the
technology behind this is machine learning's face detection and recognition algorithm. It is based on the
Facebook project named "DeepFace," which is responsible for face recognition and person identification in the
picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition and is a
popular application of machine learning. Speech recognition is the process of converting voice instructions into
text, and it is also known as "Speech to text" or "Computer speech recognition." At present, machine learning
algorithms are widely used by various applications of speech recognition. Google assistant, Siri, Cortana,
and Alexa are using speech recognition technology to follow the voice instructions.
3. Traffic prediction
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest
route and predicts the traffic conditions. It predicts traffic conditions such as whether traffic is clear, slow-
moving, or heavily congested with the help of two sources of information:
Real-time location of the vehicle from the Google Maps app and sensors
Average time taken on past days at the same time of day
Everyone who is using Google Maps is helping this app to become better. It takes information from the user and
sends it back to its database to improve performance.
4. Product recommendations
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some product on
Amazon, we start getting advertisements for the same product while surfing the internet on the same
browser, and this is because of machine learning. Google understands the user's interest using various machine
learning algorithms and suggests products as per the customer's interest. Similarly, when we use Netflix, we find
recommendations for entertainment series, movies, etc., and this is also done with the help of machine
learning.
5. Email Spam
Whenever we receive a new email, it is filtered automatically as important, normal, and spam. We always
receive an important mail in our inbox with the important symbol and spam emails in our spam box, and the
technology behind this is Machine learning. Below are some spam filters used by Gmail:
Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes
classifier are used for email spam filtering and malware detection.
6. Online Fraud Detection
Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we
perform an online transaction, there are various ways that a fraudulent transaction can take place, such
as fake accounts, fake IDs, and stealing money in the middle of a transaction. To detect this, a feed-forward
neural network helps us by checking whether it is a genuine transaction or a fraud transaction. For each
genuine transaction, the output is converted into some hash values, and these values become the input for the
next round. For each genuine transaction, there is a specific pattern which changes for a fraud transaction;
hence, the system detects it and makes our online transactions more secure.
7. Stock Market Trading
Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and
downs in shares, so for this, machine learning's long short-term memory (LSTM) neural network is used for the
prediction of stock market trends.
8. Medical Diagnosis
In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing
very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in
finding brain tumors and other brain-related diseases easily.
The ingredients of machine learning. Tasks: the problems that can be
solved with machine learning
MANUAL DATA ENTRY: ML programs use the discovered data to improve the process as more
calculations are made. Thus machines can learn to perform time-intensive documentation and data entry
tasks.
DETECTING SPAM: Spam detection is one of the earliest problems solved by machine learning. A few
years ago, email service providers used pre-existing rule-based techniques to remove spam.
PRODUCT RECOMMENDATION: Unsupervised learning enables a product-based
recommendation system. The algorithm identifies hidden patterns among items and focuses on grouping
similar products into clusters. E-commerce businesses such as Amazon have this capability.
MEDICAL DIAGNOSIS: Machine learning in the medical field will improve patients' health at
minimum cost. These predictions are based on datasets of anonymized patient records and the symptoms
exhibited by a patient.
FINANCIAL ANALYSIS: Due to the large volume of data, its quantitative nature and accurate historical
records, machine learning can be used in financial analysis. Future applications of ML in finance include
chatbots and conversational interfaces for customer service, security and sentiment analysis.
PREDICTIVE MAINTENANCE: Predictive maintenance minimizes the risk of
unexpected failures and reduces the amount of unnecessary preventive maintenance activities. For
predictive maintenance, an ML architecture can be built which consists of historical device data, a flexible
analysis environment, a workflow visualization tool and an operations feedback loop. The Azure ML platform
provides an example of simulated aircraft engine run-to-failure events to demonstrate the predictive
maintenance modeling process.
IMAGE RECOGNITION (COMPUTER VISION): Computer vision produces numerical or
symbolic information from images and high-dimensional data. It involves machine learning, data
mining, database knowledge discovery and pattern recognition. This customization requires highly
qualified data scientists or ML consultants.
Models: The output of machine learning
Models form the central concept in machine learning as they are what is being learned from the data in order to
solve a given task. There is a considerable, not to say bewildering, range of machine learning models to choose
from.
1) Geometric models
2) Probabilistic models
3) Logical models
4) Grouping and grading
Geometric models: A geometric model is constructed directly in instance space, using geometric concepts such
as lines, planes and distances. One main advantage of geometric classifiers is that they are easy to visualize, as
long as we keep to two or three dimensions.
Probabilistic models: Probabilistic classifier is a classifier that is able to predict, given an observation of an
input, a probability distribution over a set of classes, rather than only outputting the most likely class that the
observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or
when combining classifiers into ensembles.
Logical models: Logical models use logical conditions on the features, typically expressed as if-then rules, to divide
the instance space into segments; a prediction is then attached to each segment. Tree models and rule models are the
main examples of this family. Because each segment corresponds to a conjunction of conditions, a logical model can
be read as a set of human-interpretable rules, which is one of the main attractions of these models.
Grouping and Grading: Grouping models break up the instance space into groups or segments,
the number of which is determined at training time. One could say that grouping models have a fixed and finite
'resolution' and cannot distinguish between individual instances beyond this resolution. Grading models, in contrast,
work on a global scale (for instance by means of a score) and can in principle distinguish between arbitrary instances.
Features: the workhorses of machine learning.
Features are measurable properties of the instances being described; models are built in terms of features, so the success of a machine learning application depends to a large extent on how the features are chosen and constructed.
Classification
Types of classification techniques:
Logistic Regression
Naive Bayes
K-Nearest Neighbors
Decision Tree
Random Forest
Support Vector Machines
Logistic Regression
Is one of the most popular machines learning algorithm which comes under the
supervised learning.
Logistic regression is a categorical variable
Logistic regression used for solving classification problem.
Logistic regression instead of fitting a regression line we fit an S shaped which predict
two max values (0 or 1)
P(A|B) = ( P(B|A) × P(A) ) / P(B)
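A minimal sketch of logistic regression in Python using scikit-learn (a library choice assumed here; the notes do not name one). The tiny pass/fail dataset is invented purely to show the S-shaped probability output.

```python
# Hedged sketch: logistic regression with scikit-learn on made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied (feature) and pass/fail outcome (binary class label)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# The model outputs probabilities between 0 and 1 via the sigmoid curve,
# which are thresholded at 0.5 to give the final class label.
print(clf.predict_proba([[4.5]]))   # probabilities near [0.5, 0.5]
print(clf.predict([[2], [7]]))      # -> [0 1]
```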
K-Nearest Neighbors
KNN is one of the simplest machine learning algorithms, based on supervised learning.
The KNN algorithm can be used for both classification and regression problems, but it is
mostly used for classification problems.
The KNN algorithm stores the available data and classifies new data points based on similarity
measures (a distance function).
Euclidean Distance:
d = √( (X_H − H1)² + (X_W − W1)² )
where (X_H, X_W) is the observed (new) data point and (H1, W1) is an existing data point.
Decision Tree
A decision tree is a supervised learning technique that can be used for both classification
and regression problems, but it is mostly used for classification problems.
It is a tree-structured classifier.
Random Forest
Random forest is a popular machine learning algorithm that belongs to the supervised learning technique.
It can be used for both classification and regression problems, but it is mostly used for classification problems.
It is based on the concept of ensemble learning.
A random forest is a collection of decision trees.
Support Vector Machines
Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating
line is defined with the help of these data points.
Hyperplane − A hyperplane is the decision plane or boundary that separates a set of objects
belonging to different classes.
Margin − The margin is the gap between the two lines drawn on the closest data points of different classes. It
can be calculated as the perpendicular distance from the line to the support vectors. A large margin is
considered a good margin and a small margin is considered a bad margin.
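A brief sketch of these ideas in Python with scikit-learn (an assumed library; the notes give no code): fitting a linear SVM on an invented two-class toy dataset and inspecting the support vectors and hyperplane coefficients described above.

```python
# Hedged sketch: linear SVM on made-up 2-D points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)           # data points closest to the hyperplane
print(clf.coef_, clf.intercept_)      # w and b of the hyperplane w.x + b = 0
print(clf.predict([[3, 2], [7, 6]]))  # -> [0 1]
```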
Scoring and ranking
Variable ranking is the process of ordering the features by the value of some scoring function,
which usually measures feature relevance. The score S(f_i) is computed from the
training data, measuring some criterion of feature f_i. By convention, a high score is indicative of a
valuable (relevant) feature.
List of scoring modules
Machine Learning Studio (classic) provides many different scoring modules. You select one
depending on the type of model you are using, or the type of scoring task you are performing:
for example, the generic Score Model module is used for most regression and classification models, as well as some
anomaly detection models.
Coverage curve
A coverage curve plots, for each threshold on a classifier's scores, the number of positives correctly classified against the number of negatives incorrectly classified.
A probabilistic classifier assigns a probability to each class, where the probability of a particular
class corresponds to the probability of the instance belonging to that class. This is called probability
estimation.
Concavity relates to the rate of change of a function's derivative. A function f is concave up (or
upwards) where the derivative f′ is increasing. This is equivalent to the second derivative f′′
being positive.
Beyond binary classification: Handling more than two classes
1. Multiclass Classification: A classification task with more than two classes; e.g., classify
a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes
the assumption that each sample is assigned to one and only one label: a fruit can be either an
apple or a pear but not both at the same time.
Problem – Given a dataset of m training examples, each of which contains information in the
form of various features and a label. Each label corresponds to a class, to which the training
example belongs to. In multiclass classification, we have a finite set of classes. Each training
example also has n features.
For example, in the case of identification of different types of fruits, "Shape", "Color" and "Radius"
can be features and "Apple", "Orange" and "Banana" can be different class labels.
In a multiclass classification, we train a classifier using our training data, and use this classifier
for classifying new examples.
Aim – We will use different multiclass classification methods such as KNN, Decision trees,
SVM, etc. We will compare their accuracy on test data.
Approach –
Multiclass perceptrons provide a natural extension to the multi-class problem. Instead of just
having one neuron in the output layer with binary output, one could have N binary neurons,
leading to multi-class classification. In practice, the last layer of a neural network is usually a
softmax function layer, which is the algebraic simplification of N logistic classifiers, normalized
per class by the sum over all N classes.
Decision tree classifier – Decision tree classifier is a systematic approach for multiclass
classification. It poses a set of questions to the dataset (related to its attributes/features). The
decision tree classification algorithm can be visualized on a binary tree. On the root and each of
the internal nodes, a question is posed and the data on that node is further split into separate
records that have different characteristics. The leaves of the tree refer to the classes in which the
dataset is split.
SVM (Support vector machine) classifier
SVM (Support vector machine) is an efficient classification method when the feature vector is
high dimensional.
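To make the stated aim concrete, here is a small Python sketch (using scikit-learn, an assumption not named in the notes) that trains KNN, a decision tree and an SVM on a multiclass dataset and compares their test accuracy. The iris dataset stands in for the fruit example; accuracies will vary with the random split.

```python
# Hedged sketch: compare several multiclass classifiers on held-out test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)   # 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```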
Regression
Regression is a supervised learning task in which the target variable is continuous (numeric); the model learns to predict a numeric value from the input features.
Types of Regression
Linear Regression
Polynomial Regression
Support Vector Regression
Decision Tree Regression
Random Forest Regression
Linear Regression
Y=b0+b1*X
Where:
Y = Dependent variable
X = Independent variable
b0 = Intercept
b1 = Slope (coefficient of X)
1. Positive Regression
If the dependent variable on the Y-axis increases as the independent variable on the X-axis increases,
such a relationship is termed a positive regression (positive linear relationship).
Y=b0+b1*X
2. Negative Regression
If the dependent variable on the Y-axis decreases as the independent variable on the X-axis increases,
such a relationship is called a negative regression (negative linear relationship).
Y = b0 − b1*X (the slope enters with a negative sign)
Example: Linear Regression using the least squares method
X (independent) | Y (dependent) | X − X̄      | Y − Ȳ      | (X − X̄)² | (X − X̄)(Y − Ȳ)
1               | 2             | 1 − 3 = −2 | 2 − 4 = −2 | 4        | 4
2               | 4             | 2 − 3 = −1 | 4 − 4 = 0  | 1        | 0
3               | 5             | 3 − 3 = 0  | 5 − 4 = 1  | 0        | 0
4               | 4             | 4 − 3 = 1  | 4 − 4 = 0  | 1        | 0
5               | 5             | 5 − 3 = 2  | 5 − 4 = 1  | 4        | 2
X̄ = 3, Ȳ = 4                                              | Σ = 10   | Σ = 6
b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 6 / 10 = 0.6
Using Ȳ = b0 + b1·X̄:
4 = b0 + 0.6(3)
4 = b0 + 1.8
b0 = 4 − 1.8 = 2.2
So the fitted regression line is Y = 2.2 + 0.6X.
Polynomial Regression
Y=a+b*x^2
In this regression technique, the best fit line is not a straight line. It is rather a curve that fits into
the data points.
Support Vector Regression
A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used
for both classification and regression challenges. However, it is mostly used in classification
problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is
the number of features you have) with the value of each feature being the value of a particular
coordinate. Then, we perform classification by finding the hyperplane that differentiates the two
classes very well.
Support vectors are simply the coordinates of individual observations. A Support Vector Machine
is a frontier (hyperplane/line) which best segregates the two classes.
Decision Tree Regression
Decision tree regression is similar to decision tree classification; however, it uses mean squared error or similar metrics to choose splits.
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of
a procedure for estimating an unobserved quantity) measures the average of the squares of the
errors—that is, the average squared difference between the estimated values and what is actually
estimated. The MSE is a measure of the quality of an estimator—it is always non-negative, and
values closer to zero are better.
Random Forest Regression
Identify n, the number of decision tree regressors to be created, and repeat the sampling and tree-building steps
to create several regression trees.
The average of the target values in each branch is assigned to the corresponding leaf node of each decision tree.
To predict the output for a new instance, the average of the predictions of all the decision trees is
taken into consideration.
Random forest prevents overfitting (which is common in individual decision trees) by creating random
subsets of the features and building smaller trees using these subsets.
Unsupervised Learning
Unsupervised learning is a type of algorithm that learns patterns from untagged data.
It mainly deals with unlabelled data.
Unsupervised learning algorithms allow users to perform more complex processing tasks
compared to supervised learning.
Clustering
Exclusive cluster
Overlap cluster
Hierarchical
Exclusive (partitioning)
In this clustering method, data are grouped in such a way that each data point can belong to one cluster
only.
Example: K-means
Agglomerative
In this clustering technique, every data point initially forms its own cluster. Iterative unions between the two
nearest clusters then reduce the number of clusters.
Overlapping
In this technique, fuzzy sets are used to cluster data. Each point may belong to two or more
clusters with separate degrees of membership.
Descriptive learning
Descriptive Learning: Using descriptive analysis you came up with the idea that two products, A
(burger) and B (french fries), are bought together with very high frequency.
Now you want that if a user buys A, the machine should automatically give them a suggestion to buy B.
Looking at past data and deducing what could be the possible factors influencing this situation can be
achieved using ML.
Predictive Learning: We want to increase our sales; using descriptive learning we came to know
what could be the possible factors influencing sales. By tuning the parameters in such a way that sales
are maximized in the next quarter, we predict what sales we could generate and hence
make investments accordingly. This task can also be handled using ML.
The most studied task in machine learning is inferring a function that classifies examples, represented in
some language, as members or non-members of a concept from pre-classified training examples. This is
called concept learning or classification.
Hypothesis Testing
Hypothesis testing is a multi-step procedure that leads the researcher from the hypothesis statement to the decision
regarding the hypothesis.
A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that
an algorithm would come up with depends upon the data and also upon the restrictions and bias that we
have imposed on the data. To better understand the hypothesis space and hypothesis, consider the following
coordinate plane that shows the distribution of some data:
Suppose we have test data for which we have to determine the outputs or results. The test data is as shown
below:
But note here that we could have divided the coordinate plane in other ways as well:
The way in which the coordinate plane is divided depends on the data, the algorithm and the constraints.
All of these legal possible ways in which we can divide the coordinate plane to predict the outcome of the
test data together compose the hypothesis space.
Each individual possible way is known as a hypothesis.
Hence, in this example the hypothesis space would be like:
We have not one but two most general hypotheses. What we can also notice is that every concept
between the least general one and one of the most general ones is also a possible hypothesis, i.e.,
covers all the positives and none of the negatives. Mathematically speaking, we say that the set of
such hypotheses is bounded from below by the least general generalization and from above by the most general hypotheses.
Algorithm LGG-Conj-ID(x, y) – find the least general conjunctive generalization of two
conjunctions, employing internal disjunction.
Input: conjunctions x, y.
Output: conjunction z.
1. z ← true;
2. for each feature f do
3.     if f = v_x is a conjunct in x and f = v_y is a conjunct in y then
4.         add f = Combine-ID(v_x, v_y) to z; (Combine-ID keeps the value if v_x = v_y and forms the internal disjunction of the two values otherwise)
5.     end
6. end
7. return z
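A rough Python sketch of this idea follows. Representing a conjunction as a dict that maps a feature name to a set of allowed values (a set with more than one value acting as the internal disjunction) is an assumption made for illustration, not the book's notation.

```python
# Hedged sketch of LGG-Conj-ID: least general generalization of two
# conjunctions using internal disjunction.
def lgg_conj_id(x, y):
    z = {}                          # z <- true (no conditions yet)
    for f in x:                     # for each feature f constrained in x
        if f in y:                  # ... that is also constrained in y
            z[f] = x[f] | y[f]      # combine the values by internal disjunction
        # if f is not constrained in y, the condition on f is dropped from z
    return z

# Example: generalize two instance descriptions (invented values)
x = {"length": {3}, "gills": {"no"}, "teeth": {"many"}}
y = {"length": {5}, "gills": {"no"}, "teeth": {"many"}}
print(lgg_conj_id(x, y))
# -> {'length': {3, 5}, 'gills': {'no'}, 'teeth': {'many'}}
```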
Tree models
Tree-based machine learning methods are among the most commonly used
supervised learning methods. Tree-based ML methods are built by recursively splitting a training
sample, using different features from a dataset at each node that splits the data most effectively.
Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.
Algorithm
1. Start with a training data set, which we'll call S. It should have attributes and a classification.
2. Determine the best attribute in the dataset. (We will go over the definition of best attribute.)
3. Split S into subsets that contain the possible values of the best attribute.
4. Make a decision tree node that contains the best attribute.
5. Recursively generate new decision trees by using the subsets of data created in step 3, until a
stage is reached where you cannot classify the data further. Represent the class as a leaf node.
Example
Formulas
Information (entropy of the class distribution with p positive and n negative examples):
I(p, n) = −(p/(p+n)) log₂(p/(p+n)) − (n/(p+n)) log₂(n/(p+n))
Entropy of an attribute A (expected information after splitting on A):
E(A) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
Gain:
Gain(A) = I(p, n) − E(A)
Infogain
I(p, n) = −(9/14) log₂(9/14) − (5/14) log₂(5/14)
        = 0.409 + 0.530 = 0.940
I(Outlook, Sunny) = −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.971
I(Outlook, Overcast) = 0
E(Outlook) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694
Gain(Outlook) = 0.940 − 0.694 = 0.246
Attribute     Gain
Outlook       0.246   (first splitting point)
Temperature   0.029
Humidity      0.151
Wind          0.048
Outlook = Rain
I(Outlook, Rain) = −(3/5) log₂(3/5) − (2/5) log₂(2/5)
                 = 0.971
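A short Python sketch that reproduces these information-gain numbers from the class counts used in the worked example (9 positive and 5 negative examples overall, with Outlook splitting them as Sunny 2+/3−, Overcast 4+/0−, Rain 3+/2−, as in the classic play-tennis dataset the example appears to follow).

```python
# Hedged sketch: entropy and information gain from class counts.
import math

def info(p, n):
    """I(p, n): entropy of a node with p positive and n negative examples."""
    s = p + n
    total = 0.0
    for c in (p, n):
        if c:
            total -= (c / s) * math.log2(c / s)
    return total

def gain(p, n, splits):
    """Gain(A) = I(p, n) - E(A); splits is a list of (p_i, n_i) pairs."""
    e = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
    return info(p, n) - e

print(round(info(9, 5), 3))                            # 0.940
print(round(info(2, 3), 3))                            # 0.971  (Outlook = Sunny)
print(round(gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))  # ~0.247, i.e. the 0.940 - 0.694 = 0.246 above up to rounding
```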
Advantages
It is simple to understand, as it mirrors the way a human reasons through a decision.
It can handle both categorical and numerical attributes and requires relatively little data preparation.
Disadvantages
The decision tree may contain lots of layers, which makes it complex.
It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
For more class labels, the computational complexity of the decision tree may increase.
Application
Decision trees have been used to develop models for prediction and classification in different domains, some of
which are:
Business management
Customer relationship management
Fraudulent statement detection
Engineering, Energy consumption
Fault diagnosis
Healthcare Management
Agriculture
Decision tree learning algorithms are commonly used in machine learning for
classification problems. A tree is defined as a set of logical conditions on attributes; a leaf represents
the subset of instances corresponding to the conjunction of conditions along its branch, or path
back to the root.
An instance being classified is passed along the tree to a leaf and is assigned the majority class label of
that leaf. For many applications, it is useful to order instances, which involves assigning them a
rank rather than a class label. For example, a web-page recommender may want to order web pages by
the likelihood of their being of interest to the user, instead of classifying them as "of interest" and "not
of interest." A simple approach to ranking is to estimate the probability of an instance's
membership in a class, and assign that probability as the instance's rank. A decision tree can
easily be used to estimate these probabilities. If a leaf node contains class frequencies n1, n2, ..., nc, the
probability that an instance falling in that leaf belongs to class i can be defined as ni / (n1 + n2 + ... + nc). Decision trees
acting as probability estimators, however, are often observed to produce bad probability
estimates. Specifically, every instance in a node is assigned the same probability, resulting in a
proliferation of ties, which reduces the results' usefulness for ranking. Sparse training sets may also
lead to skewed estimates. Most developments in decision tree learning algorithms have aimed at
improving classification accuracy rather than probability estimates. We explore two techniques
for producing improved probability estimates.
Reduction in Variance
Reduction in variance is an algorithm used for continuous target variables (regression problems). This algorithm
uses the standard formula of variance to choose the best split. The split with the lower variance is selected as the
criterion to split the population:
Variance = Σ (X − X̄)² / n
where X̄ is the mean of the values, X is an actual value and n is the number of values.
Steps to calculate variance:
Calculate the variance for each node.
Calculate the variance for each split as the weighted average of the variances of its child nodes.
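A small Python sketch of these two steps, using invented target values: the candidate split with the lower weighted child variance would be preferred by a regression tree.

```python
# Hedged sketch: reduction in variance for choosing a regression split.
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_split_variance(left, right):
    # weighted average of the two child-node variances
    n = len(left) + len(right)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

# Split A separates small targets from large ones; split B mixes them.
split_a = ([10, 12, 11], [30, 35, 28, 33])
split_b = ([10, 30, 11, 33], [12, 35, 28])

print(weighted_split_variance(*split_a))   # low variance  -> good split
print(weighted_split_variance(*split_b))   # high variance -> worse split
```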
Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the oldest tree
classification methods. It finds the statistical significance of the differences between sub-nodes and the
parent node. We measure it by the sum of squares of the standardized differences between the observed and expected
frequencies of the target variable. It works with a categorical target variable such as "Success" or "Failure". It can
perform two or more splits. The higher the value of Chi-Square, the higher the statistical significance of the
differences between the sub-node and the parent node.
Mathematically, Chi-square is represented as:
Chi-square = Σ (Observed − Expected)² / Expected
Rule models
Rule-based methods are a popular class of techniques in machine learning and data mining. They share the
goal of finding regularities in data that can be expressed in the form of an IF-THEN rule. Depending on the
type of rule that should be found, we can discriminate between descriptive rule discovery, which aims at
describing significant patterns in the given dataset in terms of rules, and predictive rule learning. In the latter
case, one is often also interesting in learning a collection of the rules that collectively cover the instance space
in the sense that they can make a prediction for every possible instance. In the following, we will briefly
introduce both tasks and point out some key works in this area. While in some application areas rule learning
algorithms are superseded by statistical approaches such as Support Vector Machines (SVMs). An emerging
use case for rule learning is the Semantic Web, whose representation is built on rule-based formalisms.
Descriptive rule learning produces new, nontrivial information based on the available data set.
Descriptive rules are used to learn about and understand the data.
Example: identify and describe groups of customers with common buying behaviour.
First-order rule learning
So far we have discussed algorithms for learning sets of propositional (i.e., variable-free) rules. In this section, we
consider learning rules that contain variables, in particular first-order Horn theories. Our motivation for
considering such rules is that they are much more expressive than propositional rules. Inductive learning of
first-order rules or theories is often referred to as inductive logic programming (ILP), because this
process can be viewed as automatically inferring PROLOG programs from examples. PROLOG is a general-
purpose, Turing-equivalent programming language in which programs are expressed as collections of Horn
clauses.
Linear models
Linear models generate a formula to create a best-fit line to predict unknown values. Linear models are
considered "old school" and often not as predictive as newer algorithm classes, but they can be trained
relatively quickly and are generally more straightforward to interpret.
1. Linear Regression
Linear regression is one of the most basic types of regression in machine learning. The
linear regression model consists of a predictor variable and a dependent variable related
linearly to each other. In case the data involve more than one independent variable,
the model is called a multiple linear regression model.
y=mx+c+e
where m is the slope of the line, c is an intercept, and e represents the error in the model.
The best fit line is determined by varying the values of m and c. The predictor error is the
difference between the observed values and the predicted value. The values of m and c get
selected in such a way that they give the minimum prediction error. It is important to note that a
simple linear regression model is susceptible to outliers. Therefore, it should be used with care on
large datasets that may contain many outliers.
2. Logistic Regression
Logistic regression is one of the most popular machine learning algorithms and comes under supervised
learning.
The target variable in logistic regression is categorical.
Logistic regression is used for solving classification problems.
In logistic regression, instead of fitting a regression line, we fit an S-shaped (sigmoid) curve which predicts
two maximum values (0 or 1).
Imagine you have some points, and want to have a line that best fits them like this:
We can place the line "by eye": try to have the line as close as possible to all points, and a similar number of points
above and below the line.
But for better accuracy let's see how to calculate the line using Least Squares Regression.
The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line :
y = mx + b
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis)
Steps
Step 1: For each (x, y) point, calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up")
Step 3: Calculate the slope m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²), where N is the number of points
Step 4: Calculate the intercept b = (Σy − m Σx) / N
Then we have the equation of the best-fit line: y = mx + b
Example
Sam found how many hours of sunshine vs how many ice creams were sold at the shop from Monday to Friday:
"x" "y"
Hours of Sunshine Ice Creams Sold
2 4
3 5
5 7
7 10
9 15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b
x y x2 xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx = 26,  Σy = 41,  Σx² = 168,  Σxy = 263,  N = 5
m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = 249 / 164
  = 1.5183...
b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  = 0.3049...
y = mx + b
y = 1.518x + 0.305
x    y     predicted ŷ = 1.518x + 0.305    error (ŷ − y)
2    4     3.34                            −0.66
3    5     4.86                            −0.14
5    7     7.89                            0.89
7    10    10.93                           0.93
9    15    13.97                           −1.03
Here are the (x, y) points and the line y = 1.518x + 0.305 on a graph:
Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the above equation to
estimate that he will sell y = 1.518 × 8 + 0.305 ≈ 12.45, i.e., about 12 or 13 ice creams.
Sam makes fresh waffle cone mixture for 14 ice creams just in case.
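A quick Python check of this worked example using NumPy's least-squares polynomial fit (an assumed tool; the notes compute the sums by hand). The fitted slope and intercept match the m and b obtained above.

```python
# Hedged sketch: least-squares line fit for the ice-cream example.
import numpy as np

x = np.array([2, 3, 5, 7, 9])      # hours of sunshine
y = np.array([4, 5, 7, 10, 15])    # ice creams sold

m, b = np.polyfit(x, y, deg=1)     # fit y = m*x + b by least squares
print(round(m, 3), round(b, 3))    # ~1.518  ~0.305

# Predicted sales for 8 hours of sunshine
print(m * 8 + b)                   # ~12.45
```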
The least squares method is a statistical procedure to find the best fit for a set of data points by
minimizing the sum of the offsets or residuals of points from the plotted curve.
Least squares regression is used to predict the behavior of dependent variables.
Perceptron: A perceptron is a single-layer neural network; a multi-layer perceptron is called a neural
network. The perceptron is a linear (binary) classifier used in supervised learning; it helps to classify the
given input data into two classes. A perceptron consists of four parts:
1. Input values (one input layer)
2. Weights and bias
3. Net sum
4. Activation function
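A minimal Python sketch of the perceptron learning rule built from exactly these parts (weights and bias, net sum, step activation). The AND-gate data are an illustrative choice, not taken from the notes.

```python
# Hedged sketch: perceptron learning on the (linearly separable) AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])            # logical AND

w = np.zeros(2)                       # weights
b = 0.0                               # bias
lr = 0.1                              # learning rate

for _ in range(20):                   # a few passes over the data
    for xi, target in zip(X, y):
        net = np.dot(w, xi) + b       # net sum
        out = 1 if net > 0 else 0     # step activation function
        error = target - out
        w += lr * error * xi          # perceptron weight update
        b += lr * error

print(w, b)
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # -> [0, 0, 0, 1]
```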
1) Support vectors: Data points that are closest to the hyperplane are called support vectors.
2) Hyperplane: The hyperplane is the decision plane or boundary that separates a set of objects
having different classes.
3) Margin: The margin is the gap between the lines drawn through the closest data points
of different classes; it can be calculated as the perpendicular distance from the separating
line to the support vectors. A large margin is considered a good margin and a small
margin is considered a bad margin.
Advantages
SVM works well when there is a clear margin of separation between classes and is effective in high-dimensional spaces.
It is memory efficient, since the decision function uses only the support vectors.
Disadvantages
Training can be slow on large datasets, and performance degrades when the classes overlap heavily.
Choosing a suitable kernel and its parameters requires care, and the classifier does not directly provide probability estimates.
Applications
Image Classification
Face Detect
Handwriting recognition
Text Categorization
Logistic calibration
Logistic calibration is a way of obtaining class probabilities from a linear classifier: the raw score s(x) (the signed distance from the decision boundary) is passed through the logistic (sigmoid) function, p(x) = 1 / (1 + e^(−γ·s(x))), so that a score of 0 maps to a probability of 0.5 and large positive or negative scores map to probabilities close to 1 or 0.
Distance Based Models: Neighbours and exemplars
Distance is applied through the concept of neighbours and exemplars. Neighbours are points in proximity with
respect to the distance measure expressed through exemplars. Exemplars are either centroids, which find a centre
of mass according to a chosen distance metric, or medoids, which are the most centrally located data points. The
most commonly used centroid is the arithmetic mean, which minimises the squared Euclidean distance to all other
points.
K-NN is one of the simplest machine learning algorithms, based on the supervised learning
technique.
The K-NN algorithm can be used for regression as well as classification problems, but it is mostly used for
classification problems.
The K-NN algorithm stores the available data and classifies new data points based on similarity measures (a distance
function).
Euclidean Distance:
d = √( (X_H − H1)² + (X_W − W1)² )
where (X_H, X_W) is the observed (new) data point and (H1, W1) is an existing data point.
Example:
Perform the KNN classification algorithm on the following dataset and predict the class for x
(P1 = 3 and P2 = 7), with K = 3.
     P1   P2   CLASS
i    7    7    FALSE
ii   7    4    FALSE
iii  3    4    TRUE
iv   1    4    TRUE
Query point x = (3, 7)
d = √( (X_H − H1)² + (X_W − W1)² )
D(x, i) = √( (3 − 7)² + (7 − 7)² ) = √( (−4)² + 0² ) = √16 = 4
D(x, ii)  = √( (3 − 7)² + (7 − 4)² ) = √( (−4)² + 3² ) = √(16 + 9) = √25 = 5
D(x, iii) = √( (3 − 3)² + (7 − 4)² ) = √( 0² + 3² ) = √9 = 3
D(x, iv)  = √( (3 − 1)² + (7 − 4)² ) = √( 2² + 3² ) = √13 = 3.60
With K = 3, the three nearest neighbours are iii (distance 3), iv (3.60) and i (4); their classes are TRUE, TRUE and FALSE, so by majority vote x is classified as TRUE.
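A short Python sketch that reproduces this worked example: it computes the Euclidean distances from x = (3, 7) to each labelled point and takes a majority vote over the K = 3 nearest neighbours.

```python
# Hedged sketch: KNN by hand for the example above.
import math
from collections import Counter

data = [((7, 7), "FALSE"), ((7, 4), "FALSE"), ((3, 4), "TRUE"), ((1, 4), "TRUE")]
x = (3, 7)
K = 3

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

neighbours = sorted(data, key=lambda item: euclidean(x, item[0]))[:K]
print([(p, round(euclidean(x, p), 2), c) for p, c in neighbours])
# nearest three: (3,4) TRUE, (1,4) TRUE, (7,7) FALSE

print(Counter(c for _, c in neighbours).most_common(1)[0][0])   # -> TRUE
```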
Applications
Used in classification
Used to impute missing values
Used in pattern recognition
K-Means Algorithm
K-means partitions the data into K clusters. The basic steps are:
1. Choose the number of clusters K and pick initial centroids.
2. Assign each data point to the cluster whose centroid is nearest (using a distance measure such as Euclidean distance).
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
Example:
Divide the given sample data in two clusters using K-Means Algorithm [Euclidean Distance]
Height(H) Weight(W)
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Euclidean Distance:
d = √( (X_H − H1)² + (X_W − W1)² )
where (X_H, X_W) is the data point (its height and weight) and (H1, W1) is a cluster centroid.
Take the first two data points as the initial centroids:
c1 = (185, 72)
c2 = (170, 56)
Euclidean Distance: Row 3, point (168, 60)
C1: √( (168 − 185)² + (60 − 72)² ) = √( (−17)² + (−12)² ) = √(289 + 144) = √433 = 20.80
C2: √( (168 − 170)² + (60 − 56)² ) = √( (−2)² + 4² ) = √(4 + 16) = √20 = 4.47
The point is closer to C2, so it joins cluster C2. Update the centroid:
C2 = ( (170 + 168)/2 , (56 + 60)/2 ) = (338/2, 116/2) = (169, 58)
Euclidean Distance: Row 4, point (179, 68)
C1: √( (179 − 185)² + (68 − 72)² ) = √( 36 + 16 ) = √52 = 7.21
C2: √( (179 − 169)² + (68 − 58)² ) = √( 100 + 100 ) = √200 = 14.14
The point is closer to C1, so it joins cluster C1. Update the centroid:
C1 = ( (185 + 179)/2 , (72 + 68)/2 ) = (364/2, 140/2) = (182, 70)
Euclidean Distance: Row 5, point (182, 72)
C1: √( (182 − 182)² + (72 − 70)² ) = √( 0 + 4 ) = 2
C2: √( (182 − 169)² + (72 − 58)² ) = √( 169 + 196 )
    = √365 = 19.10
The point is closer to C1, so it joins cluster C1. Update the centroid:
C1 = ( (182 + 182)/2 , (70 + 72)/2 ) = (364/2, 142/2) = (182, 71)
Euclidean Distance: Row 6, point (188, 77)
C1: √( (188 − 182)² + (77 − 71)² ) = √( 36 + 36 ) = √72 = 8.48
C2: √( (188 − 169)² + (77 − 58)² ) = √( 361 + 361 ) = √722 = 26.87
The point is closer to C1, so it joins cluster C1. Update the centroid:
C1 = ( (188 + 182)/2 , (77 + 71)/2 ) = (370/2, 148/2) = (185, 74)
Continuing in the same way for the remaining rows, the final clusters are:
C1 = { rows 1, 4, 5, 6, 7, 8, 9, 10, 11, 12 }
C2 = { rows 2, 3 }
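A hedged Python sketch that clusters the same height/weight data with scikit-learn's KMeans (a library choice assumed here). scikit-learn uses batch updates and its own initialization, so the centroids can differ slightly from the sequential hand computation above, but the grouping it finds typically matches the C1/C2 split.

```python
# Hedged sketch: k-means on the height/weight table with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [185, 72], [170, 56], [168, 60], [179, 68], [182, 72], [188, 77],
    [180, 71], [180, 70], [183, 84], [180, 88], [180, 67], [177, 76],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster index for each of the 12 rows
print(km.cluster_centers_)   # the two centroids
```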
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped
distributions are formed as long as the dense regions can be connected. The algorithm identifies
different clusters in the dataset and connects the areas of high density into clusters. The dense areas in data
space are separated from each other by sparser areas. These algorithms can face difficulty in clustering the data
points if the dataset has varying densities and high dimensionality.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a
data point belongs to a particular distribution. The grouping is done by assuming some distribution,
commonly the Gaussian distribution. An example of this type is the Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models (GMM).
Hierarchical Clustering.
Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to
pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to
create a tree-like structure, which is called a dendrogram. Any number of clusters
can then be selected by cutting the tree at the appropriate level. The most common example of this method is
the agglomerative hierarchical clustering algorithm.
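A small Python sketch of agglomerative hierarchical clustering with SciPy (an assumed library): it builds the linkage that underlies the dendrogram and then cuts the tree into two clusters. The toy points are invented.

```python
# Hedged sketch: agglomerative clustering and tree cutting with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

Z = linkage(X, method="average")                   # iteratively merge the two nearest clusters
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels)                                      # e.g. [1 1 1 2 2 2]
```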
Probabilistic models
The third family of machine learning algorithms is the probabilistic models. We have seen before that
the k-nearest neighbor algorithm uses the idea of distance (e.g., Euclidean distance) to classify entities,
and logical models use a logical expression to partition the instance space. We see how the probabilistic
models use the idea of probability to classify new entities.
Probabilistic models see features and target variables as random variables. The process of modeling
represents and manipulates the level of uncertainty with respect to these variables. There are two types
of probabilistic models: Predictive and Generative. Predictive probability models use the idea of a
conditional probability distribution P (Y |X) from which Y can be predicted from X. Generative
models estimate the joint distribution P (Y, X). Once we know the joint distribution for the generative
models, we can derive any conditional or marginal distribution involving the same variables. Thus, the
generative model is capable of creating new data points and their labels, knowing the joint probability
distribution. The joint distribution looks for a relationship between two variables. Once this relationship
is inferred, it is possible to infer new data points.
Naïve Bayes Algorithm
Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for
solving classification problems.
It is mainly used in text classification.
The Naïve Bayes classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can
make quick predictions.
P(A|B) = ( P(B|A) × P(A) ) / P(B)
where P(A|B) is the posterior probability of class A given the data B, P(B|A) is the likelihood, P(A) is the prior probability of the class and P(B) is the evidence.
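A brief Python sketch of a Naïve Bayes text classifier with scikit-learn (library and the tiny spam/ham corpus are both assumptions made for illustration, since text classification is the main use case named above).

```python
# Hedged sketch: Naive Bayes on word counts for a made-up spam/ham corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "limited offer win prize", "meeting at noon", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)          # bag-of-words counts

clf = MultinomialNB()                 # Bayes' rule with a naive independence assumption
clf.fit(X, labels)

print(clf.predict(vec.transform(["win a prize now"])))     # likely ['spam']
print(clf.predict_proba(vec.transform(["lunch meeting"])))
```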
Geometric interpretation
Row vectors can be viewed as points or arrows in n-dimensional space.
This view is very intuitive and good for visualization.
It lets us use techniques from geometry and linear algebra.
Feature transformations
Feature transformation (FT) refers to a family of algorithms that create new features using the existing
features. These new features may not have the same interpretation as the original features, but they may
have more discriminatory power in a different space than the original space. Feature transformation can also be used for
feature reduction. FT may happen in many ways: by simple linear combinations of the original features or
by using non-linear functions. Some common techniques for FT are scaling and normalization, standardization,
log and power transforms, discretization (binning), polynomial feature construction and principal component analysis (PCA).
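A short Python sketch of two of these transformations with scikit-learn (an assumed library); the small numeric matrix is invented.

```python
# Hedged sketch: standardization and polynomial feature construction.
import numpy as np
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Each column rescaled to mean 0 and standard deviation 1
print(StandardScaler().fit_transform(X))

# New non-linear features: 1, x1, x2, x1^2, x1*x2, x2^2
print(PolynomialFeatures(degree=2).fit_transform(X))
```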
Model ensembles
Ensemble learning helps improve machine learning results by combining several models. This approach
allows the production of better predictive performance compared to a single model. That is why ensemble
methods have placed first in many prestigious machine learning competitions, such as the Netflix Prize,
KDD 2009 and Kaggle competitions. Ensemble
methods are meta-algorithms that combine several machine learning techniques into one predictive model
in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
Sequential ensemble methods where the base learners are generated sequentially (e.g.
AdaBoost).The basic motivation of sequential methods is to exploit the dependence between the
base learners. The overall performance can be boosted by weighing previously mislabeled
examples with higher weight.
Parallel ensemble methods where the base learners are generated in parallel (e.g. Random
Forest).The basic motivation of parallel methods is to exploit independence between the base
learners since the error can be reduced dramatically by averaging.
Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e.
learners of the same type, leading to homogeneous ensembles.
There are also some methods that use heterogeneous learners, i.e. learners of different types, leading
to heterogeneous ensembles. In order for ensemble methods to be more accurate than any of its individual
members, the base learners have to be as accurate as possible and as diverse as possible.
Bootstrap Aggregation (or Bagging for short), is a simple and very powerful ensemble method. Bagging
is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically
decision trees.
1. Suppose there are N observations and M features. A sample of observations is selected randomly
with replacement (bootstrapping).
2. A subset of features is selected to create a model with the sample of observations and subset of
features.
3. The feature from the subset which gives the best split on the training data is selected.
4. This is repeated to create many models, and every model is trained in parallel.
5. The final prediction is given by aggregating the predictions from all the models.
When bagging with decision trees, we are less concerned about individual trees overfitting the training
data. For this reason and for efficiency, the individual decision trees are grown deep (e.g., few training
samples at each leaf node of the tree) and the trees are not pruned. Such trees have high
variance and low bias, which are the important characteristics of sub-models when combining predictions using
bagging. The only parameter when bagging decision trees is the number of samples and hence the
number of trees to include. This can be chosen by increasing the number of trees run after run until the
accuracy stops showing improvement.
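A hedged Python sketch of bagging with scikit-learn (library and dataset are illustrative assumptions): many trees are trained on bootstrap samples and their predictions are aggregated by voting.

```python
# Hedged sketch: bagged decision trees with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base learner of BaggingClassifier is a (deep, unpruned) decision
# tree, which is the typical high-variance sub-model used in bagging.
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))   # accuracy of the aggregated (voted) prediction
```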
Figure: Bagging
Algorithm
Advantages
Efficient on large datasets
More accurate than decision trees
Averaging results of many trees reduces variance
Disadvantages
More difficult to interpret than decision trees
Less clear which variables are of greatest importance for predicting the response
More computationally intensive than forming a single decision tree
Random forests
Random forest is a supervised learning algorithm which is used for both classification and
regression. However, it is mainly used for classification problems. Just as a forest is made
up of trees, and more trees mean a more robust forest, the random forest algorithm creates
decision trees on data samples, gets a prediction from each of them, and finally selects the
best solution by means of voting. It is an ensemble method which is better than a single decision tree
because it reduces over-fitting by averaging the result.
We can understand the working of Random Forest algorithm with the help of following steps −
Step 1 − First, start with the selection of random samples from a given dataset.
Step 2 − Next, this algorithm will construct a decision tree for every sample. Then it will get the
prediction result from every decision tree.
Step 3 − In this step, voting will be performed for every predicted result.
Step 4 − At last, select the most voted prediction result as the final prediction result.
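A brief Python sketch of these four steps with scikit-learn's RandomForestClassifier (library and dataset are illustrative assumptions): each tree is built on a random sample and the final prediction is the majority vote.

```python
# Hedged sketch: random forest classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators trees, each built on a bootstrap sample (steps 1-2);
# their predictions are combined by majority voting (steps 3-4).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
print(forest.predict(X_test[:5]))
```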
Disadvantages
1. No interpretability
2. Overfitting can easily occur
3. Need to choose the number of trees
So basically, random forest is used when you are just looking for high performance with less need for
interpretation.
Boosting
Boosting is a technique to combine weak learners and convert them into strong ones with the help of
Machine Learning algorithms. It uses ensemble learning to boost the accuracy of a model. Ensemble learning
is a technique to improve the accuracy of Machine Learning models. There are two types of ensemble
learning:
1. Sequential ensemble learning (boosting): the outputs from individual weak learners are combined sequentially during the
training phase. The performance of the model is boosted by assigning higher weights to the samples that are
incorrectly classified. The AdaBoost algorithm, discussed below, is an example of sequential learning.
2. Parallel ensemble learning (bagging): the outputs from the weak learners are generated in parallel. It reduces errors
by averaging the outputs from all weak learners. The random forest algorithm is an example of parallel
ensemble learning.
Boosting creates a strong learner by combining the predictions of a majority of weak learners. It
helps in increasing the prediction power of the Machine Learning model. This is done by training a series of
weak models.
Below are the steps that show the mechanism of the boosting algorithm:
1. Reading the data
2. Assigning initial weights to the observations and training a weak learner
3. Identifying the incorrectly predicted (misclassified) observations
4. Assigning the false predictions, along with a higher weightage, to the next learner
5. Repeating the process until the desired accuracy is reached
Now, we will explore various interpretations of weakness and their corresponding algorithms.
1. AdaBoost (Adaptive Boosting)
Adaptive boosting is a technique used for binary classification. For implementing AdaBoost, we use short
decision trees (decision stumps) as weak learners.
1. Initially, all training samples are given equal weights and a first weak learner is trained.
2. The predictions of this learner are evaluated on the training data.
3. Each weak learner consists of a decision tree; analyze the output of each decision tree and assign higher
weights to the misclassified results. This gives more significance to the prediction with higher weights.
4. Continue the process until the model becomes capable of predicting accurate results.
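A hedged Python sketch of AdaBoost with scikit-learn (library and dataset are illustrative assumptions): the default weak learner is a depth-1 decision tree (a decision stump), trained sequentially on re-weighted samples.

```python
# Hedged sketch: AdaBoost with decision stumps as weak learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```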
2. Gradient Boosting
In Machine Learning, we use gradient boosting to solve classification and regression problems. It is a
sequential ensemble learning technique where the performance of the model improves over iterations. This
method creates the model in a stage-wise fashion. It builds the model by optimizing an
arbitrary differentiable loss function. As we add each weak learner, a new model is created that gives a more
precise estimation of the response variable.
1. Loss function: To reduce errors in prediction, we need to optimize the loss function. Unlike in AdaBoost,
the incorrect result is not given a higher weightage in gradient boosting. It tries to reduce the loss function by
averaging the outputs from weak learners.
2. Weak learner: In gradient boosting, we require weak learners to make predictions. To get real values as
output, we use regression trees. To get the most suitable split point, we create trees in a greedy manner, which
can cause the model to overfit the dataset.
3. Additive model: In gradient boosting, we try to reduce the loss by adding decision trees one at a time. We can
also minimize the error rate by constraining the parameters. The model is designed in such a way
that the addition of a new tree does not change the trees that already exist.
Finally, we update the weights to minimize the error that is being calculated.
3. XGBoost
XGBoost algorithm is an extended version of the gradient boosting algorithm. It is basically designed to
enhance the performance and speed of a Machine Learning model.
Additionally, we have the XGBoost library, which provides gradient boosting frameworks for various
languages such as R, Python, Java, etc.
Advantages
The training and test error rates are both theoretically bounded
Less "overfitting" in practice
Many algorithms can be boosted
Easy to implement
Disadvantages
Learning is slow
Dimensionality Reduction
In machine learning classification problems, there are often too many factors on the basis of which the
final classification is done. These factors are basically variables called features. The higher the number
of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these
features are correlated, and hence redundant. This is where dimensionality reduction algorithms come
into play. Dimensionality reduction is the process of reducing the number of random variables under
consideration, by obtaining a set of principal variables. It can be divided into feature selection and
feature extraction.
Feature selection: In this, we try to find a subset of the original set of variables, or features, to get
a smaller subset which can be used to model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space,
i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include Principal Component Analysis (PCA),
Linear Discriminant Analysis (LDA) and Generalized Discriminant Analysis (GDA).
Example:
Suppose we have two sets of data points belonging to two different classes that we want to classify. As
shown in the given 2D graph, when the data points are plotted on the 2D plane, there's no straight line
that can separate the two classes of the data points completely. Hence, in this case, LDA (Linear
Discriminant Analysis) is used which reduces the 2D graph into a 1D graph in order to maximize the
separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data
onto a new axis in a way to maximize the separation of the two categories and hence, reducing the 2D
graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such
that it maximizes the distance between the means of the two classes and minimizes the variation within
each class. In simple terms, this newly generated axis increases the separation between the data points of
the two classes. After generating this new axis using the above-mentioned criteria, all the data points of
the classes are plotted on this new axis and are shown in the figure given below.
But Linear Discriminant Analysis fails when the means of the class distributions are shared, as it then becomes
impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we
use non-linear discriminant analysis.
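A minimal sketch of this 2D-to-1D reduction with scikit-learn's LinearDiscriminantAnalysis is shown below; the synthetic two-class data are an assumption used only to illustrate the projection onto a single discriminant axis.

from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical two-class data in two dimensions
X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

# LDA finds the axis that maximizes between-class separation relative to
# within-class variation, then projects the 2D points onto that 1D axis.
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)
print(X_1d.shape)   # (200, 1): every point now lies on the new discriminant axis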
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables); a comparison with LDA is sketched after this list.
2. Flexible Discriminant Analysis (FDA): Non-linear combinations of inputs, such as splines, are used.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the
variance (actually covariance), moderating the influence of different variables on LDA.
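As a rough illustration of the first extension, the sketch below compares LDA (one covariance estimate shared by all classes) with QDA (a covariance estimate per class) via cross-validation; the synthetic dataset and the choice of 5 folds are assumptions.

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# LDA assumes one covariance matrix shared by all classes;
# QDA estimates a separate covariance matrix for each class.
for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))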
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular application
in which each face is represented by a very large number of pixel values. Linear discriminant
analysis (LDA) is used here to reduce the number of features to a more manageable number before
the process of classification. Each of the new dimensions generated is a linear combination of pixel
values, which form a template. The linear combinations obtained using Fisher's linear discriminant
are called Fisher faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify a patient's disease
state as mild, moderate or severe based upon the patient's various parameters and the medical
treatment the patient is undergoing. This helps the doctors to intensify or reduce the pace of the
treatment.
3. Customer Identification: Suppose we want to identify the type of customers which are most
likely to buy a particular product in a shopping mall. By doing a simple question-and-answer
survey, we can gather all the features of the customers. Here, Linear discriminant analysis will
help us to identify and select the features which can describe the characteristics of the group of
customers that are most likely to buy that particular product in the shopping mall.
General Discriminant Analysis (GDA)
General Discriminant Analysis (GDA) is called a "general" discriminant analysis because it applies the
methods of the general linear model (see also General Linear Models (GLM)) to the discriminant
function analysis problem. A general overview of discriminant function analysis, and the traditional
methods for fitting linear models with categorical dependent variables and continuous predictors, is
provided in the context of Discriminant Analysis. In GDA, the discriminant function analysis problem is
"recast" as a general multivariate linear model, where the dependent variables of interest are (dummy-)
coded vectors that reflect the group membership of each case. The remainder of the analysis is then
performed as described in the context of General Regression Models (GRM), with a few additional
features noted below.
Advantages of Dimensionality Reduction
It helps in data compression, and hence reduces the required storage space.
It reduces computation time.
It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
It may lead to some amount of data loss.
PCA tends to find linear correlations between variables, which is sometimes undesirable.
PCA fails in cases where mean and covariance are not enough to define datasets.
We may not know how many principal components to keep; in practice, some rules of thumb are
applied, as sketched below.
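One common rule of thumb is to keep the smallest number of components that together explain, say, 95% of the variance; a sketch of how this is often checked with scikit-learn follows, using the digits dataset purely as an illustrative assumption.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA results are sensitive to scaling

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Rule of thumb: smallest k whose components explain ~95% of the total variance
k = int(np.argmax(cumulative >= 0.95)) + 1
print("principal components to keep:", k)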
The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set
consisting of many variables correlated with each other, either heavily or lightly, while retaining the
variation present in the dataset, up to the maximum extent. The same is done by transforming the
variables to a new set of variables, which are known as the principal components (or simply, the PCs)
and are orthogonal, ordered such that the retention of variation present in the original variables decreases
as we move down in the order. So, in this way, the 1st principal component retains the maximum variation
that was present in the original variables. The principal components are the eigenvectors of the
covariance matrix, and hence they are orthogonal.
Importantly, the dataset on which the PCA technique is to be used must be scaled, and the results are
sensitive to the relative scaling. In layman's terms, PCA is a method of summarizing data. Imagine some wine
bottles on a dining table, where each wine is described by its attributes like colour, strength, age, etc.
Redundancy will arise because many of these attributes measure related properties, so what PCA does in
this case is summarize each wine in the stock with fewer characteristics.
Intuitively, Principal Component Analysis can supply the user with a lower-dimensional picture, a
projection or "shadow" of this object when viewed from its most informative viewpoint.
Dimensionality: It is the number of random variables in a dataset or simply the number of features, or
rather more simply, the number of columns present in your dataset.
Correlation: It shows how strongly two variables are related to each other. Its value ranges from -1 to +1:
a positive value indicates that when one variable increases, the other increases as well, while a negative
value indicates that the other decreases as the former increases. The absolute value indicates the
strength of the relationship.
Orthogonal: Uncorrelated to each other, i.e., correlation between any pair of variables is 0.
Eigenvectors: Eigenvectors and eigenvalues are a large topic in themselves; here we only need the
following. Consider a non-zero vector v. It is an eigenvector of a square matrix A if Av is a scalar
multiple of v, or simply:
Av = λv
Here, v is the eigenvector and λ is the eigenvalue associated with it.
Covariance Matrix: This matrix consists of the covariances between the pairs of variables. The (i,j)th
element is the covariance between i-th and j-th variable.
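Putting these definitions together, a bare-bones sketch of PCA via the eigendecomposition of the covariance matrix could look like this; the synthetic correlated data and the choice of keeping two components are assumptions made only for illustration.

import numpy as np

# Hypothetical data: 100 samples, 3 features, with feature 3 correlated with feature 1
rng = np.random.RandomState(42)
X = rng.randn(100, 3)
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.randn(100)

X_centered = X - X.mean(axis=0)                  # centre (and, in general, scale) the data
cov = np.cov(X_centered, rowvar=False)           # covariance matrix of the features

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
order = np.argsort(eigenvalues)[::-1]            # sort by decreasing explained variance
components = eigenvectors[:, order]              # columns are the principal components

X_reduced = X_centered @ components[:, :2]       # project onto the top 2 components
print(X_reduced.shape)                           # (100, 2)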
Applications
PCA is predominantly used as a dimensionality reduction technique in domains like facial recognition,
computer vision and image compression. It is also used for finding patterns in data of high dimension in
the field of finance, data mining, bioinformatics, psychology, etc.
Image processing
Speech recognition
Recommendation engines
Text processing
Back propagation
Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of
a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of
the weights allows you to reduce error rates and make the model reliable by increasing its
generalization. "Backpropagation" is short for "backward propagation of errors." It is a standard method of
training artificial neural networks, and it computes the gradient of a loss function with respect to all the
weights in the network.
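As a rough sketch of this epoch-by-epoch fine-tuning of the weights, the toy example below trains a tiny two-layer sigmoid network with plain gradient descent on the squared-error loss; the XOR-style data, the network size, the learning rate, and the number of epochs are all illustrative assumptions, and convergence depends on the random initialization.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical XOR-style data and a 2-2-1 sigmoid network
rng = np.random.RandomState(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.randn(2, 2), np.zeros(2)        # hidden-layer weights and biases
W2, b2 = rng.randn(2, 1), np.zeros(1)        # output-layer weights and bias
eta = 0.5                                    # learning rate

for epoch in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    o = sigmoid(h @ W2 + b2)

    # backward pass: propagate the error from the output layer to the hidden layer
    delta_o = (o - t) * o * (1 - o)           # output-layer error term
    delta_h = (delta_o @ W2.T) * h * (1 - h)  # hidden-layer error term

    # gradient-descent updates: the per-epoch adjustment of the weights
    W2 -= eta * (h.T @ delta_o); b2 -= eta * delta_o.sum(axis=0)
    W1 -= eta * (X.T @ delta_h); b1 -= eta * delta_h.sum(axis=0)

print("final squared error:", float(((o - t) ** 2).sum()))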
There are two kinds of backpropagation networks:
Static back-propagation
Recurrent backpropagation
Static back-propagation:
It is the kind of backpropagation network that produces a mapping from a static input to a static output. It
is useful for solving static classification problems such as optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved; after that, the
error is computed and propagated backward.
The main difference between the two methods is that the mapping is immediate in static back-
propagation, while it is non-static in recurrent backpropagation.
An Artificial Neural Network (ANN), popularly known as a Neural Network, is a computational model
based on the structure and functions of biological neural networks. In Computer Science terms, it is like an
artificial human nervous system for receiving, processing, and transmitting information.
1. Input Layer (all the inputs are fed into the model through this layer)
2. Hidden Layers (there can be more than one hidden layer, used for processing the inputs received
from the input layer)
3. Output Layer (the data, after processing, is made available at the output layer)
Input Layer
The Input layer communicates with the external environment that presents a pattern to the neural
network. Its job is to deal with all the inputs only. This input gets transferred to the hidden layers which
are explained below. The input layer should represent the condition for which we are training the neural
network. Every input neuron should represent some independent variable that has an influence over the
output of the neural network.
Hidden Layer
The hidden layer is the collection of neurons to which an activation function is applied; it is an
intermediate layer found between the input layer and the output layer. Its job is to process the inputs
obtained from the previous layer, so it is the layer responsible for extracting the required features from
the input data. Much research has gone into evaluating the right number of neurons in the hidden layer,
but no method gives an exact answer. There can also be multiple hidden layers in a neural network, so
the natural question is how many hidden layers to use for which kind of problem. If the data can be
separated linearly, there is no need for a hidden layer at all, because an activation function applied at the
input layer can solve the problem. For problems that involve complex decisions, we can use 3 to 5 hidden
layers, based on the degree of complexity of the problem or the degree of accuracy required. That
certainly does not mean that continually increasing the number of layers keeps increasing accuracy: a
stage comes when the accuracy becomes constant, or even falls, if we add an extra layer. We should also
consider the number of neurons in each layer. If the number of neurons is small compared to the
complexity of the problem data, there will be too few neurons in the hidden layers to adequately detect
the signals in a complicated data set; if unnecessarily many neurons are present in the network,
overfitting may occur. The methods proposed so far do not provide an exact formula for the number of
hidden layers or the number of neurons in each hidden layer; in practice, a few configurations are simply
tried out, as sketched below.
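A minimal sketch of trying a few hidden-layer configurations with scikit-learn's MLPClassifier is shown below; the digits dataset and the hidden_layer_sizes values are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try a few hidden-layer configurations; accuracy eventually plateaus,
# and unnecessarily large networks risk overfitting.
for sizes in [(16,), (32,), (32, 16)]:
    model = MLPClassifier(hidden_layer_sizes=sizes, max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    print(sizes, "test accuracy:", round(model.score(X_test, y_test), 3))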
Output Layer
The output layer of the neural network collects and transmits the information in the way it has been
designed to. The pattern presented by the output layer can be directly traced back to the input layer. The
number of neurons in the output layer should be directly related to the type of work that the neural
network is performing: to determine the number of neurons in the output layer, first consider the intended
use of the neural network.
Flowchart
Advantages
Easy to conceptualize
Capable of detecting complex relationships
Large amount of academic research
Used extensively in industry for many years
Provide high speed calculations
Can handle a large number of features
Can be applied to a wide range of machine learning problems
Disadvantages
Neural networks are largely a black box, which makes them difficult to train and interpret
There are alternatives that are simpler, faster, easier to train, and that sometimes perform better
Cannot solve every machine learning problem
Neural networks are not probabilistic
Neural networks are not a substitute for understanding your problem
Applications
Handwriting recognition
Image compression
Signal processing
Pattern recognition
Traveling salesman problem
Stock exchange prediction
In the ALVINN autonomous driving system, the image from a forward-mounted camera is mapped to 960
neural-network inputs, which are fed forward to 4 hidden units, connected to 30 output units. The
accompanying figure shows the weight values for one of the hidden units in this network: the 30 x 32
weights into the hidden unit are displayed in the large matrix, with white blocks indicating positive and
black indicating negative weights. The weights from this hidden unit to the 30 output units are depicted
by the smaller rectangular block directly above the large block.
As can be seen from these output weights, activation of this particular hidden unit encourages a turn
toward the left. ANN learning of this kind is appropriate for problems with the following characteristics:
Instances are represented by many attribute-value pairs. The target function to be learned is
defined over instances that can be described by a vector of predefined features, such as the pixel
values in the ALVINN example. These input attributes may be highly correlated or independent
of one another. Input values can be any real values.
The target function output may be discrete-valued, real-valued, or a vector of several real- or
discrete-valued attributes. For example, in the ALVINN system the output is a vector of 30
attributes, each corresponding to a recommendation regarding the steering direction. The value
of each output is some real number between 0 and 1, which in this case corresponds to the
confidence in predicting the corresponding steering direction. We can also train a single network
to output both the steering command and suggested acceleration, simply by concatenating the
vectors that encode these two output predictions.
The training examples may contain errors. ANN learning methods are quite robust to noise in the
training data.
Long training times are acceptable. Network training algorithms typically require longer training
times than, say, decision tree learning algorithms. Training times can range from a few seconds
to many hours, depending on factors such as the number of weights in the network, the number
of training examples considered, and the settings of various learning algorithm parameters.
Fast evaluation of the learned target function may be required. Although ANN learning times are
relatively long, evaluating the learned network, in order to apply it to a subsequent instance, is
typically very fast. For example, ALVINN applies its neural network several times per second to
continually update its steering command as the vehicle drives forward.
The ability of humans to understand the learned target function is not important. The weights
learned by neural networks are often difficult for humans to interpret. Learned neural networks
are less easily communicated to humans than learned rules.
We first consider several alternative designs for the primitive units that make up artificial neural
networks (perceptrons, linear units, and sigmoid units), along with learning algorithms for training
single units. We then present the Backpropagation algorithm for training multilayer networks of
such units, and consider several general issues such as the representational capabilities of ANNs, the
nature of the hypothesis space search, overfitting problems, and alternatives to the Backpropagation
algorithm. A detailed example is also presented applying the Backpropagation algorithm to
face recognition, and directions are provided for the reader to obtain the data and code to
experiment further with this application.
Decision regions of a multilayer feed forward network. The network shown here was trained to
recognize 1 of 10 vowel sounds occurring in the context "hd" (e.g., "had," "hid"). The network input
consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network
outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is
highest. The plot on the right illustrates the highly nonlinear decision surface represented by the learned
network. Points shown on the plot are test examples distinct from the examples used to train the
network.
We have already derived a gradient descent learning rule for a single linear unit. However, multiple layers
of cascaded linear units still produce only linear functions, and we prefer networks capable of representing
highly nonlinear functions. The perceptron unit is another possible choice, but its discontinuous threshold makes it
non-differentiable and hence unsuitable for gradient descent. What we need is a unit whose output is a
nonlinear function of its inputs, but whose output is also a differentiable function of its inputs. One
solution is the sigmoid unit: a unit very much like a perceptron, but based on a smoothed, differentiable
threshold function. Like the perceptron, the sigmoid unit first computes a linear combination of its inputs
and then applies a threshold to the result; in the case of the sigmoid unit, however, the threshold output is
a continuous function of its input.
σ is often called the sigmoid function or, alternatively, the logistic function. Note that its output ranges
between 0 and 1, increasing monotonically with its input.
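For reference, the standard form of the sigmoid (logistic) function and of the sigmoid unit's output, in the usual notation of Mitchell's treatment, is

\sigma(y) = \frac{1}{1 + e^{-y}}, \qquad o = \sigma(\vec{w} \cdot \vec{x})

A property that makes it convenient for gradient descent is that its derivative can be expressed in terms of its own output: \sigma'(y) = \sigma(y)\,(1 - \sigma(y)).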
Linear unit: multiple layers of linear units still produce only linear functions.
where outputs is the set of output units in the network, and t_kd and o_kd are the target and output
values associated with the kth output unit and training example d. The learning problem faced by
BACKPROPAGATION is to search a large hypothesis space defined by all possible weight values for all the
units in the network. The situation can be visualized in terms of an error surface similar to the one
described below for linear units; the error in that picture is replaced by our new definition of E, and the
other dimensions of the space correspond now to all of the weights associated with all of the units in the
network. As in the case of training a single unit, gradient descent can be used to attempt to find a
hypothesis to minimize E.
Here the axes w0 and w1 represent possible values for the two weights of a simple linear unit. The w0,
w1 plane therefore represents the entire hypothesis space. The vertical axis indicates the error E relative
to some fixed set of training examples. The error surface shown in the figure thus summarizes the
desirability of every weight vector in the hypothesis space (we desire a hypothesis with minimum error).
Given the way in which we chose to define E, for linear units this error surface must always be parabolic
with a single global minimum. The specific parabola will depend, of course, on the particular set of
training examples.
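Gradient descent over this surface repeatedly moves the weight vector in the direction of steepest decrease of E; in the standard notation, with learning rate \eta, each weight is updated as

\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}, \qquad w_i \leftarrow w_i + \Delta w_i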