ML Unit 3
SUPERVISED LEARNING
Syllabus:
Linear Regression:
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x). So, this regression technique finds a linear relationship between x (input) and y (output); hence the name Linear Regression.
For example, X (input) could be the work experience and Y (output) the salary of a person. The regression line is the best-fit line for our model.
By achieving the best-fit regression line, the model aims to predict the y value such that the difference between the predicted value and the true value is minimal. So, it is very important to update the θ1 and θ2 values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).
The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y).
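A standard way to write this cost, assuming θ1 is the intercept and θ2 the slope (the notes do not give the exact parameterisation, so this is a sketch), is:
J(θ1, θ2) = sqrt( (1/n) * Σ (pred_i - y_i)^2 ), where pred_i = θ1 + θ2 * x_i and n is the number of training examples.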
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE value) and achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random θ1 and θ2 values and then iteratively update them until the minimum cost is reached.
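A minimal NumPy sketch of this idea (an illustration only: it minimises the mean squared error, which has the same minimiser as the RMSE, and the learning rate alpha and epoch count are arbitrary choices):

import numpy as np

def gradient_descent(x, y, alpha=0.01, epochs=1000):
    # theta1 = intercept, theta2 = slope; start from random values
    theta1, theta2 = np.random.randn(), np.random.randn()
    n = len(x)
    for _ in range(epochs):
        pred = theta1 + theta2 * x
        error = pred - y
        # gradients of the mean squared error with respect to theta1 and theta2
        grad1 = (2.0 / n) * error.sum()
        grad2 = (2.0 / n) * (error * x).sum()
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2

# toy data: a noisy line y = 1 + 2x
x = np.linspace(0, 5, 50)
y = 1 + 2 * x + 0.1 * np.random.randn(50)
print(gradient_descent(x, y))   # should approach (1, 2)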
Logistic Regression:
This type of statistical model (also known as a logit model) is often used for classification and
predictive analytics. Logistic regression estimates the probability of an event occurring, such
as voted or didn’t vote, based on a given dataset of independent variables. Since the outcome
is a probability, the dependent variable is bounded between 0 and 1. In logistic regression, a
logit transformation is applied on the odds—that is, the probability of success divided by the
probability of failure. This is also commonly known as the log odds, or the natural logarithm
of odds. The logit and the logistic (sigmoid) functions are represented by the following formulas:
logit(pi) = ln(pi / (1 - pi)) = β0 + β1*x1 + β2*x2 + ... + βk*xk
pi = 1 / (1 + exp(-(β0 + β1*x1 + ... + βk*xk)))
In this logistic regression equation, logit(pi) is the dependent or response variable and x is the
independent variable. The beta parameter, or coefficient, in this model is commonly estimated
via maximum likelihood estimation (MLE). This method tests different values of beta through
multiple iterations to optimize for the best fit of log odds. All of these iterations produce the
log likelihood function, and logistic regression seeks to maximize this function to find the best
parameter estimate. Once the optimal coefficient (or coefficients if there is more than one
independent variable) is found, the conditional probabilities for each observation can be
calculated, logged, and summed together to yield a predicted probability. For binary
classification, a probability less than 0.5 predicts 0 while a probability greater than 0.5 predicts 1. After the model has been computed, it is best practice to evaluate how well the
model predicts the dependent variable, which is called goodness of fit. The Hosmer–
Lemeshow test is a popular method to assess model fit.
Log odds can be difficult to make sense of within a logistic regression data analysis. As a
result, exponentiating the beta estimates is common to transform the results into an odds
ratio (OR), easing the interpretation of results. The OR represents the odds that an outcome
will occur given a particular event, compared to the odds of the outcome occurring in the
absence of that event. If the OR is greater than 1, then the event is associated with higher odds of generating a specific outcome. Conversely, if the OR is less than 1, then the event is associated with lower odds of that outcome occurring. Based on the equation from above,
the interpretation of an odds ratio can be denoted as the following: the odds of a success
changes by exp(cB_1) times for every c-unit increase in x. To use an example, let’s say that
we were to estimate the odds of survival on the Titanic given that the person was male, and
the odds ratio for males was .0810. We’d interpret the odds ratio as the odds of survival of
males decreased by a factor of .0810 when compared to females, holding all other variables
constant.
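A minimal scikit-learn sketch of this workflow on synthetic data (the feature set and coefficients here are hypothetical; the point is that exponentiating the fitted coefficients gives the odds ratios discussed above):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two hypothetical predictors
logits = 0.5 * X[:, 0] - 1.5 * X[:, 1]              # synthetic log odds
y = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)              # coefficients estimated via MLE
print("coefficients:", model.coef_[0])
print("odds ratios :", np.exp(model.coef_[0]))      # exp(beta) -> odds ratio per unit increase
print("P(y=1), first 5 rows:", model.predict_proba(X[:5])[:, 1])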
There are three types of logistic regression models, which are defined based on the categorical response: binary logistic regression (two possible outcomes), multinomial logistic regression (three or more unordered categories), and ordinal logistic regression (three or more ordered categories).
Logistic regression is commonly used for prediction and classification problems. Some of these
use cases include:
Fraud detection: Logistic regression models can help teams identify data anomalies,
which are predictive of fraud. Certain behaviors or characteristics may have a higher
association with fraudulent activities, which is particularly helpful to banking and other
financial institutions in protecting their clients. SaaS-based companies have also
started to adopt these practices to eliminate fake user accounts from their datasets
when conducting data analysis around business performance.
Disease prediction: In medicine, this analytics approach can be used to predict the
likelihood of disease or illness for a given population. Healthcare organizations can set
up preventative care for individuals that show higher propensity for specific illnesses.
Churn prediction: Specific behaviors may be indicative of churn in different functions
of an organization. For example, human resources and management teams may want
to know if there are high performers within the company who are at risk of leaving the
organization; this type of insight can prompt conversations to understand problem
areas within the company, such as culture or compensation. Alternatively, the sales
organization may want to learn which of their clients are at risk of taking their business
elsewhere. This can prompt teams to set up a retention strategy to avoid lost revenue.
Decision Tree:
Below are some assumptions that we make while using a decision tree:
At the beginning, we consider the whole training set as the root.
Feature values are preferred to be categorical. If the values are continuous then
they are discretized prior to building the model.
On the basis of attribute values records are distributed recursively.
We use statistical methods for ordering attributes as root or the internal node.
A Decision Tree works on the Sum of Products form, which is also known as Disjunctive Normal Form (for example, a tree predicting whether people use a computer in their daily life). In a Decision Tree, the major challenge is the identification of the attribute for the root node at each level.
This process is known as attribute selection. We have two popular attribute selection
measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A, then
Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) * Entropy(Sv)
Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
Definition: If S is a set of instances and p_i is the proportion of instances in S belonging to class i, then
Entropy(S) = - Σ over classes i of p_i * log2(p_i)
Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy H(X) = -[(3/8) * log2(3/8) + (5/8) * log2(5/8)]
= -(-0.531 - 0.424)
= 0.954
The essentials:
Start with all training instances associated with the root node
Use info gain to choose which attribute to label each node with
Note: No root-to-leaf path should contain the same discrete attribute twice
Recursively construct each subtree on the subset of training instances that would
be classified down that path in the tree.
The border cases:
If all positive or all negative training instances remain, label that node “yes” or “no”
accordingly
If no attributes remain, label with a majority vote of training instances left at that
node
If no instances remain, label with a majority vote of the parent’s training instances
Example:
Now, let's draw a Decision Tree for the following data using Information Gain.
Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Split on feature X: Gain(S, X) = 1 - (3/4)(0.918) - (1/4)(0) ≈ 0.311
Split on feature Y: Gain(S, Y) = 1 - (2/4)(0) - (2/4)(0) = 1.0
Split on feature Z: Gain(S, Z) = 1 - (2/4)(1) - (2/4)(1) = 0.0
From the calculations above we can see that the information gain is maximum when we split on feature Y. So, the best-suited feature for the root node is Y. We can also see that when splitting the dataset on feature Y, each child contains a pure subset of the target variable, so we do not need to split the dataset further.
The final tree for the above dataset would look like this: feature Y is the root node; Y = 1 leads to a leaf predicting class I, and Y = 0 leads to a leaf predicting class II.
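A small NumPy sketch that reproduces these information-gain values for the toy table above:

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    # Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over the values v of A
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

X = np.array([1, 1, 0, 1]); Y = np.array([1, 1, 0, 0]); Z = np.array([1, 0, 1, 0])
C = np.array(["I", "I", "II", "II"])
for name, col in [("X", X), ("Y", Y), ("Z", Z)]:
    print(name, round(information_gain(col, C), 3))   # X: 0.311, Y: 1.0, Z: 0.0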
2. Gini Index
Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified.
It means an attribute with a lower Gini index should be preferred.
Sklearn supports the "gini" criterion for the Gini Index and, by default, it takes the "gini" value.
The formula for the calculation of the Gini Index is given below:
Gini(S) = 1 - Σ over classes i of (p_i)^2
Example:
Let's consider the dataset in the image below and draw a decision tree using the Gini index.
[Dataset table with attributes A, B, C, D and the class attribute E - not reproduced here.]
In the dataset above there are 5 attributes, of which attribute E is the predicted feature; it contains 2 classes (Positive and Negative), with an equal proportion of both classes. In the Gini Index approach, we have to choose some values to categorize each attribute. The values used for this dataset include A < 5, B < 3.0 and C < 4.2.
Attribute A, split at < 5: Positive = 3, Negative = 1
Attribute B, split at < 3.0: Positive = 0, Negative = 4, giving Gini Index of B = 0.3345
Attribute C, split at < 4.2: Positive = 8, Negative = 2
Using the same approach we can calculate the Gini index for the C and D attributes.
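A short Python sketch of how the Gini index of a candidate split is computed (the class counts below are illustrative only, not taken from the dataset above):

def gini(counts):
    # Gini(S) = 1 - sum(p_i^2) over the class proportions in a node
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(left_counts, right_counts):
    # Gini of a split = size-weighted average of the two child nodes' Gini values
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# illustrative split: [positive, negative] counts in the two child nodes
print(weighted_gini([3, 1], [5, 7]))   # about 0.458; the split with the lowest value is preferred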
Rule-based classifiers are just another type of classifier which makes the class decision by using various "if...else" rules. These rules are easily interpretable and thus these classifiers are generally used to generate descriptive models. The condition used with "if" is called the antecedent and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
Coverage: The percentage of records which satisfy the antecedent conditions of a
particular rule.
The rules generated by the rule-based classifiers are generally not mutually
exclusive, i.e. many rules can cover the same record.
The rules generated by the rule-based classifiers may not be exhaustive, i.e. there
may be some records which are not covered by any of the rules.
The decision boundaries created by individual rules are linear, but the overall decision regions can be much more complex than those of a decision tree because many rules can be triggered for the same record.
An obvious question, which comes to mind after knowing that the rules are not mutually exclusive, is how the class would be decided when different rules with different consequents cover the same record.
Either the rules can be ordered, i.e. the class corresponding to the highest-priority rule triggered is taken as the final class.
Or we can assign votes to each class depending on the rules' weights, i.e. the rules remain unordered.
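A minimal, hypothetical sketch of an ordered rule list in Python (the rules and attribute names are illustrative only, loosely inspired by the mushroom example that follows):

def classify(record, rules, default="poisonous"):
    # rules are tried in priority order; the first matching antecedent decides the class
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent
    return default                      # the empty rule {} -> default class

# each rule = (antecedent condition, consequent class)
rules = [
    (lambda r: r["odour"] == "almond", "edible"),
    (lambda r: r["odour"] == "anise", "edible"),
    (lambda r: r["odour"] == "none" and r["bruises"] == "no", "edible"),
]

print(classify({"odour": "anise", "bruises": "yes"}, rules))   # edible
print(classify({"odour": "foul", "bruises": "no"}, rules))     # poisonous (default class)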
Example:
Below is the dataset to classify mushrooms as edible or poisonous:
Class | Cap Shape | Cap Surface | Bruises | Odour | Stalk Shape | Population | Habitat
Edible | convex | scaly | yes | almond | tapering | scattered | meadows
Edible | flat | fibrous | yes | anise | enlargening | several | woods
Edible | flat | fibrous | no | none | enlargening | several | urban
3.4.1 Rules:
The algorithm given below generates a model with unordered rules and ordered classes, i.e.
we can decide which class to give priority while generating the rules.
A <- Set of attributes
T <- Set of training records
Y <- Set of classes
Y' <- Y ordered according to relevance
R <- Set of rules generated, initially an empty list
for each class y in Y'
    while the majority of class y records are not covered
        generate a new rule for class y, using the methods given above
        add this rule to R
        remove the records covered by this rule from T
    end while
end for
add the rule {} -> y', where y' is the default class
Note: The rule set can also be created indirectly by pruning (simplifying) other already generated models, such as a decision tree.
1. Rule Generation
Once a decision tree has been constructed, it is a simple matter to convert it into an equivalent
set of rules.
Converting a decision tree to rules before pruning has three main advantages:
Converting to rules allows distinguishing among the different contexts in which a decision node is used, since each root-to-leaf path produces its own rule.
Converting to rules removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves.
Converting to rules improves readability, since rules are often easier for people to understand.
To generate rules, trace each path in the decision tree, from root node to leaf node, recording
the test outcomes as antecedents and the leaf-node classification as the consequent.
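A hedged illustration with scikit-learn: export_text prints every root-to-leaf path of a fitted tree, and each printed branch can be read off as one if-then rule (the tiny X/Y/Z dataset from the information-gain example is reused here):

from sklearn.tree import DecisionTreeClassifier, export_text

# toy dataset from earlier: features X, Y, Z and class C
data = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
labels = ["I", "I", "II", "II"]

tree = DecisionTreeClassifier(criterion="entropy").fit(data, labels)
# the tests along each branch form the antecedent, the leaf class the consequent
print(export_text(tree, feature_names=["X", "Y", "Z"]))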
 | C1 | C2 | Marginal Sums
R1 | x11 | x12 | R1T = x11 + x12
R2 | x21 | x22 | R2T = x21 + x22
Marginal Sums | CT1 = x11 + x21 | CT2 = x12 + x22 | T = x11 + x12 + x21 + x22
The marginal sums and T, the total frequency of the table, are used to calculate expected
cell values in step 3 of the test for independence.
The general formula for obtaining the expected frequency of any cell xij, 1 <= i <= r, 1 <= j <= c, in a contingency table is given by:
E(xij) = (RiT * CTj) / T
where RiT and CTj are the row total for the ith row and the column total for the jth column.
if | then use
m >= 10 | Chi-Square Test
5 <= m < 10 | Yates' Correction for Continuity
m < 5 | Fisher's Exact Test
The degrees of freedom are df = (r - 1)(c - 1).
7. Use a chi-square table with α and df to determine if the conclusions are independent from the antecedent at the selected level of significance, α.
o If the computed χ2 statistic is greater than or equal to the critical χ2 value (at α and df):
Reject the null hypothesis of independence and accept the alternate hypothesis of dependence.
We keep the antecedents because the conclusions are dependent upon them.
o If the computed χ2 statistic is less than the critical χ2 value:
Accept the null hypothesis of independence.
We discard the antecedents because the conclusions are independent from them.
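A short illustration with SciPy (the 2x2 counts are hypothetical): chi2_contingency returns the expected frequencies, the chi-square statistic and its p-value for a contingency table, so the keep/discard decision above can be read from the p-value.

from scipy.stats import chi2_contingency

# hypothetical 2x2 table: rows = antecedent satisfied / not satisfied,
# columns = conclusion class C1 / C2
observed = [[30, 10],
            [12, 28]]

chi2, p_value, df, expected = chi2_contingency(observed)
print("expected frequencies:\n", expected)     # (RiT * CTj) / T for each cell
print("chi2 =", round(chi2, 3), "df =", df, "p =", round(p_value, 4))
# a small p-value (e.g. < 0.05): reject independence, so keep the antecedent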
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified
using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. Since the support vectors create a decision boundary between these two classes of data (cat and dog) and pick the extreme cases (support vectors), the model will consider the extreme cases of cat and dog. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
1. Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features, the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect its position are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2.
We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue.
Consider the below image:
Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.
The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
z = x^2 + y^2
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-y plane. If we convert it back to 2-D space by setting z = 1, the boundary becomes a circle of radius 1 around the origin (x^2 + y^2 = 1).
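A minimal scikit-learn sketch contrasting a linear and an RBF-kernel SVM on data that is only separable with a non-linear boundary (the circular dataset mirrors the z = x^2 + y^2 idea above; the value of C is arbitrary):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
# class 1 inside the unit circle, class 0 outside: not linearly separable in 2-D
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # limited: no straight line separates a circle
print("RBF kernel accuracy   :", rbf_svm.score(X, y))      # close to 1.0
print("support vectors per class (RBF):", rbf_svm.n_support_)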
A Radial basis function is a function whose value depends only on the distance from the origin.
In effect, the function must contain only real values. Alternative forms of radial basis functions
are defined as the distance from another point denoted C, called a center.
A Radial Basis Function works by defining itself through the distance from its origin or center. This is done by incorporating the absolute value (more generally, the norm) of its argument. Absolute values are defined as the value without its associated sign (positive or negative); for example, the absolute value of -4 is 4. Accordingly, the radial basis function is a function whose values are defined as:
φ(x) = f(||x - c||)
where ||x - c|| is the distance between the input x and the center c.
The Gaussian variation of the Radial Basis Function, often applied in Radial Basis Function Networks, is a popular alternative. The formula for a Gaussian with a one-dimensional input x and center c is:
φ(x) = exp(-β * (x - c)^2)
The Gaussian function can be plotted with various values of β (Beta); larger values of β give a narrower bump.
Radial basis functions make up the core of the Radial Basis Function Network, or RBFN. This
particular type of neural network is useful in cases where data may need to be classified in a
non-linear way. RBFNs work by incorporating the Radial basis function as a neuron and using
it as a way of comparing input data to training data. An input vector is processed by multiple
Radial basis function neurons, with varying weights, and the weighted sum of the neuron outputs produces a similarity value. If input vectors match the training data, they will have a high similarity
value. Alternatively, if they do not match the training data, they will not be assigned a high
similarity value. Comparing similarity values with different classifications of data allows for
non-linear classification.
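A tiny NumPy sketch of this similarity idea (the centers, weights and beta value are arbitrary illustrations, not a full RBFN training procedure):

import numpy as np

def gaussian_rbf(x, center, beta=1.0):
    # phi(x) = exp(-beta * ||x - center||^2): equals 1 at the center and decays with distance
    return np.exp(-beta * np.sum((x - center) ** 2))

# hypothetical prototype vectors, one per class, standing in for trained RBF neurons
centers = {"class_A": np.array([0.0, 0.0]), "class_B": np.array([3.0, 3.0])}
weights = {"class_A": 1.0, "class_B": 1.0}

x = np.array([0.5, 0.2])                             # new input vector
scores = {label: weights[label] * gaussian_rbf(x, c) for label, c in centers.items()}
print(scores)                                        # higher score = more similar
print("predicted:", max(scores, key=scores.get))     # class_A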
What is a classifier?
A classifier is a machine learning model that is used to discriminate different objects based
on certain features.
3.8.1 Principle of Naive Bayes Classifier:
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on Bayes' theorem.
Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is, the presence of one particular feature does not affect the other. Hence it is called naive.
Example: Let us take an example to get some better intuition. Consider the problem of classifying whether a day is suitable for playing golf, given the weather conditions of the day (outlook, temperature, humidity and windy). If we take the first row of such a dataset, we might observe that it is not suitable for playing golf if the outlook is rainy, the temperature is hot, the humidity is high and it is not windy. We make two assumptions
here: one, as stated above, is that these predictors are independent. That is, if the
temperature is hot, it does not necessarily mean that the humidity is high. Another assumption
made here is that all the predictors have an equal effect on the outcome. That is, the day
being windy does not have more importance in deciding to play golf or not.
The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not, given the conditions. The variable X represents the parameters/features.
X is given as X = (x_1, x_2, ..., x_n).
Here x_1, x_2, ..., x_n represent the features, i.e. they can be mapped to outlook, temperature, humidity and windy. By substituting for X and expanding using the chain rule (together with the naive independence assumption), we get:
P(y | x_1, ..., x_n) = [ P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y) * P(y) ] / [ P(x_1) * P(x_2) * ... * P(x_n) ]
Now, you can obtain the values for each term by looking at the dataset and substitute them into the equation. For all entries in the dataset, the denominator does not change; it remains constant. In our case, the class variable (y) has only two outcomes, yes or no, but there could be cases where the classification is multiclass. Therefore, we need to find the class y with the maximum probability:
y = argmax over y of P(y) * P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y)
Using the above function, we can obtain the class, given the predictors.
Since the way the values are present in the dataset changes, the formula for the conditional probability changes; for example, for continuous features a Gaussian (normal) distribution is commonly assumed, giving the Gaussian Naive Bayes classifier.
3.8.3 Conclusion: Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, and this hinders the performance of the classifier.
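A short scikit-learn sketch of a categorical Naive Bayes model on a tiny, made-up play-golf table (the rows are illustrative, not the full dataset referred to above):

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# columns: outlook, temperature, humidity, windy (all categorical)
X_raw = [["rainy", "hot", "high", "no"],
         ["rainy", "hot", "high", "yes"],
         ["overcast", "hot", "high", "no"],
         ["sunny", "mild", "high", "no"],
         ["sunny", "cool", "normal", "no"],
         ["sunny", "cool", "normal", "yes"]]
y = ["no", "no", "yes", "yes", "yes", "no"]

encoder = OrdinalEncoder()                     # CategoricalNB expects integer-coded categories
X = encoder.fit_transform(X_raw)

model = CategoricalNB().fit(X, y)
query = encoder.transform([["sunny", "mild", "high", "no"]])
print(model.predict(query))                    # predicted play-golf class
print(model.predict_proba(query))              # [P(no), P(yes)] for the query day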
Bayesian Network Example:
Consider a person 'gfg' who has installed an alarm at home that can be triggered by a burglary 'B' or a fire 'F'. Two neighbours, 'P1' and 'P2', have agreed to call 'gfg' when they hear the alarm. But there are a few drawbacks in this case: sometimes 'P1' may forget to call 'gfg', even after hearing the alarm, as he has a tendency to forget things quickly. Similarly, 'P2' sometimes fails to call 'gfg', as he is only able to hear the alarm from a certain distance.
Example Problem:
Q) Find the probability that 'P1' is true (P1 has called 'gfg') and 'P2' is true (P2 has called 'gfg') when the alarm 'A' rang, but no burglary 'B' or fire 'F' has occurred.
=> P ( P1, P2, A, ~B, ~F) [ where- P1, P2 & A are ‘true’ events and ‘~B’ & ‘~F’ are ‘false’
events]
[Note: The values mentioned below are neither calculated nor computed; they are observed (given) values.]
Burglary ‘B’ –
P (B=T) = 0.001 (‘B’ is true i.e burglary has occurred)
P (B=F) = 0.999 (‘B’ is false i.e burglary has not occurred) Fire
‘F’ –
P (F=T) = 0.002 (‘F’ is true i.e fire has occurred)
P (F=F) = 0.998 (‘F’ is false i.e fire has not occurred)
Alarm ‘A’ –
B F P (A=T) P (A=F)
T T 0.95 0.05
T F 0.94 0.06
F T 0.29 0.71
F F 0.001 0.999
The alarm ‘A’ node can be ‘true’ or ‘false’ ( i.e may have rung or may not have
rung). It has two parent nodes burglary ‘B’ and fire ‘F’ which can be ‘true’ or ‘false’
(i.e may have occurred or may not have occurred) depending upon different
conditions.
Person ‘P1’ –
A P (P1=T) P (P1=F)
T 0.95 0.05
F 0.05 0.95
The person ‘P1’ node can be ‘true’ or ‘false’ (i.e may have called the person ‘gfg’
or not) . It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e may
have rung or may not have rung ,upon burglary ‘B’ or fire ‘F’).
Person ‘P2’ –
A P (P2=T) P (P2=F)
T 0.80 0.20
F 0.01 0.99
The person ‘P2’ node can be ‘true’ or false’ (i.e may have called the person ‘gfg’ or
not). It has a parent node, the alarm ‘A’, which can be ‘true’ or ‘false’ (i.e may have
rung or may not have rung, upon burglary ‘B’ or fire ‘F’).
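Putting the tables together (a worked solution, assuming the standard Bayesian network factorisation implied above, where P1 and P2 depend only on A, and A depends on B and F):
P(P1, P2, A, ~B, ~F) = P(P1 | A) * P(P2 | A) * P(A | ~B, ~F) * P(~B) * P(~F)
= 0.95 * 0.80 * 0.001 * 0.999 * 0.998
≈ 0.00076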
In this assignment you will use an SVM to classify emails into spam or non-spam categories, and report the classification accuracy for various SVM parameters and kernel functions.
Data Set Description: An email is represented by various features like the frequency of occurrence of certain keywords, the length of capitalized words, etc.
A data set containing about 4601 instances is available at this link (data folder):
https://archive.ics.uci.edu/ml/datasets/Spambase.
You have to randomly pick 70% of the data set as training data and the remaining as
test data.
Assignment Tasks: In this assignment you can use any SVM package to classify the
above data set.
You should use one of the following languages: C/C++/Java/Python. You have to study the performance of the SVM algorithms.
You have to submit a report in pdf format.
The report should contain the following sections:
1. Methodology: Details of the SVM package used.
2. Experimental Results:
i. You have to use each of the following three kernel functions: (a) Linear, (b) Quadratic, (c) RBF.
ii. For each of the kernels, you have to report training and test set classification
accuracy for the best value of generalization constant C.
The best C value is the one which provides the best test set accuracy, found by trying different values of C. Report the accuracies in the form of a comparison table, along with the corresponding values of C.
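A possible starting point in Python (a sketch only: it assumes the Spambase data has been downloaded from the link above into a local file named spambase.data, whose last column is the class label):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = np.loadtxt("spambase.data", delimiter=",")    # local copy of the UCI data file
X, y = data[:, :-1], data[:, -1]

# random 70% / 30% split, as required by the assignment
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

kernels = {"Linear": {"kernel": "linear"},
           "Quadratic": {"kernel": "poly", "degree": 2},
           "RBF": {"kernel": "rbf"}}

for name, params in kernels.items():
    for C in [0.1, 1, 10, 100]:                      # trial values of the constant C
        clf = SVC(C=C, **params).fit(X_train, y_train)
        print(name, "C =", C,
              "train acc =", round(clf.score(X_train, y_train), 4),
              "test acc  =", round(clf.score(X_test, y_test), 4))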
11. PART A Q & A (WITH K LEVEL AND CO) UNIT 3
1. What are the two types of problems solved by Supervised Learning?
Classification: It uses algorithms to assign the test data into specific categories.
Common classification algorithms are linear classifiers, support vector
machines (SVM), decision trees, k-nearest neighbor, and random forest.
Regression: It is used to understand the relationship
between dependent and independent variables. Linear regression, logistic regression, and polynomial regression are popular regression algorithms.
Support vector machines (SVMs) are a set of supervised learning methods used
for classification, regression and outliers detection. The objective of the support
vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the
number of features) that distinctly classifies the data points.
Support vector machines focus only on the points that are the most difficult to tell
apart, whereas other classifiers pay attention to all of the points.
The intuition behind the support vector machine approach is that if a classifier is good
at the most challenging comparisons (the points in B and A that are closest to each
other), then the classifier will be even better at the easy comparisons (comparing
points in B and A that are far away from each other).
o You get a bunch of photos with information about what is on them and
then you train a model to recognize new photos.
o You have a bunch of molecules and information about which are
drugs and you train a model to answer whether a new molecule is also
a drug.
o Based on past information about spams, filtering out a new incoming
email into Inbox (normal) or Junk folder (Spam)
o Cortana or any speech automated system in your mobile phone trains
your voice and then starts working based on this training.
o Train your handwriting to an OCR system and once trained, it will be able to convert your handwriting images into text (with some level of accuracy, obviously).
KNN is based on the assumption that similar points are close to each other; KNN hinges on this assumption being true enough for the algorithm to be useful.
There are many different ways of calculating the distance between the points,
however, the straight line distance (Euclidean distance) is a popular and familiar
choice.
The goal of any supervised machine learning algorithm is to best estimate the mapping
function (f) for the output variable (Y) given the input data (X). The mapping function
is often called the target function because it is the function that a given supervised
machine learning algorithm aims to approximate.
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Generally, linear algorithms have a high bias, making them fast to learn and easier to understand, but generally less flexible.
Answer
Correlation measures how strongly two or more variables are related to each other. Its values lie between -1 and 1. Correlation measures both the strength and direction of the linear relationship between two variables. Correlation is a function of the covariance.
In a supervised logistic regression, features are mapped onto the output. The
output is usually a categorical value (which means that it is mapped with one-
hot vectors or binary numbers).
Since the logistic (sigmoid) function always outputs a value between 0 and 1, it gives the probability of the outcome.
9. What are some challenges faced when using a Supervised Regression Model?
Some challenges faced when using a supervised regression model are:
Nonlinearities: Real-world data points are more complex and do not follow a
linear relationship. Sometimes a non-linear model is better at fitting the dataset.
So, it is a challenge to find the perfect equation for the dataset.
Multicollinearity: Multicollinearity is a phenomenon where one predictor
variable in a multiple regression model can be linearly predicted from the others
with a substantial degree of accuracy. If there is a problem of multicollinearity
then even the slightest change in the independent variable causes the output
to change erratically. Thus, the accuracy of the model is affected and it
undermines the quality of the whole model.
Outliers: Outliers can have a huge impact on the machine learning model. This happens because the regression model tries to fit the outliers into the model as well.