UNIT-3
Classification
Prediction: Prediction is also used to determine unknown or missing values. Like classification, it relies on models such as neural networks, if-then rules, and other mechanisms to predict attribute values.
Binary Attributes: The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes: Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways: as a multiway split or as a binary split.
Binary split: partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all 2^(k−1) − 1 ways of creating a binary partition of k attribute values. The figure illustrates three different ways of grouping the attribute values for Marital Status into two subsets.
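As a quick illustration, the sketch below enumerates all 2^(k−1) − 1 binary partitions of a nominal attribute; the attribute values are taken from the Marital Status example, and everything else is illustrative.

```python
def binary_splits(values):
    """Enumerate all 2^(k-1) - 1 distinct binary partitions of a nominal attribute.
    Fixing the first value in the left group avoids counting mirror partitions twice."""
    first, rest = values[0], values[1:]
    for mask in range(1, 2 ** len(rest)):
        left = [first] + [v for i, v in enumerate(rest) if not mask >> i & 1]
        right = [v for i, v in enumerate(rest) if mask >> i & 1]
        yield left, right

# k = 3 values -> 2^(3-1) - 1 = 3 binary partitions
for left, right in binary_splits(["Single", "Married", "Divorced"]):
    print(left, "|", right)
```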
Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. The figure illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures (a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure (c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.
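To make the order property concrete, here is a small sketch that checks whether a grouping keeps each group contiguous in the attribute's natural order; the Shirt Size values and the two groupings are the ones described above:

```python
def preserves_order(groups, ordered_values):
    """A grouping preserves order iff every group is a contiguous run of the ordered values."""
    for group in groups:
        idx = sorted(ordered_values.index(v) for v in group)
        if idx != list(range(idx[0], idx[-1] + 1)):
            return False
    return True

sizes = ["Small", "Medium", "Large", "Extra Large"]
print(preserves_order([["Small", "Medium"], ["Large", "Extra Large"]], sizes))  # True
print(preserves_order([["Small", "Large"], ["Medium", "Extra Large"]], sizes))  # False
```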
Based on these calculations, node N1 has the lowest impurity value, followed by N2 and N3. This example, along with the figure, shows the consistency among the impurity measures: if node N1 has a lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than those of N2. Despite this agreement, the attribute chosen as the splitting criterion by each impurity measure can still differ.
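All three impurity measures can be computed directly from a node's class counts. The sketch below is illustrative; the class counts assigned to N1, N2, and N3 are assumptions (the original figure's counts are not reproduced in these notes), chosen so that N1 is pure, N2 is skewed, and N3 is balanced:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def error(counts):
    return 1 - max(counts) / sum(counts)

# Assumed class counts: all three measures rank N1 < N2 < N3.
for name, counts in [("N1", (0, 6)), ("N2", (1, 5)), ("N3", (3, 3))]:
    print(name, round(entropy(counts), 3), round(gini(counts), 3), round(error(counts), 3))
```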
Collective Impurity of Child Nodes:
Consider an attribute test condition that splits a node containing N training instances into k children, {v1, v2, ..., vk}, where each child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition. Let N(vj) be the number of training instances associated with child node vj, whose impurity value is I(vj). Since a training instance in the parent node reaches node vj with probability N(vj)/N, the collective impurity of the child nodes can be computed as the weighted sum of the child impurities:

I(children) = Σ_{j=1}^{k} (N(vj)/N) × I(vj)
If Home Owner is chosen as the splitting attribute, the Gini index values for the child nodes N1 and N2 are 0 and 0.490, respectively. The weighted average Gini index for the children is then obtained by weighting each child's Gini index by its share of the training instances, as in the formula above.
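A minimal sketch of this weighted-sum computation, assuming for illustration that 3 of 10 training instances reach N1 and 7 reach N2 (a 4:3 class split in N2 yields the Gini of 0.490 stated above):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Assumed child class counts for the Home Owner split.
children = [(0, 3), (4, 3)]          # N1: pure (Gini 0), N2: Gini ~ 0.490
n_total = sum(sum(c) for c in children)

weighted = sum(sum(c) / n_total * gini(c) for c in children)
print(round(weighted, 3))            # 0.3 * 0 + 0.7 * 0.490 = 0.343
```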
Gain Ratio
Gain ratio is a modification of information gain that reduces its bias. It overcomes the problem with information gain by taking into account the number of branches that would result before making the split: it corrects information gain by factoring in the intrinsic information of a split. We can also say that gain ratio adds a penalty to information gain.
One potential limitation of impurity measures such as entropy and the Gini index is that they tend to favor qualitative attributes with a large number of distinct values. The figure shows three candidate attributes for partitioning the data set. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training.

This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node; the number of children produced by the splitting attribute should also be taken into consideration when deciding the best attribute test condition. There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying numbers of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

Gain Ratio = Information Gain / Split Info,  where Split Info = −Σ_{i=1}^{k} (N(vi)/N) log2 (N(vi)/N)

where N(vi) is the number of instances assigned to node vi and k is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates whether the split results in a large number of equally-sized child nodes. For example, if every partition has the same number of instances, then ∀i : N(vi)/N = 1/k and the split information equals log2 k. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn reduces the gain ratio.
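A short sketch of the split information and gain ratio computation; the child instance counts and the information gain value of 0.5 are illustrative assumptions:

```python
from math import log2

def split_info(child_counts):
    """Entropy of the partition itself: -sum of (N(vi)/N) * log2(N(vi)/N)."""
    n = sum(child_counts)
    return -sum(c / n * log2(c / n) for c in child_counts if c)

def gain_ratio(info_gain, child_counts):
    return info_gain / split_info(child_counts)

# Four equally-sized children: split info = log2(4) = 2.0,
# so the same information gain is penalized more than for a two-way split.
print(split_info([5, 5, 5, 5]))       # 2.0
print(gain_ratio(0.5, [5, 5, 5, 5]))  # 0.25 (assumed gain of 0.5)
print(gain_ratio(0.5, [10, 10]))      # 0.5  (split info = 1.0)
```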
2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. The popular measures include entropy and the Gini index. (A minimal sketch of this function, and of Classify() below, follows this list.)
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances associated with this node:

Label(t) = argmax_i p(i|t),
where the argmax operator returns the class i that maximizes p(i|t). Besides
providing the information needed to determine the class label of a leaf node, p(i|
t) can also be used as a rough estimate of the probability that an instance
assigned to the leaf node t belongs to class i.
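Here is a minimal sketch of both helper functions, using the Gini index as the impurity measure; the candidate splits and their child class counts are illustrative assumptions, not the figures from these notes:

```python
from collections import Counter

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_impurity(children):
    """Collective impurity of a candidate split (weighted sum of child Gini values)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

def find_best_split(candidates):
    """Return the attribute whose split has the lowest collective impurity."""
    return min(candidates, key=lambda attr: weighted_impurity(candidates[attr]))

def classify(leaf_labels):
    """Majority vote: Label(t) = argmax_i p(i|t), plus the probability estimate p(i|t)."""
    counts = Counter(leaf_labels)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(leaf_labels)

candidates = {
    "Home Owner": [(0, 3), (4, 3)],              # assumed child class counts
    "Marital Status": [(2, 2), (0, 4), (1, 1)],
}
print(find_best_split(candidates))               # 'Marital Status' (impurity 0.30 vs 0.343)
print(classify(["No", "No", "No", "Yes"]))       # ('No', 0.75)
```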
Confusion Matrix
False Negative (FN): the predicted value was falsely predicted as negative; the actual value was positive, but the model predicted a negative value. Also known as the Type 2 error.
Let me give you an example to better understand this. Suppose we have a classification dataset with 1000 data points. We fit a classifier on it and get the following confusion matrix:
True Positive (TP) = 560; meaning 560 positive class data points
were correctly classified by the model
True Negative (TN) = 330; meaning 330 negative class data
points were correctly classified by the model
False Positive (FP) = 60; meaning 60 negative class data points
were incorrectly classified as belonging to the positive class by
the model
False Negative (FN) = 50; meaning 50 positive class data points
were incorrectly classified as belonging to the negative class by
the model
This turned out to be a pretty decent classifier for our dataset, considering the relatively large numbers of true positive and true negative values.
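From these four counts, the overall accuracy and error rate follow directly; a quick check in Python:

```python
tp, tn, fp, fn = 560, 330, 60, 50
total = tp + tn + fp + fn            # 1000 data points

accuracy = (tp + tn) / total         # (560 + 330) / 1000 = 0.89
error_rate = (fp + fn) / total       # (60 + 50) / 1000 = 0.11
print(accuracy, error_rate)
```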
Example
Suppose we are trying to create a model that can predict whether or not a person has a particular disease. The confusion matrix for this is given as:
o The table is given for a two-class classifier, which has two predictions, "Yes" and "No." Here, Yes means that the patient has the disease, and No means that the patient does not have the disease.
o The classifier made a total of 100 predictions. Out of these 100 predictions, 89 are correct and 11 are incorrect.
o The model predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27 times and the actual "No" 73 times.
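The four cells of this matrix can be recovered from the stated totals alone; the sketch below does the arithmetic, with no assumptions beyond the numbers given above:

```python
correct, pred_yes, actual_yes, total = 89, 32, 27, 100

# TP + TN = correct, TP + FP = pred_yes, TP + FN = actual_yes,
# and FN + TN = total - pred_yes; solving these gives TP.
tp = (actual_yes + correct - (total - pred_yes)) // 2   # (27 + 89 - 68) / 2 = 24
fp = pred_yes - tp                                      # 8
fn = actual_yes - tp                                    # 3
tn = correct - tp                                       # 65
print(tp, fp, fn, tn)                                   # 24 8 3 65 (sums to 100)
```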
Recall vs Precision
Recall is the number of relevant documents retrieved by a search divided
by the total number of existing relevant documents, while precision is the
number of relevant documents retrieved by a search divided by the total
number of documents retrieved by that search.
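In confusion-matrix terms, recall = TP / (TP + FN) and precision = TP / (TP + FP). Using the counts from the 1000-point example above:

```python
tp, fp, fn = 560, 60, 50

precision = tp / (tp + fp)   # 560 / 620 ~ 0.903
recall = tp / (tp + fn)      # 560 / 610 ~ 0.918
print(round(precision, 3), round(recall, 3))
```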
o F-measure: If one model has low precision and high recall and another has the opposite, it is difficult to compare them. For this purpose, we can use the F-score, which evaluates recall and precision at the same time. The F-score is maximized when recall equals precision. It can be calculated using the formula below (a worked sketch appears at the end of this section):

F-score = 2 × (Precision × Recall) / (Precision + Recall)
o Null Error rate: It defines how often our model would be incorrect if it always predicted the majority class. Per the accuracy paradox, the best classifier for a particular application can have a higher error rate than the null error rate.
o ROC Curve: The ROC curve is a graph displaying a classifier's performance across all possible classification thresholds. The graph plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
o PRC area: area under the Precision-Recall Curve.
o MCC: Matthews Correlation Coefficient.
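A closing sketch tying the F-score, the null error rate, and ROC points together. The counts come from the 1000-point example above, while the scores and labels used for the ROC points are illustrative assumptions (that example did not include predicted probabilities):

```python
tp, tn, fp, fn = 560, 330, 60, 50

# F-score: harmonic mean of precision and recall.
precision, recall = tp / (tp + fp), tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)   # ~ 0.911

# Null error rate: error if we always predicted the majority class.
pos, neg = tp + fn, tn + fp                               # 610 actual Yes, 390 actual No
null_error_rate = min(pos, neg) / (pos + neg)             # 390 / 1000 = 0.39

def roc_points(scores, labels):
    """One (FPR, TPR) point per threshold; plotting these traces the ROC curve."""
    p, n = sum(labels), len(labels) - sum(labels)
    return [(sum(s >= t and y == 0 for s, y in zip(scores, labels)) / n,
             sum(s >= t and y == 1 for s, y in zip(scores, labels)) / p)
            for t in sorted(set(scores), reverse=True)]

print(round(f_score, 3), null_error_rate)
print(roc_points([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1]))     # assumed scores/labels
```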