
Unit 2

Classification
Classification
• Introduction
• Statistical Based Algorithm
• Distance Based Algorithm
• Tree Based Algorithm
• Rule Based Algorithm
• Neural Network Based Algorithm
• Combining Technique
Introduction
• Classification involves mapping input data to appropriate classes.
• Def: Given a database D = {t1, t2, ..., tn} of tuples (items, records) and a set of classes C = {C1, ..., Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class. A class Cj contains precisely those tuples mapped to it; that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n and ti ∈ D}.
• The problem is implemented in two phases:
1. Create a specific model by evaluating the training data.
2. Apply the model to classify tuples from the target database.
Introduction
• Issues in classification:
1. Missing data
2. Measuring performance
Missing Data
There are many approaches to handle the missing data:
• Ignore the missing data.
• Assume a value for the missing data.
• Assume a special value for the missing data.
Measuring Performance and Accuracy
• Classification accuracy is usually calculated by determining the percentage of tuples placed in the correct class.
• Given a specific class Cj, a database tuple ti may or may not be assigned to that class, while its actual membership may or may not be in that class. This gives us four quadrants:
• True positive (TP): ti is predicted to be in Cj and is actually in it.
• False positive (FP): ti is predicted to be in Cj but is not actually in it.
• True negative (TN): ti is not predicted to be in Cj and is not actually in it.
• False negative (FN): ti is not predicted to be in Cj but is actually in it.
Measuring Performance and Accuracy
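A minimal Python sketch (not part of the original slides) showing how common performance measures can be computed from the four counts defined above; the counts in the example call are made up for illustration:

def performance(tp, fp, tn, fn):
    # Fraction of tuples classified correctly
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    # Of the tuples predicted to be in the class, how many really are
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Of the tuples actually in the class, how many were found
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Example: 50 TP, 10 FP, 35 TN, 5 FN -> accuracy 0.85, precision ~0.83, recall ~0.91
print(performance(50, 10, 35, 5))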
Statistical Based Algorithm
• Regression
• Bayesian Classification
Regression
• Regression used for classification deals with estimation of an output value based on input values.
• It takes a set of data and fits the data to a formula.
• A simple linear regression problem can be thought of as estimating the formula for a straight line, y = c0 + c1·x.
Regression
• The actual data points do not fit the linear model exactly.
• We can estimate the accuracy of the fit of a linear regression model to the actual data using a mean squared error function.
Regression
• Classification can be performed using two different approaches:
1) Division: the data are divided into regions based on class.
2) Prediction: formulas are generated to predict the output class value.
• The prediction is an estimate rather than the actual output value.
• This technique does not work well with nonnumeric data.
Regression
• Another commonly used regression technique is logistic regression.
• Instead of fitting the data to a straight line, logistic regression uses a logistic curve.
• To perform the regression, the logarithmic function can be applied to obtain the logistic function, in which the log-odds of class membership is a linear function of the input: log(p / (1 − p)) = c0 + c1·x.
• Here p is the probability of being in the class and 1 − p is the probability that it is not. The process chooses values for c0 and c1 that maximize the probability of observing the given values.
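A minimal Python sketch (not part of the original slides) of using a fitted logistic curve to classify: the curve gives the probability p, and the tuple is assigned to the class when p exceeds 0.5. The coefficients here are hypothetical, chosen only for illustration:

import math

def logistic(x, c0, c1):
    # Probability of class membership from the logistic curve
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

c0, c1 = -3.0, 2.0                 # hypothetical fitted coefficients
p = logistic(1.8, c0, c1)          # p is roughly 0.65 for this input
print("in class" if p > 0.5 else "not in class")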
Example:
1. Apply linear regression to the following set of data and find the linear equation:
x: 2 4 6 8
y: 3 7 5 10
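A minimal Python sketch (not part of the original slides) of the least-squares solution for this data; it yields slope c1 = 0.95 and intercept c0 = 1.5, i.e. y = 1.5 + 0.95x:

xs = [2, 4, 6, 8]
ys = [3, 7, 5, 10]

# Least-squares estimates for y = c0 + c1*x
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))   # 19.0
sxx = sum((x - mean_x) ** 2 for x in xs)                          # 20.0
c1 = sxy / sxx              # slope = 0.95
c0 = mean_y - c1 * mean_x   # intercept = 1.5
print(f"y = {c0:.2f} + {c1:.2f}x")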
Bayesian Classification
• A classification scheme called naive Bayes can be used for classification.
• Bayes classification uses the Bayes rule of conditional probability:
P(h1 | xi) = P(xi | h1) · P(h1) / P(xi)
• Here P(h1 | xi) is called the posterior probability, while P(h1) is the prior probability associated with hypothesis h1.
• P(xi) is the probability of the occurrence of data value xi, and P(xi | h1) is the conditional probability that, given a hypothesis, the tuple satisfies it.
Bayesian Classification

• Training data can be used to determine the prior and conditional probabilities P(Cj) and P(xi | Cj), as well as P(xi). From these values Bayes theorem allows us to estimate the posterior probabilities P(Cj | xi) and P(Cj | ti).
• This must be done for all attributes and all values.
• The probabilities are descriptive and are then used to predict the class membership for a target tuple.
• Assuming the attribute values are independent, we then estimate P(ti | Cj) as the product of the conditional probabilities of the individual attribute values: P(ti | Cj) = P(xi1 | Cj) × ... × P(xih | Cj).
Bayesian Classification
• To calculate P(ti) we estimate the likelihoods for ti in each class and add these values.
• The posterior probability P(Cj | ti) is then found for each class. The class with the highest probability is the one chosen for the tuple.
• Only one scan of the training data is needed, and the technique can handle missing values. In simple relationships this technique often yields good results.
• The technique does not handle continuous data.
Example
• To facilitate classification, we divide the height attribute into six ranges:
• (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
• P(short) = 4/15 = 0.267, P(medium) = 8/15 = 0.533, and P(tall) = 3/15 = 0.2
• For example, suppose we wish to classify t = (Adam, M, 1.95 m)
• P(t | short) = 1/4 × 0 = 0
Example
• P(t | medium) = 2/8 × 1/8 = 0.031
• P(t | tall) = 3/3 × 1/3 = 0.333
• Likelihood of being short: 0 × 0.267 = 0
• Likelihood of being medium: 0.031 × 0.533 = 0.0166
• Likelihood of being tall: 0.333 × 0.2 = 0.066
• We estimate P(t) by summing up these individual likelihood values, since t will be either short, medium, or tall:
• P(t) = 0 + 0.0166 + 0.066 = 0.0826
• Finally, we obtain the actual probabilities of each event:
• P(short | t) = (0 × 0.267) / 0.0826 = 0
• P(medium | t) = (0.031 × 0.533) / 0.0826 = 0.2
• P(tall | t) = (0.333 × 0.2) / 0.0826 = 0.799
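A minimal Python sketch (not part of the original slides) that reproduces the arithmetic of this example; the priors and per-class likelihoods P(t | class) are the values computed above, and small differences from the slide figures come from rounding:

# Priors and per-class likelihoods P(t | class) taken from the example above
priors      = {"short": 0.267, "medium": 0.533, "tall": 0.2}
likelihoods = {"short": 0.0,   "medium": 0.031, "tall": 0.333}

# P(t) is estimated by summing likelihood * prior over all classes
p_t = sum(likelihoods[c] * priors[c] for c in priors)          # roughly 0.083

# Posterior P(class | t) by Bayes rule; the class with the largest value wins
posteriors = {c: likelihoods[c] * priors[c] / p_t for c in priors}
print(posteriors)                             # tall ~ 0.80, medium ~ 0.20, short = 0
print(max(posteriors, key=posteriors.get))    # -> 'tall'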
Distance based
• Simple Approach
• KNN
Distance based
• Place items in the class to which they are "closest".
• A distance measure is used to determine the similarity of different items.
• Simple approach: classes are represented by
– Centroid: central value.

ALGORITHM: Simple distance-based algorithm
Input:  c1, ..., cm   // Centers for each class
        t             // Input tuple to classify
Output: C             // Class to which t is assigned

dist = ∞;
for i := 1 to m do
  if dis(ci, t) < dist, then
    C = i;
    dist = dis(ci, t);
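A minimal Python sketch (not part of the original slides) of this centroid-based classifier, assuming numeric attribute vectors, Euclidean distance, and made-up centroids:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(centroids, t):
    # Assign t to the class whose centroid is closest
    best_class, best_dist = None, math.inf
    for label, center in centroids.items():
        d = euclidean(center, t)
        if d < best_dist:
            best_class, best_dist = label, d
    return best_class

# Hypothetical centroids (height in metres) for illustration only
centroids = {"short": (1.5,), "medium": (1.75,), "tall": (1.95,)}
print(classify(centroids, (1.6,)))   # -> 'short' (1.6 is closest to 1.5)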
K Nearest Neighbor (KNN):
• A common classification scheme based on the use of distance measures is that of the K nearest neighbors (KNN).
• The training set includes the classes along with the item set.
• When a classification is to be made for a new item, its distance to each item in the training set must be determined.
• Only the K closest entries in the training set are considered further.
• The new item is placed in the class that contains the most items from this set of K closest items.
• The complexity is O(q) for each tuple to be classified (here q is the size of the training set).
• The KNN technique is extremely sensitive to the value of K. A rule of thumb is that K ≤ √(number of training items). A small Python sketch follows.
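A minimal Python KNN sketch (not part of the original slides), assuming numeric tuples, Euclidean distance, and a made-up training set:

import math
from collections import Counter

def knn_classify(training, t, k):
    # training: list of (tuple_of_numbers, class_label)
    # Sort the training items by distance to t and keep the k closest
    by_dist = sorted(training, key=lambda item: math.dist(item[0], t))
    k_nearest = [label for _, label in by_dist[:k]]
    # Majority vote among the k nearest neighbours
    return Counter(k_nearest).most_common(1)[0][0]

# Hypothetical training data (height in metres) for illustration only
training = [((1.5,), "short"), ((1.6,), "short"), ((1.7,), "medium"),
            ((1.75,), "medium"), ((1.8,), "medium"), ((1.9,), "tall"),
            ((1.95,), "tall")]
print(knn_classify(training, (1.6,), k=5))   # -> 'medium' (3 of the 5 nearest are medium)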
Distance or Similarity Measures
K Nearest Neighbor (KNN):
KNN Example
• Classify the tuple t = (Pat, F, 1.6 m) using KNN with k = 5.
Classification Using Decision Trees
• The decision tree approach is most useful in classification problems.
• With this technique, a tree is constructed to model the classification process. Once the tree is built, it is applied to each tuple in the database and results in a classification for that tuple.
• There are two basic steps in the technique: building the tree and applying the tree to the database.
• ID3, C4.5, and CART are several popular decision tree approaches.
Decision Tree

• DEFINITION: Given a database D = {t1, ..., tn} where ti = {ti1, ..., tih}, and the database schema contains the following attributes {A1, A2, ..., Ah}. Also given is a set of classes C = {C1, ..., Cm}. A decision tree (DT) or classification tree is a tree associated with D that has the following properties:
 Each internal node is labeled with an attribute, Ai.
 Each arc is labeled with a predicate that can be applied to the attribute associated with the parent.
 Each leaf node is labeled with a class, Cj.


Decision Tree
• Solving the classification problem using decision trees is a two-step process:
1. Decision tree induction: construct a DT using training data.
2. For each ti ∈ D, apply the DT to determine its class (a small sketch of this step follows).
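A minimal Python sketch (not part of the original slides) of step 2, assuming a small hand-built tree over a single height attribute; induction algorithms such as ID3 would build this structure automatically:

# A decision tree represented as nested dictionaries:
# internal nodes test an attribute against a threshold, leaves hold a class.
tree = {"attr": "height", "threshold": 1.7,
        "le": {"class": "short"},            # height <= 1.7
        "gt": {"attr": "height", "threshold": 1.9,
               "le": {"class": "medium"},    # 1.7 < height <= 1.9
               "gt": {"class": "tall"}}}     # height > 1.9

def apply_tree(node, tuple_):
    # Walk from the root to a leaf, following the predicate on each arc
    while "class" not in node:
        value = tuple_[node["attr"]]
        node = node["le"] if value <= node["threshold"] else node["gt"]
    return node["class"]

print(apply_tree(tree, {"height": 1.95}))   # -> 'tall'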
Advantages of DT
Advantages:
• DTs are easy to use and efficient.
• Rules can be generated that are easy to interpret and understand.
• They scale well for large databases because the tree size is independent of the database size.
• Each tuple in the database must be filtered through the tree. This takes time proportional to the height of the tree, which is fixed.
• Trees can be constructed for data with many attributes.
Disadvantages of DT
Disadvantages:
• They do not easily handle continuous data. These attribute domains must be divided into categories to be handled.
• Handling missing data is difficult.
• Since the DT is constructed from the training data, overfitting may occur.
DT Induction
Example
DT Splits Area
DT Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits
• Tree Structure
• Stopping Criteria
• Training Data
• Pruning
DT Approaches

• ID3
• C4.5 and C5.0
• CART
• Scalable DT Techniques
ID3
• The ID3 technique for building a decision tree is based on information theory and attempts to minimize the expected number of comparisons.
• The basic strategy used by ID3 is to choose splitting attributes with the highest information gain first.
• The concept used to quantify information is called entropy.
• Entropy is used to measure the amount of uncertainty, surprise, or randomness in a set of data.
ID3

• ID3 chooses the splitting attribute with the highest gain in information.
• Gain is defined as the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split.
• The ID3 algorithm calculates the gain of a particular split as the entropy of the original data set minus the weighted sum of the entropies of the subsets produced by the split: Gain(D, S) = H(D) − Σ P(Di)·H(Di).
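A minimal Python sketch (not part of the original slides) of entropy and information gain as described above; the class labels and the candidate split are made up:

import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the classes present in the data
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(labels, subsets):
    # Information gain = entropy before the split minus the
    # weighted entropy of the subsets produced by the split
    n = len(labels)
    after = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(labels) - after

labels = ["short"] * 4 + ["medium"] * 8 + ["tall"] * 3
split  = [["short"] * 4, ["medium"] * 8 + ["tall"] * 3]   # a hypothetical split
print(round(gain(labels, split), 3))                       # -> 0.837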
C4.5
The decision tree algorithm C4.5 improves ID3 in the following ways:
• Missing data
• Continuous data
• Pruning
• Rules
• Splitting: in C4.5, splitting is based on GainRatio as opposed to Gain, which ensures a larger-than-average information gain.
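A small sketch (not from the slides) of the SplitInfo term commonly used with GainRatio; GainRatio is usually taken as Gain divided by this quantity, so splits that produce many small subsets are penalized:

import math

def split_info(subset_sizes):
    # Entropy of the distribution of subset sizes produced by a split
    n = sum(subset_sizes)
    return -sum((s / n) * math.log2(s / n) for s in subset_sizes if s)

# GainRatio = Gain / SplitInfo; e.g. a split of 15 tuples into subsets of sizes 5 and 10:
print(round(split_info([5, 10]), 3))   # -> 0.918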
C5.0
• C5.0 is based on boosting. Boosting is an approach to combining different classifiers.
• Boosting does not always help when the training data contains a lot of noise.
• Boosting works by creating multiple training sets from one training set.
• Thus, multiple classifiers are actually constructed.
• Each classifier is assigned a vote, voting is performed, and the target tuple is assigned to the class with the most votes.
CART
• Classification and regression trees (CART) is a technique that generates a binary decision tree.
• As with ID3, entropy is used as a measure to choose the best splitting attribute and criterion.
• Unlike ID3, however, where a child is created for each subcategory, only two children are created.
• The splitting is performed around what is determined to be the best split point.
• At each step, an exhaustive search is used to determine the best split, where "best" is defined by a measure of how well the split separates the classes in the two children.
Scalable DT Techniques
• The SPRINT (Scalable PaRallelizable INduction of decision Trees) algorithm addresses the scalability issue by ensuring that the CART technique can be applied regardless of the availability of main memory.
• It can be easily parallelized.
• With SPRINT, a gini index is used to find the best split. Here gini for a database D is defined as gini(D) = 1 − Σ pj², where pj is the relative frequency of class Cj in D.
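A minimal Python sketch (not part of the original slides) of the gini index and the weighted gini of a candidate split, with made-up class labels:

from collections import Counter

def gini(labels):
    # gini(D) = 1 - sum(p_j^2) over the classes in D
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(subsets):
    # Weighted gini of the subsets produced by a candidate split;
    # the split with the smallest value is preferred
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini(s) for s in subsets)

labels = ["short"] * 4 + ["medium"] * 8 + ["tall"] * 3
print(round(gini(labels), 3))                           # -> 0.604
print(round(gini_split([labels[:4], labels[4:]]), 3))   # gini of one candidate split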
Generating Rules from a Neural Net
Input:  D   // Training data
        N   // Initial neural network
Output: R   // Derived rules

RX algorithm: // Rule extraction algorithm to extract rules from an NN
cluster output node activation values;
cluster hidden node activation values;
generate rules that describe the output values in terms of the hidden activation values;
generate rules that describe hidden output values in terms of the inputs;
combine the two sets of rules.
Generating Rules without a DT or NN
• These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class.
• They generate the best rule possible by optimizing the desired classification probability.
• Usually the "best" attribute–value pair is chosen, as opposed to the best attribute, as is done with the tree-based algorithms.
• Suppose that we wished to generate a rule to classify persons as tall. The basic format for the rule is then
  If ? then class = tall
• The objective for the covering algorithms is to replace the "?" in this statement with predicates that can be used to obtain the "best" probability of being tall.
Generating Rules without a DT or NN using 1R
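A minimal Python sketch (not part of the original slides) of the 1R idea as it is usually described: for each attribute, one rule is formed per attribute value using the majority class, and the attribute whose rules make the fewest errors on the training data is kept. The training tuples below are made up:

from collections import Counter, defaultdict

def one_r(rows, attributes, class_index):
    # rows: list of tuples; attributes: indices of candidate attributes
    best_attr, best_rules, best_errors = None, None, float("inf")
    for a in attributes:
        # For each value of attribute a, pick the majority class
        by_value = defaultdict(list)
        for row in rows:
            by_value[row[a]].append(row[class_index])
        rules = {v: Counter(labels).most_common(1)[0][0]
                 for v, labels in by_value.items()}
        errors = sum(1 for row in rows if rules[row[a]] != row[class_index])
        if errors < best_errors:
            best_attr, best_rules, best_errors = a, rules, errors
    return best_attr, best_rules

# Hypothetical tuples: (gender, height_range, class)
rows = [("F", "(1.6,1.7]", "short"), ("M", "(1.9,2.0]", "tall"),
        ("F", "(1.7,1.8]", "medium"), ("M", "(1.8,1.9]", "medium"),
        ("M", "(2.0,inf)", "tall"), ("F", "(1.6,1.7]", "short")]
attr, rules = one_r(rows, attributes=[0, 1], class_index=2)
print(attr, rules)   # the height_range attribute (index 1) wins with zero errors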
Combining Technique
• For a given classification problem, no one classification technique may yield the best result.
• In that case we can look to combining techniques.
• C5.0 introduced a technique for combining classifiers called boosting.
• Two basic techniques can be used to accomplish this:
 A synthesis of approaches takes multiple techniques and blends them into a new approach.
 Multiple independent approaches can be applied to a classification problem, each yielding its own class prediction. The results of the individual techniques can then be combined (a small sketch of simple voting follows).
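A minimal Python sketch (not part of the original slides) of the second idea: several independently built classifiers each predict a class and a simple majority vote decides. The classifier functions here are hypothetical stand-ins:

from collections import Counter

def combine_by_voting(classifiers, t):
    # Each classifier is a function tuple -> class label; the class
    # with the most votes is assigned to the tuple
    votes = [clf(t) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for, e.g., a KNN, a decision tree, and a naive Bayes model
clf_a = lambda t: "tall" if t["height"] > 1.9 else "medium"
clf_b = lambda t: "tall" if t["height"] > 1.85 else "medium"
clf_c = lambda t: "medium"

print(combine_by_voting([clf_a, clf_b, clf_c], {"height": 1.95}))   # -> 'tall' (2 votes to 1)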
