Unit 2
Classification
• Introduction
• Statistical Based Algorithm
• Distance Based Algorithm
• Tree Based Algorithm
• Rule Based Algorithm
• Neural Network Based Algorithm
• Combining Technique
Introduction
• Classification involves mapping of input data to appropriate
classes.
• Def: Given a database D = {t1, t2, ..., tn} of tuples (items,
records) and a set of classes C = {C1, ..., Cm}, the
classification problem is to define a mapping f: D → C where
each ti is assigned to one class. A class, Cj, contains precisely
those tuples mapped to it; that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n and
ti ∈ D}.
• Classification is carried out in two phases:
1. Create a specific model by evaluating the training data.
2. Apply the model to classify tuples from the target database.
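The definition and the two phases above can be illustrated with a small sketch. It is not a method from these slides: the nearest-class-mean rule, the function names (build_model, f, classify), and the height data are all assumptions chosen only to keep the example short.

```python
from collections import defaultdict

def build_model(training):
    """Phase 1: derive a model from labeled training tuples.

    training is a list of (value, class_label) pairs. The "model" here is
    simply the mean value of each class (an illustrative assumption).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in training:
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def f(model, t):
    """The mapping f: D -> C. Assigns t to the class with the nearest mean."""
    return min(model, key=lambda label: abs(t - model[label]))

def classify(model, database):
    """Phase 2: apply the model and collect Cj = {ti | f(ti) = Cj}."""
    classes = {label: [] for label in model}
    for t in database:
        classes[f(model, t)].append(t)
    return classes

# Hypothetical data: heights in metres labeled "short" or "tall".
training = [(1.5, "short"), (1.6, "short"), (1.9, "tall"), (2.0, "tall")]
model = build_model(training)
classes = classify(model, [1.55, 1.95, 1.7, 1.85])
print(classes)
```

Each tuple ends up in exactly one class, so the returned lists partition the target database, which is what the definition of Cj requires.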
• Issues in Classification:
1. Missing Data
2. Measuring Performance
Missing Data
There are many approaches to handling missing data:
Tree Based Algorithm
• There are two basic steps in the technique: building the tree and
applying the tree to the database.
• Each arc is labeled with a predicate that can be applied to the attribute
associated with the parent.
• ID3
• C4.5 and C5.0
• CART
• Scalable DT Techniques
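The second step, applying a tree whose arcs carry predicates, might look like the sketch below. The tree is written by hand, so the building step is not shown. The Node class, the classify function, the height thresholds, and the sample tuples are hypothetical and serve only as illustration.

```python
class Node:
    """A decision tree node: a leaf holding a class label, or an internal
    node that tests one attribute and has predicate-labeled arcs."""

    def __init__(self, attribute=None, arcs=None, label=None):
        self.attribute = attribute  # attribute tested at this node
        self.arcs = arcs or []      # list of (predicate, child) pairs
        self.label = label          # class label, set only on leaves

def classify(node, record):
    """Apply the tree to one tuple by following the arc whose predicate holds."""
    while node.label is None:
        value = record[node.attribute]
        for predicate, child in node.arcs:
            if predicate(value):
                node = child
                break
        else:
            raise ValueError(f"no arc matches {node.attribute}={value!r}")
    return node.label

# Hand-built tree with a single test on a hypothetical "height" attribute.
tree = Node(
    attribute="height",
    arcs=[
        (lambda h: h < 1.7, Node(label="short")),
        (lambda h: 1.7 <= h < 1.95, Node(label="medium")),
        (lambda h: h >= 1.95, Node(label="tall")),
    ],
)

database = [{"height": 1.6}, {"height": 1.8}, {"height": 2.0}]
labels = [classify(tree, t) for t in database]
print(labels)
```

Each predicate is applied to the value of the parent node's attribute, and the predicates on the arcs leaving a node are assumed to be mutually exclusive and to cover every possible value.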
ID3
• The ID3 approach to building a decision tree is based on
information theory and attempts to minimize the expected
number of comparisons.
• The basic strategy used by ID3 is to choose splitting attributes
with the highest information gain first.
• The concept used to quantify information is called entropy.
• Entropy is used to measure the amount of uncertainty or
surprise or randomness in a set of data.
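As a rough illustration of the points above, the sketch below computes entropy as the sum of p * log2(1/p) over the class proportions p, and information gain as the entropy before a split minus the weighted entropy after it. The function names, the attribute names, and the four-row data set are assumptions made for the example, not material from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1/p) over class proportions."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy after the split."""
    before = entropy([r[target] for r in rows])
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attribute], []).append(r[target])
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return before - after

def best_split(rows, attributes, target):
    """ID3-style choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training data with two candidate splitting attributes.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

print(entropy([r["play"] for r in rows]))
print(information_gain(rows, "outlook", "play"))
print(information_gain(rows, "windy", "play"))
print(best_split(rows, ["outlook", "windy"], "play"))
```

In this toy data the "outlook" attribute separates the classes perfectly, so splitting on it removes all uncertainty, while "windy" leaves each subset as mixed as the original set.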