
Data Mining
Classification and Prediction (1/3)

January 11, 2024 — Data Mining: Concepts and Techniques


Classification and Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Rule-based classification
• Classification by back propagation
• Support Vector Machines (SVM)
• Associative classification
• Other classification methods
• Prediction
• Accuracy and error measures
• Ensemble methods
• Model selection
• Summary



Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute and
uses it in classifying new data
• Prediction
• models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
• Credit approval
• Target marketing
• Medical diagnosis
• Fraud detection
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of each test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
• If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Process (1): Model Construction

Training data is fed to a classification algorithm, which produces a classifier (the model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model), expressed as a rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied first to testing data (to estimate accuracy), then to unseen data.

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
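The two-step process can be made concrete using the rule from the model-construction slide. The Python sketch below is illustrative only (not part of the original slides): it applies the rule to the testing data to estimate accuracy, then classifies the unseen tuple.

```python
# The model from Process (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the independent test set
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(1 for _, rank, years, label in test_set
              if predict_tenured(rank, years) == label)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")     # Merlisa is misclassified -> 75%

# Step 2b: if the accuracy is acceptable, classify unseen data
print(predict_tenured("Professor", 4))  # Jeff -> 'yes'
```

Note how the test set is disjoint from the training set: the 75% accuracy estimate comes from tuples the model never saw during construction.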
Supervised vs. Unsupervised Learning

• Supervised learning (classification)


• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data are unknown
• Given a set of measurements, observations, etc., the aim is
to establish the existence of classes or clusters in the data
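The contrast can be sketched in a few lines of Python; the data points, labels, and two-cluster setup below are invented for illustration:

```python
# Supervised: training data comes with class labels
labeled = [(1.0, "low"), (1.2, "low"), (8.0, "high"), (8.4, "high")]

def classify(x):
    # 1-nearest-neighbour: adopt the label of the closest training point
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

print(classify(7.5))  # -> 'high'

# Unsupervised: only the measurements, no labels; look for 2 clusters
points = [1.0, 1.2, 8.0, 8.4, 7.5]
c1, c2 = min(points), max(points)      # initial centroids
for _ in range(10):                    # simple 2-means iterations
    g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
    g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
print(sorted(g2))  # the 'high' group emerges with no labels at all
```

The classifier needs the labels to learn; the clustering recovers the same grouping purely from the structure of the measurements.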


Issues: Data Preparation
• Data cleaning
• Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
• Remove the irrelevant or redundant attributes
• Data transformation
• Generalize and/or normalize data
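A minimal Python sketch of these three preparation steps (the toy records, the mean-fill choice, and min-max normalization are assumptions made for illustration):

```python
# Toy records: (age, income, irrelevant_id); None marks a missing value
rows = [(25, 30000, 17), (None, 52000, 99), (40, 47000, 3)]

# Data cleaning: fill the missing age with the mean of the known ages
known = [r[0] for r in rows if r[0] is not None]
mean_age = sum(known) / len(known)
rows = [(r[0] if r[0] is not None else mean_age, r[1], r[2]) for r in rows]

# Relevance analysis: drop the irrelevant/redundant attribute (the id)
rows = [(age, income) for age, income, _ in rows]

# Data transformation: min-max normalize income into [0, 1]
lo = min(r[1] for r in rows)
hi = max(r[1] for r in rows)
rows = [(age, (inc - lo) / (hi - lo)) for age, inc in rows]
print(rows)
```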



Issues: Evaluating Classification Methods

• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: estimating the value of the predicted attribute
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules





Decision Tree Induction: Training Dataset

This follows an example of Quinlan’s ID3 (Playing Tennis).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no


Output: A Decision Tree for “buys_computer”

age?
├─ <=30: student?
│   ├─ no: no
│   └─ yes: yes
├─ 31..40: yes
└─ >40: credit_rating?
    ├─ excellent: no
    └─ fair: yes
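The tree above can be encoded as nested dicts and used to classify a tuple; the encoding and function names below are my own, not from the slides:

```python
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels
tree = {"age": {
    "<=30":   {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def predict(tree, tuple_):
    # Walk down the tree until a leaf (a plain label) is reached
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][tuple_[attr]]
    return tree

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(predict(tree, x))  # -> 'yes'
```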


Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are discretized in
advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
• There are no samples left
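The basic algorithm above can be sketched in Python as a simplified ID3 using information gain; the data layout (tuples as dicts) and the function names are choices made for this illustration:

```python
import math
from collections import Counter

def entropy(labels):
    # Info(D) over a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    # Gain(A) = Info(D) minus the expected information after splitting on attr
    n = len(rows)
    split = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        split += len(subset) / n * entropy(subset)
    return entropy(labels) - split

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:          # all samples in one class -> leaf
        return labels[0]
    if not attrs:                      # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, labels))
    tree = {}
    for v in set(r[best] for r in rows):   # partition recursively
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                      [a for a in attrs if a != best])
    return {best: tree}
```

Run on the buys_computer training set shown earlier, this sketch selects age at the root, mirroring the tree on the output slide.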



Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D
belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

      Info(D) = - Σ_{i=1..m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

      Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × I(D_j)

 Information gained by branching on attribute A:

      Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

• Class P: buys_computer = “yes” (9 tuples)
• Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means “age <=30” has 5 out of the 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
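These numbers can be reproduced directly; the helper name `info` below is made up for this sketch:

```python
import math

def info(*counts):
    # I(p, n, ...): entropy of a class distribution given raw counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Info(D) for 9 'yes' and 5 'no' tuples
info_d = info(9, 5)
print(round(info_d, 3))        # -> 0.94

# Info_age(D): weighted entropy of the three age partitions
info_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)
gain_age = info_d - info_age
print(round(info_age, 3), round(gain_age, 3))
```

With full precision the gain comes out near 0.247; the slide’s 0.246 results from rounding Info(D) to 0.940 and Info_age(D) to 0.694 before subtracting.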
Enhancements to Basic Decision Tree Induction

• Allow for continuous-valued attributes


• Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
• Handle missing attribute values
• Assign the most common value of the attribute
• Assign probability to each of the possible values
• Attribute construction
• Create new attributes based on existing ones that are sparsely
represented
• This reduces fragmentation, repetition, and replication
Classification in Large Databases

• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
• relatively faster learning speed (than other classification
methods)
• convertible to simple and easy to understand classification
rules
• can use SQL queries for accessing databases
• comparable classification accuracy with other methods



Scalable Decision Tree Induction Methods
• SLIQ (EDBT’96 — Mehta et al.)
• Builds an index for each attribute and only class list and the
current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
• Constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
• Integrates tree splitting and tree pruning: stop growing the
tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
• Builds an AVC-list (attribute, value, class label)
• BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
• Uses bootstrapping to create several small samples

