
Data Mining
Classification and Prediction (1/3)

January 11, 2024 — Data Mining: Concepts and Techniques


Classification and Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Bayesian classification
• Rule-based classification
• Classification by back propagation
• Support Vector Machines (SVM)
• Associative classification
• Other classification methods
• Prediction
• Accuracy and error measures
• Ensemble methods
• Model selection
• Summary



Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute and
uses it in classifying new data
• Prediction
• models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
• Credit approval
• Target marketing
• Medical diagnosis
• Fraud detection
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of each test sample is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
• If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Process (1): Model Construction

Training data is fed to a classification algorithm, which produces a classifier (the model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (model), expressed as a rule:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied first to testing data (to estimate accuracy), then to unseen data.

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
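The two-step process can be made concrete using the rule from the model-construction slide. The Python sketch below is illustrative only (not part of the original slides): it applies the rule to the testing data to estimate accuracy, then classifies the unseen tuple.

```python
# The model from Process (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the independent test set
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(1 for _, rank, years, label in test_set
              if predict_tenured(rank, years) == label)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.0%}")     # Merlisa is misclassified -> 75%

# Step 2b: if the accuracy is acceptable, classify unseen data
print(predict_tenured("Professor", 4))  # Jeff -> 'yes'
```

Note how the test set is disjoint from the training set: the 75% accuracy estimate comes from tuples the model never saw during construction.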
Supervised vs. Unsupervised Learning

• Supervised learning (classification)


• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data are unknown
• Given a set of measurements, observations, etc., the aim is
to establish the existence of classes or clusters in the data
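The contrast can be sketched in a few lines of Python; the data points, labels, and two-cluster setup below are invented for illustration:

```python
# Supervised: training data comes with class labels
labeled = [(1.0, "low"), (1.2, "low"), (8.0, "high"), (8.4, "high")]

def classify(x):
    # 1-nearest-neighbour: adopt the label of the closest training point
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

print(classify(7.5))  # -> 'high'

# Unsupervised: only the measurements, no labels; look for 2 clusters
points = [1.0, 1.2, 8.0, 8.4, 7.5]
c1, c2 = min(points), max(points)      # initial centroids
for _ in range(10):                    # simple 2-means iterations
    g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
    g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
print(sorted(g2))  # the 'high' group emerges with no labels at all
```

The classifier needs the labels to learn; the clustering recovers the same grouping purely from the structure of the measurements.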


Issues: Data Preparation
• Data cleaning
• Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
• Remove the irrelevant or redundant attributes
• Data transformation
• Generalize and/or normalize data
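A minimal Python sketch of these three preparation steps (the toy records, the mean-fill choice, and min-max normalization are assumptions made for illustration):

```python
# Toy records: (age, income, irrelevant_id); None marks a missing value
rows = [(25, 30000, 17), (None, 52000, 99), (40, 47000, 3)]

# Data cleaning: fill the missing age with the mean of the known ages
known = [r[0] for r in rows if r[0] is not None]
mean_age = sum(known) / len(known)
rows = [(r[0] if r[0] is not None else mean_age, r[1], r[2]) for r in rows]

# Relevance analysis: drop the irrelevant/redundant attribute (the id)
rows = [(age, income) for age, income, _ in rows]

# Data transformation: min-max normalize income into [0, 1]
lo = min(r[1] for r in rows)
hi = max(r[1] for r in rows)
rows = [(age, (inc - lo) / (hi - lo)) for age, inc in rows]
print(rows)
```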



Issues: Evaluating Classification Methods

• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: estimating the value of the predicted attribute
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules





Decision Tree Induction: Training Dataset

This follows an example of Quinlan’s ID3 (Playing Tennis).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no


Output: A Decision Tree for “buys_computer”

age?
├─ <=30: student?
│   ├─ no: no
│   └─ yes: yes
├─ 31..40: yes
└─ >40: credit_rating?
    ├─ excellent: no
    └─ fair: yes
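The tree above can be encoded as nested dicts and used to classify a tuple; the encoding and function names below are my own, not from the slides:

```python
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels
tree = {"age": {
    "<=30":   {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def predict(tree, tuple_):
    # Walk down the tree until a leaf (a plain label) is reached
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][tuple_[attr]]
    return tree

x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(predict(tree, x))  # -> 'yes'
```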


Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are discretized in
advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
• There are no samples left
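The basic algorithm above can be sketched in Python as a simplified ID3 using information gain; the data layout (tuples as dicts) and the function names are choices made for this illustration:

```python
import math
from collections import Counter

def entropy(labels):
    # Info(D) over a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    # Gain(A) = Info(D) minus the expected information after splitting on attr
    n = len(rows)
    split = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        split += len(subset) / n * entropy(subset)
    return entropy(labels) - split

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:          # all samples in one class -> leaf
        return labels[0]
    if not attrs:                      # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, labels))
    tree = {}
    for v in set(r[best] for r in rows):   # partition recursively
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[v] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                      [a for a in attrs if a != best])
    return {best: tree}
```

Run on the buys_computer training set shown earlier, this sketch selects age at the root, mirroring the tree on the output slide.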



Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let p_i be the probability that an arbitrary tuple in D
belongs to class C_i, estimated by |C_i,D| / |D|
 Expected information (entropy) needed to classify a tuple in D:

      Info(D) = - Σ_{i=1..m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

      Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × I(D_j)

 Information gained by branching on attribute A:

      Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

• Class P: buys_computer = “yes” (9 tuples)
• Class N: buys_computer = “no” (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

The term (5/14) I(2,3) means “age <=30” has 5 out of the 14 samples, with 2 yes’es and 3 no’s. Hence

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
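These numbers can be reproduced directly; the helper name `info` below is made up for this sketch:

```python
import math

def info(*counts):
    # I(p, n, ...): entropy of a class distribution given raw counts
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Info(D) for 9 'yes' and 5 'no' tuples
info_d = info(9, 5)
print(round(info_d, 3))        # -> 0.94

# Info_age(D): weighted entropy of the three age partitions
info_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)
gain_age = info_d - info_age
print(round(info_age, 3), round(gain_age, 3))
```

With full precision the gain comes out near 0.247; the slide’s 0.246 results from rounding Info(D) to 0.940 and Info_age(D) to 0.694 before subtracting.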
Enhancements to Basic Decision Tree Induction

• Allow for continuous-valued attributes


• Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
• Handle missing attribute values
• Assign the most common value of the attribute
• Assign probability to each of the possible values
• Attribute construction
• Create new attributes based on existing ones that are sparsely
represented
• This reduces fragmentation, repetition, and replication
Classification in Large Databases

• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
• relatively faster learning speed (than other classification
methods)
• convertible to simple and easy to understand classification
rules
• can use SQL queries for accessing databases
• comparable classification accuracy with other methods



Scalable Decision Tree Induction Methods
• SLIQ (EDBT’96 — Mehta et al.)
• Builds an index for each attribute and only class list and the
current attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
• Constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
• Integrates tree splitting and tree pruning: stop growing the
tree earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
• Builds an AVC-list (attribute, value, class label)
• BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
• Uses bootstrapping to create several small samples

