Data Mining Unit 2
🔍 1. Definition of Classification
A classifier learns a mapping function f(X) = y that assigns a class label y to each input tuple X.
o Classification rules
o Decision trees
o Mathematical models
Accuracy Evaluation:
o Use test data (with known labels, but not seen during
training).
o Calculate accuracy:
$$\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total test tuples}} \times 100$$
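A minimal Python sketch of this calculation (the labels below are hypothetical, only to show the arithmetic):

```python
# Hypothetical test-set labels and model predictions
actual    = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no"]

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual) * 100
print(f"Accuracy = {accuracy:.1f}%")  # 4 of 5 correct -> 80.0%
```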
Term | Description
Training Set | Data with known class labels, used to train the model
Test Set | Data with known class labels (not used during training), used to evaluate the model
(Figure: flow of the training data, branching on "Known classes?" with Yes / No outcomes)
🚫 Note on Overfitting
Structure:
✅ Advantages
Input:
Output: A decision tree
Steps:
1. Create a node N.
2. Select the splitting attribute and branch according to its type:
o Discrete-valued attribute: one branch is created for each known value.
o Continuous-valued attribute: two branches using a split-point (A ≤ split_point and A > split_point).
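A minimal Python sketch of this recursive tree-growing procedure, assuming discrete attributes only and a caller-supplied `select_attribute` helper (a hypothetical name; it would implement whichever attribute selection measure is used, such as information gain):

```python
from collections import Counter

def majority_class(rows, target):
    """Return the most common class label among the rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target, select_attribute):
    """Sketch of basic decision tree induction.
    rows: list of dicts; attributes: candidate attribute names;
    target: class attribute name; select_attribute: callable that
    picks the splitting attribute (e.g., by information gain)."""
    classes = {r[target] for r in rows}
    # Stopping condition: all tuples belong to one class -> leaf
    if len(classes) == 1:
        return {"leaf": classes.pop()}
    # Stopping condition: no attributes left -> majority-class leaf
    if not attributes:
        return {"leaf": majority_class(rows, target)}
    # Create a node N, pick the splitting attribute, and branch
    attr = select_attribute(rows, attributes, target)
    node = {"attribute": attr, "branches": {}}
    # Discrete-valued attribute: one branch per known value
    # (a continuous attribute would instead get two branches on a split-point)
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        node["branches"][value] = build_tree(subset, remaining, target, select_attribute)
    return node
```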
🛑 Stopping Conditions
Computational Complexity
o Overall cost: O(n × |D| × log |D|), where n = number of attributes and |D| = number of training tuples
💡 Notable Algorithms
🔄 Incremental Learning
$$\text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j)$$
where $D_j$ is the subset of tuples having outcome $a_j$ of attribute A.
The attribute with the highest Information Gain is chosen for the
decision node.
$$\text{Info}(D) = -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) = 0.940 \text{ bits}$$
Youth: 2 yes, 3 no
Middle-aged: 4 yes, 0 no
Senior: 3 yes, 2 no
$$\text{Info}_{\text{age}}(D) = \frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 + \frac{5}{14}\times 0.971 = 0.694 \text{ bits}$$
Then, Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
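A short Python sketch that reproduces these numbers from the class counts given above:

```python
from math import log2

def entropy(counts):
    """Info(D) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Whole dataset: 9 "yes", 5 "no"
info_D = entropy([9, 5])                      # ~0.940 bits

# Partitions induced by age: (yes, no) counts per value
partitions = {"youth": [2, 3], "middle-aged": [4, 0], "senior": [3, 2]}
total = 14
info_age = sum(sum(c) / total * entropy(c) for c in partitions.values())  # ~0.694 bits

gain_age = info_D - info_age                  # ~0.246 bits
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```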
Gain Ratio
Gain Ratio is a metric used in decision tree algorithms (like C4.5) to select
the best attribute for splitting the dataset. It addresses a major limitation
of Information Gain, which tends to favor attributes with many distinct
values (like unique IDs), even when those attributes are poor for
generalization.
But such a split is useless for predicting new data (i.e., overfitting).
Formulas
Let’s define:
D : dataset
1. Split Information:
$$\text{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
This measures the potential information created by the split itself,
regardless of class labels.
2. Gain Ratio:
$$\text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{SplitInfo}_A(D)}$$
This corrects the bias by dividing Information Gain by how “uniformly” the
attribute splits the data.
Example
Given a split on income resulting in three partitions:
low: 4 tuples
medium: 6 tuples
high: 4 tuples
Step 1: Compute the split information for income:
$$\text{SplitInfo}_{\text{income}}(D) = -\left(\frac{4}{14}\log_2\frac{4}{14} + \frac{6}{14}\log_2\frac{6}{14} + \frac{4}{14}\log_2\frac{4}{14}\right) = 1.557$$
Step 2: Use Gain from Example 8.1
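A minimal sketch of the split-information and gain-ratio calculation for this income split; the Gain(income) value itself is not given in these notes, so it is left as a parameter taken from Example 8.1:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) from the sizes of the partitions produced by A."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(partition_sizes)

# income splits D into partitions of size 4, 6 and 4
print(round(split_info([4, 6, 4]), 3))   # ~1.557
# gain_ratio(gain_income, [4, 6, 4]), with gain_income taken from Example 8.1
```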
Decision tree classifiers are known for high accuracy and speed, especially on large datasets.
Gini Index
The Gini Index is used in the CART (Classification and Regression
Trees) algorithm to measure the impurity of a dataset.
🔸 Definition
🔹 Binary Splits
CART allows only binary splits (dividing into two groups). For each
attribute, we:
3. Choose the split that minimizes the Gini index of the resulting
partitions.
🔸 Discrete Attributes
If attribute A has v values, then there are $2^v - 2$ valid binary splits (excluding the empty set and the full set).
Example:
🔸 Continuous Attributes
"Is A ≤ split_point?"
Choose the attribute and split that maximizes the gain (i.e., minimizes $\text{Gini}_A(D)$).
$$\text{Gini}(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
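A minimal sketch of the Gini computation, using the same 9/5 class distribution, plus the weighted Gini of a binary split:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(counts_left, counts_right):
    """Gini_A(D) for a binary split producing partitions D1 and D2."""
    n1, n2 = sum(counts_left), sum(counts_right)
    n = n1 + n2
    return n1 / n * gini(counts_left) + n2 / n * gini(counts_right)

print(round(gini([9, 5]), 3))              # ~0.459
print(round(gini_split([7, 3], [2, 2]), 3))  # hypothetical split for illustration
```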
1. Information Gain:
2. Gain Ratio:
3. Gini Index:
4. Other Measures:
o G-statistic
The following notes cover Attribute Selection Measures, Tree Pruning, Scalability, and Visual Mining in Decision Tree Induction.
Other Attribute Selection Measures
Common Issues with Traditional Measures:
Information Gain:
Gain Ratio:
Gini Index:
Alternative Measures:
Multivariate Splits:
Summary:
Two Approaches:
Prepruning: stop growing the tree early based on a threshold (e.g., gain, Gini).
Postpruning: remove subtrees from a fully grown tree and replace them with leaf nodes.
Examples:
o Based on: number of leaves, error rate
MDL-based Pruning:
Hybrid:
Problem:
Scalable Algorithms:
📌 RainForest
Efficient:
Key Concepts:
Interface:
Benefits:
Bayes’ Theorem
Let:
P(H | X) is the probability that hypothesis H holds given the evidence X.
TERM MEANING
$$P(H \mid X) = \frac{P(X \mid H) \cdot P(H)}{P(X)}$$
Example:
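A minimal numeric sketch of the formula; the probability values below are hypothetical and only illustrate the arithmetic:

```python
# Hypothetical values: P(H), P(X|H) and P(X) are assumed, not from the notes
p_h = 0.1          # prior probability of hypothesis H
p_x_given_h = 0.8  # likelihood of evidence X given H
p_x = 0.25         # overall probability of evidence X

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem
print(round(p_h_given_x, 3))            # 0.32
```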
Rule-Based Classification
🔸 Overview
🔍 Structure of a Rule
Alternative representation:
✅ Rule Satisfaction
Let:
n_covers = number of tuples covered by rule R
n_correct = number of those covered tuples correctly classified by R
|D| = total number of tuples in the dataset
🧮 Coverage: coverage(R) = n_covers / |D| = 2/14 = 14.28%
Accuracy: accuracy(R) = n_correct / n_covers = 2/2 = 100%
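A small sketch of how a rule's coverage and accuracy could be computed over a dataset; the rule and the four tuples here are hypothetical, not the 14-tuple example above:

```python
# Hypothetical rule: IF age = "youth" AND student = "yes" THEN buys = "yes"
data = [
    {"age": "youth",  "student": "yes", "buys": "yes"},
    {"age": "youth",  "student": "yes", "buys": "yes"},
    {"age": "youth",  "student": "no",  "buys": "no"},
    {"age": "senior", "student": "yes", "buys": "yes"},
]

def rule_covers(t):
    return t["age"] == "youth" and t["student"] == "yes"

covered = [t for t in data if rule_covers(t)]
correct = [t for t in covered if t["buys"] == "yes"]
coverage = len(covered) / len(data) * 100     # n_covers / |D|
accuracy = len(correct) / len(covered) * 100  # n_correct / n_covers
print(f"coverage={coverage:.2f}%, accuracy={accuracy:.2f}%")
```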
2. If only one rule is triggered, that rule fires and assigns its class
label.
When multiple rules are triggered, we must choose which one to fire.
Common strategies include:
1. Size Ordering
2. Rule Ordering
o Rule accuracy
o Rule coverage
o Rule size
Rule-Based Ordering
🧠 Note: Rules in a decision list imply negation of the earlier rules. That is,
each rule assumes the previous ones didn’t apply.
📌 Basic Idea:
One rule is created per path from the root to a leaf node.
The splitting criteria (conditions) along the path form the
antecedent (IF part).
From a decision tree (as in Figure 8.2 of the textbook), the following rules
are extracted:
R1:
R2:
R3:
R4:
R5:
1. Mutually Exclusive:
2. Exhaustive:
3. Unordered Rules:
🔧 Purpose of Pruning:
Simplify rules.
Improve generalization.
⚠️ Consequences of Pruning
🔹 Overview
After learning a rule: Remove the tuples it covers and repeat the
process with remaining data.
🔸 Popular Algorithms
AQ
CN2
Feature | Sequential Covering | Decision Tree
Rule Generation | Directly from data | Paths from tree
Rule Set | Can be overlapping | Mutually exclusive
Input:
Output: A set of IF-THEN rules
Method:
o Repeat:
IF [ ] THEN class = C
At each step: choose the attribute test that most improves rule quality, add this test to the rule, and repeat until rule quality is satisfactory (see the sketch below).
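A compact Python sketch of this greedy learn-one-rule loop, assuming categorical attributes and using rule accuracy as the quality measure; the function names and the accuracy threshold are illustrative, and the outer sequential-covering loop (removing covered tuples and repeating) is omitted:

```python
def rule_quality(rows, conditions, target_class, target):
    """Accuracy and coverage of the rule IF conditions THEN class = target_class."""
    covered = [r for r in rows
               if all(r[a] == v for a, v in conditions.items())]
    if not covered:
        return 0.0, 0
    correct = sum(1 for r in covered if r[target] == target_class)
    return correct / len(covered), len(covered)

def learn_one_rule(rows, attributes, target, target_class, min_accuracy=0.9):
    """Start with IF ( ) THEN class = C and greedily add the attribute
    test that most improves rule accuracy."""
    conditions = {}
    while True:
        quality, _ = rule_quality(rows, conditions, target_class, target)
        if quality >= min_accuracy:
            break
        best = None
        for attr in attributes:
            if attr in conditions:
                continue
            for value in {r[attr] for r in rows}:
                candidate = dict(conditions, **{attr: value})
                q, n = rule_quality(rows, candidate, target_class, target)
                if n > 0 and (best is None or q > best[0]):
                    best = (q, candidate)
        if best is None or best[0] <= quality:
            break          # no test improves the rule any further
        conditions = best[1]
    return conditions      # the antecedent of the learned rule
```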
🔍 Search Strategy
General-to-Specific Search:
Greedy Search:
No backtracking.
Beam width = k
Imagine a dataset with features like age, income, credit rating, etc.
Start with an empty antecedent: IF ( ) THEN class = C.
📈 Rule Quality
o Accuracy
o Coverage
o Confidence
o Information Gain
🛑 Termination Conditions
✅ Advantages
⚠️ Limitations
Key Definitions:
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
Evaluation Metrics:
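The exact list of metrics is not shown in these notes, so the following minimal sketch uses the standard ones derived from the confusion matrix counts:

```python
def metrics(tp, fn, fp, tn):
    """Common evaluation metrics from confusion matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    recall      = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, specificity, f1

# Hypothetical counts
print(metrics(tp=90, fn=10, fp=20, tn=80))
```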
1. Holdout Method
Procedure: Split data into two sets: training (usually 2/3) and test (1/3).
Use: Model is trained on the training set and evaluated on the test
set.
2. Random Subsampling
3. k-Fold Cross-Validation
Procedure:
o Split data into k equal-sized folds.
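A plain-Python sketch of the fold construction and evaluation loop; the training and evaluation calls are left as comments since no specific learner is assumed here:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

n, k = 14, 5
# Each fold is used once as the test set; the remaining folds form the training set
for test_fold in k_fold_indices(n, k):
    test_set = set(test_fold)
    train_idx = [i for i in range(n) if i not in test_set]
    # ... train on train_idx, evaluate on test_fold, then average the k accuracies
```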
🔸 Special Cases:
Stratified Cross-Validation:
4. Bootstrap Method
Procedure:
o Based on probability: a given tuple is never selected in d draws with probability $\left(1 - \frac{1}{d}\right)^d \approx e^{-1} \approx 0.368$, so about 63.2% of the original tuples appear in each bootstrap sample (hence the name .632 bootstrap).
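A quick sketch that demonstrates the 0.632 / 0.368 figures by sampling with replacement; the value of d is arbitrary:

```python
import random

d = 1000
sample = [random.randrange(d) for _ in range(d)]     # bootstrap sample, with replacement
in_sample = len(set(sample)) / d                     # fraction of original tuples selected
print(round(in_sample, 3), round(1 - in_sample, 3))  # roughly 0.632 and 0.368
```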
where:
Degrees of freedom = k - 1 = 9
If your computed t is larger than the table value, you reject the
null hypothesis (i.e., the difference is real).
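Assuming SciPy is available, the t-table lookup can be done programmatically; the 5% significance level is an assumption here, while df = 9 comes from the notes:

```python
from scipy import stats

df = 9                                       # k - 1 = 9
alpha = 0.05
t_critical = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value
print(round(t_critical, 3))                  # ~2.262
# reject the null hypothesis if |computed t| > t_critical
```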
💰 Cost–Benefit Analysis:
o When plotting the ROC curve, move up for each true positive (TP)
o Move right for each false positive (FP)
Example:
Interpretation:
The steeper and higher the curve, the better the model.
Mechanism: Each base classifier votes on the class label, and the
ensemble outputs the majority-voted label.
b. Boosting – Section
Solutions:
Analogy:
1. Input:
3. Prediction Phase:
Key Features:
Equal Weighting:
Advantages of Bagging:
Increased Accuracy:
Reduces Variance:
Intuition:
1. Initialize Weights:
o All training tuples are initially given equal weight: 1/d for d
data points.
o Update weights: if tuple j was correctly classified by $M_i$, its weight is multiplied by the error ratio (weights are then renormalized, so misclassified tuples gain weight):
$$w_j \leftarrow w_j \cdot \frac{\text{error}(M_i)}{1 - \text{error}(M_i)}$$
o The vote weight of classifier $M_i$ is:
$$w_i = \log\left(\frac{1 - \text{error}(M_i)}{\text{error}(M_i)}\right)$$
o For a new instance X: each classifier $M_i$ votes for a class with weight $w_i$, and the class with the highest total weighted vote is assigned to X.
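A minimal sketch of this weight bookkeeping; the base classifiers themselves are abstracted away, the variable names are illustrative, and the natural log is used for the vote weight:

```python
from math import log

def update_tuple_weights(weights, correct, error):
    """Multiply the weights of correctly classified tuples by
    error/(1 - error), then renormalize so they sum to 1."""
    new_w = [w * (error / (1 - error)) if ok else w
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]

def classifier_vote_weight(error):
    """Voting power of classifier M_i: log((1 - error) / error)."""
    return log((1 - error) / error)

# Hypothetical round: 10 tuples, classifier M_i has weighted error 0.2
weights = [1 / 10] * 10
correct = [True] * 8 + [False] * 2          # which tuples M_i classified correctly
weights = update_tuple_weights(weights, correct, error=0.2)
print(classifier_vote_weight(0.2))          # ~1.386
```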
3. Random Forests
✅ Definition
Like asking multiple doctors (trees) for a diagnosis, then trusting the
majority vote or average prediction.
1. Ensemble Learning
2. Bagging
From the original dataset D with d samples, we generate k datasets D_1, D_2, ..., D_k by random sampling with replacement (bootstrap sampling).
Some original samples may appear multiple times; others may not
be selected at all.
The best split is chosen only among this subset, not from all
features.
5. Prediction
For classification: each tree votes and the majority class is returned.
For regression: the predictions of the individual trees are averaged.
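Assuming scikit-learn is available, a minimal usage sketch; the feature matrix and labels below are hypothetical and already numerically encoded:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical encoded features (e.g., age, income, credit flag) and class labels
X = [[25, 40000, 0], [45, 80000, 1], [35, 60000, 1], [52, 30000, 0],
     [23, 20000, 0], [40, 90000, 1], [60, 75000, 1], [30, 50000, 0]]
y = ["no", "yes", "yes", "no", "no", "yes", "yes", "no"]

# Each tree is trained on a bootstrap sample and considers a random
# subset of features at each split; prediction is by majority vote.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.predict([[28, 45000, 1]]))
```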