Data Mining Unit 2

Classification is a data analysis task aimed at predicting categorical class labels using a model built from input features. It has various real-world applications, including banking, marketing, and medical research, and is distinct from numeric prediction, which deals with continuous values. The classification process involves a training phase with labeled data and a testing phase with unlabeled data to evaluate accuracy, utilizing methods like decision trees and Bayesian classifiers.


Classification

🔍 1. Definition of Classification

 Classification is a data analysis task used to predict categorical class labels (discrete and unordered).

 A model (also called a classifier) is built to assign data instances to predefined classes based on the input features.

🧠 2. Purpose & Real-World Applications

| Application Area | Use of Classification |
|---|---|
| Banking | Classify loan applicants as “safe” or “risky” |
| Marketing | Predict whether a customer will buy a product (“yes” or “no”) |
| Medical research | Predict suitable treatment (“treatment A”, “B”, or “C”) based on patient data |

 In all these examples, the goal is to predict a label (class) from input attributes using historical data.

🔁 3. Classification vs Numeric Prediction

| Feature | Classification | Numeric Prediction |
|---|---|---|
| Output | Discrete labels (e.g., safe/risky) | Continuous values (e.g., total purchase amount) |
| Data Type | Categorical | Numeric |
| Example Method | Decision Trees, SVM | Regression |
| Output Name | Class label | Predictor |
| Goal | Assign to a category | Predict a numeric value |

Numeric prediction is commonly associated with regression analysis, but other methods can also be used.
🔄 4. General Process of Classification

Step 1: Learning (Training Phase)

 Input: A set of training data tuples, where each tuple has:

o A set of input features: X = (x_1, x_2, …, x_n)

o A known class label y

 Goal: Learn a mapping function: f(X) = y

 The mapping can be expressed as:

o Classification rules

o Decision trees

o Mathematical models

 This step is also called supervised learning, as the model is trained with labeled examples.

Step 2: Classification (Testing Phase)

 Input: New, unlabeled data (test set).

 Classifier predicts class labels for these data.

 Accuracy Evaluation:

o Use test data (with known labels, but not seen during training).

o Compare predicted class vs actual class.

o Calculate accuracy:

Accuracy = \frac{\text{Correct predictions}}{\text{Total test tuples}} \times 100

 The test data must be independent of the training data.


🧾 5. Key Terminology

| Term | Description |
|---|---|
| Tuple (X) | A data instance, also called an example, sample, or object |
| Attribute Vector | Input feature vector X = (x_1, x_2, ..., x_n) |
| Class Label | Output category or target class (e.g., safe, risky) |
| Classifier / Model | Learned function or rules for assigning class labels |
| Training Set | Data with known class labels used to train the model |
| Test Set | Data with known labels (not used during training) used to test the model |
| Supervised Learning | Learning with labeled data |
| Unsupervised Learning | Learning without labeled data (e.g., clustering) |

📊 6. Example: Loan Application Classification

Training Data

| Name | Age | Income | Loan Decision |
|---|---|---|---|
| Sandy Jones | Youth | Low | Risky |
| Caroline Fox | Middle-aged | High | Safe |
| Susan Lake | Senior | Low | Safe |

Learned Rules (Classifier Output)

IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky

New Data Input (Test Tuple)

| Name | Age | Income | Predicted Decision |
|---|---|---|---|
| John Henry | Middle-aged | Low | Risky |

📌 7. Supervised vs Unsupervised Learning

| Feature | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Input Data | Labeled | Unlabeled |
| Output | Predict known class labels | Group similar data points |
| Example Task | Classification | Clustering |
| Example Method | Decision Trees, SVM | K-Means, Hierarchical |
| Known Classes? | Yes | No |

 Example of Unsupervised Learning:

o If loan decisions (safe/risky) were not known, we could use clustering to group similar applications.

o These clusters could correspond to risk levels (even if we don't know what “safe” or “risky” means at first).

🧠 8. Advantages of Classification Models

 Interpretable (in rule or tree form).

 Generalizable: Can be applied to unseen data.

 Compressed Representation: Reduces the need to store full data.

 Insights: Helps discover patterns and relationships in data.

🚫 Note on Overfitting

 A model trained too closely on the training data may overfit:

o Captures noise and anomalies.

o Performs poorly on new data.

 Hence, always test on unseen data to estimate real performance.

Decision Tree Induction


 Definition: Decision Tree Induction is a process of learning decision
trees from class-labeled training data (tuples).

 Structure:

o Root node: Topmost node representing the first attribute split.

o Internal nodes: Test on an attribute.

o Branches: Outcome of a test.

o Leaf nodes: Final class label (decision outcome).

🎯 Purpose of Decision Trees

 To classify an unknown tuple by tracing a path from root to leaf based on attribute values.

 Can be easily translated into classification rules (IF-THEN rules).

 Popular due to simplicity, interpretability, and ability to handle multi-dimensional data.

✅ Advantages

 No domain knowledge or parameter tuning required.

 Good performance in many real-world tasks (medicine, finance, astronomy, etc.).

 Easy to understand for humans.


 Fast learning and classification.

⚙️Basic Decision Tree Induction Algorithm

Input:

 D: Training data (tuples with class labels)

 Attribute list: List of candidate attributes

 Attribute selection method: Heuristic (e.g., Information Gain, Gini Index)

Output:

 A Decision Tree

Steps:

1. Create a node N.

2. If all tuples in D belong to the same class → label N as a leaf node with that class.

3. If attribute list is empty → label N as a leaf node with the majority class in D.

4. Apply Attribute selection method to find the best attribute to split.

5. Label N with the splitting attribute.

6. Remove splitting attribute from attribute list (if multiway split is allowed).

7. For each outcome of the split:

o Create subset Dj of tuples matching that outcome.

o If Dj is empty → attach a leaf node with majority class of D.

o Else → recursively call the algorithm on Dj.
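
A minimal Python sketch of the recursive procedure above (the data representation, the `build_tree` and `majority_class` helper names, and the `select_attribute` heuristic, e.g., information gain, are illustrative assumptions rather than part of the original algorithm listing):

```python
from collections import Counter

def majority_class(D):
    """Most frequent class label in partition D (a list of (attributes, label) pairs)."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def build_tree(D, attribute_list, select_attribute):
    """Recursive sketch of the steps above. D is a list of (dict, label) tuples;
    select_attribute(D, attrs) is an assumed heuristic returning the best
    splitting attribute. Returns a nested dict or a class label."""
    labels = {label for _, label in D}
    if len(labels) == 1:                       # step 2: all tuples share one class
        return labels.pop()
    if not attribute_list:                     # step 3: no attributes left -> majority vote
        return majority_class(D)
    best = select_attribute(D, attribute_list)            # step 4: choose splitting attribute
    remaining = [a for a in attribute_list if a != best]  # step 6: multiway split
    subtree = {}                                          # step 5: node labeled with `best`
    for value in {x[best] for x, _ in D}:                 # step 7: one branch per outcome
        Dj = [(x, y) for x, y in D if x[best] == value]
        # empty partitions only arise when branching over a fixed value domain
        subtree[value] = build_tree(Dj, remaining, select_attribute) if Dj else majority_class(D)
    return {best: subtree}
```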

🌳 Splitting Criteria Scenarios

1. Discrete-valued attribute:

o Each known value of A gets a separate branch.

o E.g., color? → red, green, blue (each gets a branch).

2. Continuous-valued attribute:
o Two branches using a split-point.

o Conditions: A ≤ split-point and A > split-point.

3. Binary tree with discrete attribute:

o Create a subset SA of values.

o Condition: A ∈ SA? → binary split into yes/no.

🛑 Stopping Conditions

1. All tuples belong to the same class.

2. No attributes left → majority voting is used.

3. Empty partition Dj → leaf with majority class of parent D.

Computational Complexity

 O(n × |D| × log(|D|)), where:

o n = number of attributes

o |D| = number of tuples

💡 Notable Algorithms

 ID3 (Iterative Dichotomiser 3) - by Quinlan

 C4.5 - successor of ID3

 CART (Classification and Regression Trees) - binary decision trees

🔄 Incremental Learning

 Some decision tree algorithms allow incremental updates, adjusting the tree with new data instead of rebuilding it from scratch.

Attribute Selection Measures


🔍 What is Information Gain?
Information Gain (IG) is a measure based on information theory,
particularly entropy, which was introduced by Claude Shannon.

 Entropy (Info(D)): It tells us how impure (or uncertain) the dataset is. If a dataset contains mixed class labels (e.g., both "yes" and "no"), entropy is high. If all tuples belong to one class, entropy is zero.

 Formula for Entropy:


Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the proportion of tuples in class C_i.

🧠 How is Information Gain Calculated?

1. Compute entropy of the whole dataset Info(D).

2. Split the dataset D based on an attribute A:

o If A is discrete, partition D into subsets based on its distinct values (e.g., "youth", "middle-aged", "senior").

o If A is continuous, evaluate possible split points (e.g., midpoints between adjacent sorted values).

3. Compute the weighted average entropy after the split, Info_A(D):

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \cdot Info(D_j)

where D_j is the subset of tuples with outcome a_j of attribute A.

4. Calculate the Information Gain:

Gain(A) = Info(D) - Info_A(D)

🌲 Why Use Information Gain in ID3?

 It selects the attribute that best reduces entropy, i.e., increases purity.

 The attribute with the highest Information Gain is chosen for the decision node.

 It aims to reduce uncertainty at each step and build a tree that classifies data with as few questions as possible.
📚 Example with “Age” Attribute (from Table 8.1):

 Dataset has 14 tuples: 9 “yes” and 5 “no”

Info(D) = -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) = 0.940 \text{ bits}

Then, for age:

 Youth: 2 yes, 3 no

 Middle-aged: 4 yes, 0 no

 Senior: 3 yes, 2 no

Calculate the weighted average entropy Info_age(D):

Info_age(D) = \frac{5}{14}\cdot 0.971 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971 = 0.694 \text{ bits}
Then,

Gain(age) = 0.940 - 0.694 = 0.246


Do this for all attributes, choose the one with the highest gain.
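
The entropy and information gain calculations above can be reproduced with a short Python sketch (the `entropy` and `info_gain` helpers and the toy data layout are assumptions chosen to match the 14-tuple example):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attribute, labels):
    """Gain(A) = Info(D) - Info_A(D) for a discrete-valued attribute."""
    total = len(labels)
    info_d = entropy(labels)
    info_a = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        info_a += (len(subset) / total) * entropy(subset)
    return info_d - info_a

# 14 tuples: 9 "yes" / 5 "no", with the age distribution given above
# (youth 2 yes / 3 no, middle-aged 4 yes / 0 no, senior 3 yes / 2 no).
rows = [{"age": "youth"}] * 5 + [{"age": "middle-aged"}] * 4 + [{"age": "senior"}] * 5
labels = ["yes"] * 2 + ["no"] * 3 + ["yes"] * 4 + ["yes"] * 3 + ["no"] * 2

print(round(entropy(labels), 3))                 # 0.94 bits
print(round(info_gain(rows, "age", labels), 3))  # 0.246
```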

Gain Ratio
Gain Ratio is a metric used in decision tree algorithms (like C4.5) to select
the best attribute for splitting the dataset. It addresses a major limitation
of Information Gain, which tends to favor attributes with many distinct
values (like unique IDs), even when those attributes are poor for
generalization.

Why Not Just Use Information Gain?

Information Gain (IG) can be biased. For example:

 If an attribute like product_id uniquely identifies each record, splitting on it will create pure partitions, each with only one tuple.

 The entropy in each partition is 0, so IG is maximized.

 But such a split is useless for predicting new data (i.e., overfitting).

How Gain Ratio Fixes This


Gain Ratio penalizes attributes with many outcomes by normalizing the
Information Gain using a value called Split Information, which measures
how broadly and evenly the data is split.

Formulas

Let’s define:

 D : dataset

 D j: subset of D for outcome j

 v : number of distinct outcomes for attribute A

1. Split Information:
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \cdot \log_2\left(\frac{|D_j|}{|D|}\right)
This measures the potential information created by the split itself,
regardless of class labels.

2. Gain Ratio:

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}

This corrects the bias by dividing Information Gain by how “uniformly” the
attribute splits the data.

Note: If SplitInfo is very small (close to zero), GainRatio becomes unstable.


To avoid this, C4.5 ensures that the Information Gain is above average
before considering GainRatio.

Example
Given a split on income resulting in three partitions:

 low: 4 tuples

 medium: 6 tuples

 high: 4 tuples

Step 1: Compute SplitInfo

SplitInfo_income(D) = -\left(\frac{4}{14}\log_2\frac{4}{14} + \frac{6}{14}\log_2\frac{6}{14} + \frac{4}{14}\log_2\frac{4}{14}\right) = 1.557
Step 2: Use Gain from Example 8.1

Gain(income) = 0.029

Step 3: Compute GainRatio


GainRatio(income) = \frac{0.029}{1.557} \approx 0.019
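
A small sketch of the Gain Ratio computation, assuming the partition sizes and the Gain(income) value from the example (`split_info` and `gain_ratio` are hypothetical helper names):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) for a split producing partitions of the given sizes."""
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    si = split_info(partition_sizes)
    return gain / si if si > 0 else 0.0

# income split from the example: partitions of 4, 6, and 4 tuples,
# with Gain(income) = 0.029 taken from the text.
print(round(split_info([4, 6, 4]), 3))         # 1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # 0.019
```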

Bayes Classification Methods

🔹 What Are Bayesian Classifiers?

 Bayesian classifiers are statistical classifiers that predict class membership probabilities, such as the probability that a data tuple belongs to a specific class.

 They use Bayes’ theorem to make these predictions.

 Naïve Bayes, a simplified Bayesian method, performs comparably to decision trees and neural networks.

 They are known for high accuracy and speed, especially on large datasets.

🔹 Assumption of Naïve Bayes:

 Assumes class-conditional independence:

o The value of one attribute is independent of another, given the class label.

o This simplification makes computations more efficient — hence the term "naïve."

Gini Index
The Gini Index is used in the CART (Classification and Regression
Trees) algorithm to measure the impurity of a dataset.

🔸 Definition

For a dataset DDD with mmm classes:


m
Gini ( D )=1−∑ p 2i
i=1
pi: Proportion of tuples in class C i within dataset D

 A Gini index of 0 means the dataset is pure (only one class present).

 A Gini index closer to 1 indicates high impurity (mixed classes).

🔹 Binary Splits

CART allows only binary splits (dividing into two groups). For each
attribute, we:

1. Try all possible binary splits.

2. Calculate the Gini index for each split.

3. Choose the split that minimizes the Gini index of the resulting
partitions.

🔸 Discrete Attributes

If attribute A has v values, then there are 2^v − 2 valid binary splits (excluding the empty set and the full set).

Example:

 If income has values {low, medium, high}, valid splits are:

o {low}, {medium}, {high}, {low, medium}, {low, high}, {medium, high}


Each subset becomes a binary test like:

"Is income in {low, medium}?"

🔸 Continuous Attributes

For numeric attributes:

 Sort the values.

 Use midpoints between adjacent values as possible split points.

 Each split is of the form:

"Is A ≤ split_point?"

🔹 Weighted Gini Index After Split

If a split on attribute A divides D into D_1 and D_2:

Gini_A(D) = \frac{|D_1|}{|D|} \cdot Gini(D_1) + \frac{|D_2|}{|D|} \cdot Gini(D_2)

🔹 Gini Gain (Reduction in Impurity)

Gain(A) = Gini(D) - Gini_A(D)

Choose the attribute and split that maximizes the gain (i.e., minimizes Gini_A(D)).

🔹 Example 8.3 – Gini Index in Action

Let the dataset D have:

 9 tuples of class "yes"

 5 tuples of class "no"

Step 1: Initial Impurity of D

Gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

Step 2: Evaluate Attribute income

Split: income in {low, medium}

 D₁ = 10 tuples: 7 yes, 3 no → Gini(D₁) = 1 - (0.7² + 0.3²) = 0.42

 D₂ = 4 tuples: 2 yes, 2 no → Gini(D₂) = 0.5


Gini_income(D) = \frac{10}{14}\cdot 0.42 + \frac{4}{14}\cdot 0.5 = 0.443

Other splits gave higher values:

 {low, high} → Gini = 0.458

 {medium, high} → Gini = 0.450

✅ Best income split: {low, medium} with Gini = 0.443

Step 3: Evaluate Other Attributes

 age = {youth, senior} → Gini = 0.357

 student → Gini = 0.367


 credit_rating → Gini = 0.429

✅ Best overall split:


age = {youth, senior}

 Gini reduction = 0.459 - 0.357 = 0.102

So, age is selected as the root node's splitting attribute.
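
A minimal sketch reproducing these Gini calculations in Python (the class-count dictionaries and helper names are assumptions):

```python
def gini(counts):
    """Gini(D) from class counts, e.g. {'yes': 9, 'no': 5}."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(counts1, counts2):
    """Weighted Gini index Gini_A(D) after a binary split into D1 and D2."""
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

# Example 8.3: D has 9 "yes" and 5 "no" tuples.
print(round(gini({"yes": 9, "no": 5}), 3))                             # 0.459
# income in {low, medium}: D1 = 7 yes / 3 no, D2 = 2 yes / 2 no
print(round(gini_split({"yes": 7, "no": 3}, {"yes": 2, "no": 2}), 3))  # 0.443
```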

🔷 Other Attribute Selection Measures

These are used to determine which attribute to split the dataset on at each node of the tree:

1. Information Gain:

o Biased towards attributes with many distinct values.

o Might lead to overfitting with multivalued attributes.

2. Gain Ratio:

o Adjusts information gain by incorporating split information.

o Tends to favor unbalanced splits.

3. Gini Index:

o Also biased toward multivalued attributes.

o Prefers equal-sized and pure partitions.

o Struggles with datasets having many classes.

4. Other Measures:

o CHAID: Uses chi-square test; often used in marketing.

o C-SEP: Sometimes better than information gain and Gini.

o G-statistic

Other Attribute Selection Measures
Common Issues with Traditional Measures:

 Information Gain:

o Biased towards attributes with many values (multi-valued).

 Gain Ratio:

o Adjusts for bias of Information Gain but prefers unbalanced splits (e.g., one partition is much smaller).

 Gini Index:

o Also biased toward multi-valued attributes.

o Struggles with many-class problems.

o Prefers equal-sized and pure partitions.

Alternative Measures:

 CHAID: Uses the Chi-square (χ²) test of independence.

 C-SEP: Performs better than Information Gain & Gini in certain scenarios.

 G-statistic: Similar to χ², information-theoretic.

 MDL (Minimum Description Length):

o Least biased toward multi-valued attributes.

o Chooses trees that minimize encoding cost (both the tree and its misclassified exceptions).

 Multivariate Splits:

o Combine multiple attributes (e.g., linear combinations).

o Used in systems like CART.

o Essentially a type of attribute construction (discussed in data transformation).

Summary:

 No measure is universally best.

 Measures producing shallower trees (multiway, balanced splits) are often preferred.
 Shallower trees can increase leaf count and error – it's a tradeoff.

Tree Pruning (Handling Overfitting)


Purpose:

 To avoid overfitting by removing branches reflecting noise or outliers.

Two Approaches:

📌 1. Prepruning (Early Stopping)

 Stop growing the tree early based on a threshold (e.g., gain, Gini).

 Node becomes a leaf with:

o Most frequent class

o Or probability distribution of classes

 ⚠️Challenge: Choosing the right threshold (high = oversimplified, low = ineffective)

📌 2. Postpruning (Most Common)

 Build full tree first, then prune.

 Replace subtrees with a leaf node.

 Label = most frequent class in subtree.

Examples:

 Cost Complexity Pruning (CART):

o Based on:

 Number of leaves

 Error rate

o Uses a pruning set (different from training/test set).

 Pessimistic Pruning (C4.5):

o Uses training data, no separate prune set.

o Adjusts for optimism by adding a penalty to error estimate.

 MDL-based Pruning:

o Prune based on encoding length.


o Simplest (shortest encoded) tree is preferred.

o No need for a separate prune set.

Hybrid:

 Combine prepruning and postpruning.

🔶 3. Scalability of Decision Tree Induction

Problem:

 Traditional algorithms (ID3, C4.5, CART) require all training data in memory.

 Not suitable for large-scale, disk-resident datasets.

Scalable Algorithms:

📌 RainForest

 Uses AVC-sets (Attribute-Value-Classlabel) at each node.

 Keeps class distributions for each attribute.

 Efficient because size depends on the number of distinct values and classes, not the number of tuples.

 Handles memory overflow gracefully.

 Compatible with any decision tree algorithm.

📌 BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

 Uses bootstrapping to build decision trees from memory-fit samples.

 Constructs multiple trees from subsets → merges them to form the final tree.

 Efficient:

o Only 2 scans of data.

o Faster than RainForest.

 Supports incremental updates (can adjust for new/deleted data without full reconstruction).

🔶 4. Visual Mining for Decision Tree Induction

PBC (Perception-Based Classification):


 Interactive decision tree construction.

 Helps users visualize and understand multidimensional data.

Key Concepts:

 Uses pixel-based visualization:

o Data mapped to a circle segmented by attributes.

o Each pixel represents a value-color for the class label.

 Visualization of homogeneous class regions helps in manual splitting.

Interface:

 Data Interaction Window: shows circle segments of data.

 Knowledge Interaction Window: shows decision tree built so far.

Benefits:

 Allows manual split selection (one or more split points).

 Can generate multiway splits, especially for numeric attributes.

 Trees are usually smaller and more interpretable.

 Supports human-in-the-loop learning by combining domain expertise with algorithmic insights.

Bayes’ Theorem
Let:

 X: a data tuple (evidence).

 H: a hypothesis, e.g., “X belongs to class C.”

We want to compute the posterior probability:

P(H∣ X)

This is the probability that hypothesis H holds given evidence X.

🔹 Key Probability Terms:

| Term | Meaning |
|---|---|
| P(H) | Prior probability of hypothesis H — before seeing data |
| P(X) | Prior probability of evidence X — how often this data occurs |
| P(X ∣ H) | Likelihood — probability of seeing evidence X given H |
| P(H ∣ X) | Posterior probability — updated belief in H after seeing X |

🔹 Bayes’ Theorem Formula:

P(H \mid X) = \frac{P(X \mid H) \cdot P(H)}{P(X)}

Example:

 H: Customer buys a computer.

 X: Customer is 35 years old with $40,000 income.

 Goal: Compute the probability that this customer will buy a computer given age and income.
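
A minimal naive Bayes sketch illustrating Bayes' theorem with the class-conditional independence assumption described above (the frequency-counting estimators, helper names, and toy data are assumptions; no smoothing is applied):

```python
from collections import defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(H) and P(x_k | H) by frequency counting (no smoothing)."""
    class_counts = defaultdict(int)
    cond_counts = defaultdict(int)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for attr, value in row.items():
            cond_counts[(attr, value, label)] += 1
    total = len(labels)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {k: n / class_counts[k[2]] for k, n in cond_counts.items()}
    return priors, likelihoods

def posteriors(x, priors, likelihoods):
    """P(H | X) proportional to P(H) * product of P(x_k | H), assuming
    class-conditional independence; scores are normalized at the end."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= likelihoods.get((attr, value, c), 0.0)
        scores[c] = score
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

# hypothetical toy data: does a customer buy a computer?
rows = [{"age": "youth", "income": "high"}, {"age": "youth", "income": "low"},
        {"age": "middle", "income": "high"}, {"age": "senior", "income": "low"}]
labels = ["no", "no", "yes", "yes"]
priors, likelihoods = train_naive_bayes(rows, labels)
print(posteriors({"age": "youth", "income": "high"}, priors, likelihoods))
```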

Rule-Based Classification
🔸 Overview

A rule-based classifier is a classification system that uses a set of IF-THEN rules to classify data tuples. These rules are easy to understand, implement, and interpret, making them ideal for knowledge representation and decision-making tasks.

🔹 Using IF-THEN Rules for Classification

🔍 Structure of a Rule

A rule is expressed in the form:

IF <condition> THEN <class>

 Antecedent (Precondition): The "IF" part, which is a combination of one or more attribute tests (conditions). These are connected using logical AND.
 Consequent (Conclusion): The "THEN" part, which contains the
predicted class label.

Example Rule (R1):


IF age = youth AND student = yes THEN buys_computer = yes

Alternative representation:

(age = youth) ∧ (student = yes) → (buys_computer = yes)

✅ Rule Satisfaction

 A rule is satisfied (triggered) if all attribute conditions in its antecedent are true for a tuple.

 If a rule is triggered by a tuple, it is said to cover that tuple.

📊 Rule Evaluation: Coverage and Accuracy

Let:

 n_covers = number of tuples covered (i.e., satisfied) by the rule

 n_correct = number of correctly classified tuples among those covered

 |D| = total number of tuples in the dataset

🧮 Coverage:

Percentage of tuples in the dataset that are covered by the rule:


coverage(R) = \frac{n_{covers}}{|D|}
🧮 Accuracy:

Percentage of correctly classified tuples among those that are covered:


accuracy(R) = \frac{n_{correct}}{n_{covers}}
📌 Example (R1):

 n_covers = 2, n_correct = 2, |D| = 14

 Coverage: 2/14 = 14.28%

 Accuracy: 2/2 = 100%
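
A small sketch of these two measures, assuming each rule is represented as a condition function plus a predicted class (the `rule_metrics` helper and the data layout are assumptions):

```python
def rule_metrics(condition, predicted_class, dataset):
    """coverage(R) = n_covers / |D|; accuracy(R) = n_correct / n_covers.
    dataset is assumed to be a list of (attribute_dict, class_label) pairs."""
    covered = [(x, y) for x, y in dataset if condition(x)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == predicted_class)
    coverage = n_covers / len(dataset)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# R1: IF age = youth AND student = yes THEN buys_computer = yes
r1_condition = lambda x: x.get("age") == "youth" and x.get("student") == "yes"
# coverage, accuracy = rule_metrics(r1_condition, "yes", dataset)
```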

📥 Rule Firing and Prediction

To classify a new tuple X, we follow this process:

1. Check which rules are satisfied by X (triggered).

2. If only one rule is triggered, that rule fires and assigns its class
label.

3. If multiple rules are triggered, use a conflict resolution strategy.

4. If no rule is triggered, use a default rule.

⚔️Conflict Resolution Strategies

When multiple rules are triggered, we must choose which one to fire.
Common strategies include:

1. Size Ordering

 Preference is given to the rule with the most conditions (the most specific rule).

 The rule with the largest antecedent size (most attribute tests) is selected.

2. Rule Ordering

 Rules are arranged in a priority list (called a decision list).

 The first rule in the list that is triggered fires.

 Ordering criteria can include:

o Rule accuracy

o Rule coverage

o Rule size

o Domain expert knowledge

There are two kinds of rule ordering:

 Class-Based Rule Ordering


o Classes are sorted by importance (e.g., frequency or cost).

o Rules within each class are unordered.

 Rule-Based Ordering

o Rules are ordered globally, regardless of class.

o The first matching rule fires.

🧠 Note: Rules in a decision list imply negation of the earlier rules. That is,
each rule assumes the previous ones didn’t apply.

🛑 What if No Rule Matches?

 A default rule can be used.

 It has no condition and applies only if no other rule is triggered.

 Typically predicts the majority class (overall or among uncovered tuples).

📘 Rule Extraction from a Decision Tree


🔸 Overview

Decision trees are a widely used classification method due to their simplicity and effectiveness. However, large trees can be hard to interpret. To improve interpretability, we can convert a decision tree into a set of IF-THEN classification rules.

🔹 Why Extract Rules?

 Rules are often easier for humans to understand than a large decision tree.

 They represent knowledge in a modular, interpretable way.

 Rule sets can be used as an alternative model to a tree classifier.

🔹 How to Extract Rules from a Decision Tree

📌 Basic Idea:

 One rule is created per path from the root to a leaf node.
 The splitting criteria (conditions) along the path form the
antecedent (IF part).

 The class label at the leaf node becomes the consequent (THEN part).

🧪 Example – Extracting Rules

From a decision tree (as in Figure 8.2 of the textbook), the following rules
are extracted:

R1:

IF age = youth AND student = no THEN buys_computer = no

R2:

IF age = youth AND student = yes THEN buys_computer = yes

R3:

IF age = middle-aged THEN buys_computer = yes

R4:

IF age = senior AND credit_rating = excellent THEN buys_computer = yes

R5:

IF age = senior AND credit_rating = fair THEN buys_computer = no

⚙️Properties of Extracted Rules

1. Mutually Exclusive:

o No two rules apply to the same tuple.

o Each tuple maps to exactly one path in the tree.

2. Exhaustive:

o There is a rule for every possible attribute-value combination.

o No need for a default rule.

3. Unordered Rules:

o Since the rules are mutually exclusive, order does not matter.
🧹 Rule Pruning

Although rule extraction is straightforward, the resulting rule set can be large and contain redundant or irrelevant conditions. Therefore, rule pruning is necessary.

🔧 Purpose of Pruning:

 Simplify rules.

 Improve generalization.

 Eliminate unnecessary conditions or rules.

✂️How to Prune Rules

For a given rule:

 Remove any condition that does not improve the rule’s estimated accuracy.

 This leads to a more general and concise rule.

📉 C4.5 Rule Pruning Approach

 Extracts rules from an unpruned tree.

 Uses a pessimistic pruning approach:

o Rule accuracy is estimated from training data.

o To avoid optimistic bias, the estimate is adjusted pessimistically.

 Any rule that does not improve overall accuracy is removed.

⚠️Consequences of Pruning

 After pruning, the rules are no longer mutually exclusive or exhaustive.

 This can lead to rule conflicts or uncovered tuples.

⚔️Conflict Resolution (Post-Pruning)

To manage conflicts among overlapping rules, C4.5 uses class-based rule ordering:
1. Group rules by class.

2. Rank class rule sets to minimize false positives:

o A false positive occurs when a rule predicts class C but the actual class is not C.

3. Evaluate class rule sets in order of increasing false positives.

🧾 Default Class Selection (C4.5)

 Default class is NOT the majority class.

 Instead, it is the class that contains the most uncovered training tuples.

 This strategy ensures better handling of tuples not matched by any rule.

📘 Rule Induction Using a Sequential Covering Algorithm

🔹 Overview

 Goal: Generate a set of IF-THEN classification rules directly from training data, without building a decision tree first.

 Approach: Learn one rule at a time for each class.

 Method: Known as the Sequential Covering Algorithm, widely used for disjunctive rule learning.

🔸 Why “Sequential Covering”?


 Sequential: Rules are learned one at a time.

 Covering: Each rule ideally covers many tuples of the target class and few or none of the others.

 After learning a rule: Remove the tuples it covers and repeat the
process with remaining data.

🔸 Popular Algorithms

Some implementations of sequential covering:

 AQ

 CN2

 RIPPER (Repeated Incremental Pruning to Produce Error Reduction)

📌 Comparison with Decision Tree Induction

| Feature | Sequential Covering | Decision Tree |
|---|---|---|
| Learning Style | One rule at a time | Entire tree at once |
| Rule Generation | Directly from data | Paths from tree |
| Tuple Handling | Covered tuples removed | All tuples used |
| Rule Set | Can be overlapping | Mutually exclusive |
🔧 Basic Sequential Covering Algorithm (Steps)

Input:

- D: Dataset of class-labeled tuples

- Att vals: Set of all attribute-value pairs

Output:
- A set of IF-THEN rules

Method:

1. Initialize rule set as empty.

2. For each class c:

o Repeat:

 Learn a rule for class c.

 Remove tuples covered by this rule from D.

 Add rule to the rule set.

o Until a terminating condition is met.

3. Return the complete rule set.
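
A minimal sketch of the loop above (the `learn_one_rule` procedure is assumed to exist and is not shown; a rule is represented as a condition function plus a class label):

```python
def sequential_covering(D, classes, learn_one_rule, max_rules=100):
    """Basic sequential covering sketch. D is a list of (x, y) tuples;
    learn_one_rule(data, c) is an assumed callable returning (condition_fn, c)
    or None when no acceptable rule can be found for class c."""
    rule_set = []
    for c in classes:
        data = list(D)
        while any(y == c for _, y in data) and len(rule_set) < max_rules:
            rule = learn_one_rule(data, c)      # learn one rule for class c
            if rule is None:                    # terminating condition
                break
            rule_set.append(rule)
            condition, _ = rule
            # remove the tuples covered by this rule, then repeat on the rest
            data = [(x, y) for x, y in data if not condition(x)]
    return rule_set
```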

🧠 Rule Learning (Learn One Rule)

 Rule learning is done in a general-to-specific manner (greedy strategy).

 Start: with the most general rule (empty antecedent).

IF [ ] THEN class = C

 At each step:

o Consider adding attribute tests (like income = high, age > 25).

o Select the test that gives the best improvement in rule quality.

 Add this test to the rule, and repeat until rule quality is satisfactory.

🔍 Search Strategy

General-to-Specific Search:

 Begins with the most general rule.

 Gradually specialize by adding conjuncts (AND conditions).

Greedy Search:

 No backtracking.

 Always picks the best local option at each step.


 Fast but may make sub-optimal choices.

Beam Search (Improved):

 Maintain top k candidates instead of just 1.

 Reduces risk of bad decisions due to local optima.

 Beam width = k

📊 Example: Loan Decision Data

Imagine a dataset with features like age, income, credit rating, etc.

 Start with:

IF [ ] THEN loan_decision = accept

 Add best test:

IF income = high THEN loan_decision = accept

 Add next best:

IF income = high AND credit_rating = excellent THEN loan_decision = accept

 Continue until the rule reaches an acceptable accuracy or quality threshold.

📈 Rule Quality

 Rules are evaluated based on quality measures, e.g.:

o Accuracy

o Coverage

o Confidence

o Information Gain

 The goal is to maximize accuracy while keeping the rule simple.

🛑 Termination Conditions

Rule induction for a class stops when:

 No more tuples of that class remain.

 No high-quality rule can be learned.


 A predefined number of rules has been generated.

 Rule’s accuracy falls below a threshold.

✅ Advantages

 No need to build a full decision tree.

 Can produce compact, interpretable rule sets.

 Flexible in handling overlapping classes.

⚠️Limitations

 Greedy strategy may miss better rules.

 Tuple removal can sometimes lose useful training data.

 Beam search helps but increases computational complexity.

Model Evaluation and Selection


Why Evaluate a Model?

 To estimate how well a classifier will perform on unseen (future) data.

 To compare multiple classifiers and choose the best one.

 Evaluation helps avoid overfitting and provides realistic performance expectations.

8.5.1: Metrics for Evaluating Classifier Performance

Key Definitions:

 True Positives (TP): Correctly predicted positive cases.


 True Negatives (TN): Correctly predicted negative cases.

 False Positives (FP): Negative cases incorrectly predicted as positive.

 False Negatives (FN): Positive cases incorrectly predicted as negative.

Confusion Matrix (Binary Class Example):

|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

Evaluation Metrics:

| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | (TP + TN) / (P + N) | Overall correctness |
| Error Rate | (FP + FN) / (P + N) = 1 − Accuracy | Rate of incorrect predictions |
| Sensitivity (Recall / TPR) | TP / P | True Positive Rate (correct positives) |
| Specificity (TNR) | TN / N | True Negative Rate (correct negatives) |
| Precision | TP / (TP + FP) | Correctness of predicted positives |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall |
| Fβ Score | (1 + β²) × (Precision × Recall) / ((β² × Precision) + Recall) | Weighted F-measure |

 F1: Equal weight to precision and recall.

 F2: Recall is more important.

 F0.5: Precision is more important.
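
A short sketch computing these metrics from confusion-matrix counts (the function name and the example counts are assumptions):

```python
def classification_metrics(tp, fp, tn, fn, beta=1.0):
    """Evaluation metrics from binary confusion-matrix counts."""
    p, n = tp + fn, tn + fp                    # actual positives / negatives
    accuracy = (tp + tn) / (p + n)
    sensitivity = tp / p if p else 0.0         # recall / TPR
    specificity = tn / n if n else 0.0         # TNR
    precision = tp / (tp + fp) if tp + fp else 0.0
    b2 = beta ** 2
    f_beta = ((1 + b2) * precision * sensitivity / (b2 * precision + sensitivity)
              if precision + sensitivity else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f_beta": f_beta}

print(classification_metrics(tp=90, fp=10, tn=880, fn=20))  # hypothetical counts
```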


When Accuracy is Misleading

 In imbalanced datasets, accuracy can be high even if the classifier fails on the minority (important) class.

o Example: If only 3% of cases are positive (e.g., cancer), a 97% accuracy might still mean all positives are misclassified.

o Use sensitivity, specificity, precision, and F-measures instead.

Additional Classifier Evaluation Criteria

 Speed: Time to train and test the model.

 Robustness: Performance with noisy or missing data.

 Scalability: Ability to handle increasing data size efficiently.

 Interpretability: How understandable the model is (e.g., decision trees vs. neural networks).

Model Evaluation Methods


1. Holdout Method

 Procedure: Split data into two sets — training (usually 2/3) and test
(1/3).

 Use: Model is trained on the training set and evaluated on the test
set.

 Drawback: Accuracy estimate is pessimistic because only part of the data is used for training.

2. Random Subsampling

 Variation of holdout method.

 Procedure: Repeat the holdout method k times with different random splits.

 Accuracy Estimate: Average accuracy over k iterations.

3. k-Fold Cross-Validation

 Procedure:
o Split data into k equal-sized folds.

o For each iteration, one fold is used as the test set and the remaining k−1 folds for training.

 Advantage: Every tuple is used exactly once for testing and k−1 times for training.

 Common choice: k = 10 is popular due to low bias and variance.

🔸 Special Cases:

 Leave-One-Out Cross-Validation (LOOCV):

o k=n, where n = number of samples.

o Each test set contains only one tuple.

 Stratified Cross-Validation:

o Ensures each fold has approximately the same class distribution as the full dataset.

o More reliable for imbalanced datasets.
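
A minimal sketch of plain (unstratified) k-fold cross-validation, assuming the caller supplies the `train_fn` and `accuracy_fn` callables:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train_fn, accuracy_fn, k=10):
    """Each fold serves once as the test set; the other k-1 folds train the model.
    Returns the mean accuracy over the k iterations."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        test_set = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in test_set]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(accuracy_fn(model, [X[i] for i in test_idx],
                                  [y[i] for i in test_idx]))
    return sum(scores) / k
```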

4. Bootstrap Method

 Procedure:

o Create a bootstrap sample by sampling with replacement from the dataset (same size as the original).

o Tuples not selected form the test set.

 Key Insight: On average:

o 63.2% of tuples appear in training set.

o 36.8% form the test set.

o Based on the probability: \left(1 - \frac{1}{d}\right)^d \approx e^{-1} \approx 0.368

 .632 Bootstrap Accuracy Estimate :


Acc(M) = \frac{1}{k} \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{test} + 0.368 \cdot Acc(M_i)_{train} \right)

 Use Case: Effective for small datasets, but may be overly optimistic.
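
A small sketch of forming one bootstrap sample and its out-of-bag test set (the helper name and toy data are assumptions); the fraction of distinct tuples drawn comes out near 0.632, as stated above:

```python
import random

def bootstrap_split(data, seed=0):
    """Sample |data| tuples with replacement; unsampled tuples form the test set."""
    rng = random.Random(seed)
    n = len(data)
    train_idx = [rng.randrange(n) for _ in range(n)]   # sampling with replacement
    test_idx = [i for i in range(n) if i not in set(train_idx)]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

sample, out_of_bag = bootstrap_split(list(range(1000)))
print(len(set(sample)) / 1000)   # roughly 0.632 of distinct tuples in the training sample
```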
🔍 Model Selection Using Statistical Tests of Significance
When you have two models (M1 and M2) and use 10-fold cross-
validation to compare their performance, it's not enough to simply pick
the model with the lowest average error rate. The difference might just be
due to random variation.

✅ What you need:

 Multiple rounds of 10-fold cross-validation (say 10 runs).

 Calculate the mean error rate and standard deviation of errors for each model.

 Use the paired Student’s t-test to check if the difference is statistically significant.

🧪 The Paired t-test Formula:

To compare error rates of M1 and M2 over 10 rounds (k=10):

t = \frac{\overline{err}(M_1) - \overline{err}(M_2)}{\sqrt{var(M_1 - M_2) / k}}

where:

var(M_1 - M_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ err(M_1)_i - err(M_2)_i - \overline{diff} \right]^2

 Degrees of freedom = k - 1 = 9

 Use a t-distribution table to find the critical t-value for a 5% significance level (look up 0.025 because it's two-tailed).

 If your computed t is larger than the table value, you reject the
null hypothesis (i.e., the difference is real).
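
A minimal sketch of the paired t statistic above (the helper name and the error-rate values are hypothetical):

```python
import math

def paired_t_statistic(err_m1, err_m2):
    """Paired t statistic over k cross-validation rounds (k - 1 degrees of freedom)."""
    k = len(err_m1)
    diffs = [a - b for a, b in zip(err_m1, err_m2)]
    mean_diff = sum(diffs) / k
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k
    return mean_diff / math.sqrt(var_diff / k)

# hypothetical error rates from 10 rounds of 10-fold cross-validation
err_m1 = [0.12, 0.15, 0.11, 0.14, 0.13, 0.12, 0.16, 0.13, 0.14, 0.12]
err_m2 = [0.15, 0.17, 0.14, 0.16, 0.15, 0.14, 0.18, 0.16, 0.15, 0.14]
t = paired_t_statistic(err_m1, err_m2)
# compare |t| against the critical value for df = 9 at the 5% two-tailed level (≈ 2.262)
```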

⚖️Comparing Classifiers Based on Cost–Benefit and ROC Curves

💰 Cost–Benefit Analysis:

In some applications, false positives and false negatives have different consequences:

 Example: Cancer diagnosis

o False negative (missed cancer): very dangerous

o False positive (incorrect cancer): more tests, costs


Instead of accuracy, compute average cost per decision by assigning
real costs to TP, FP, TN, and FN.

📈 ROC Curve (Receiver Operating Characteristic Curve)

An ROC curve plots:

 TPR (True Positive Rate) on the y-axis

 FPR (False Positive Rate) on the x-axis

Steps to Plot ROC:

1. Use a probabilistic classifier (e.g., Naive Bayes, neural nets).

2. Rank test tuples by their predicted probability of being positive.

3. Set different thresholds and compute TPR & FPR.

4. Plot points (FPR, TPR). Move:

o Up for TP

o Right for FP

Example:

If a tuple with high probability is actually positive → TP ↑ → move up.

Interpretation:

 The steeper and higher the curve, the better the model.

 Area Under Curve (AUC):

o AUC = 1 → perfect model

o AUC = 0.5 → random guessing

o The larger the AUC, the better
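
A small sketch of building ROC points and computing AUC with the trapezoidal rule (the helper names, the ranking procedure, and the example scores/labels are assumptions):

```python
def roc_points(scores, labels):
    """Sweep a threshold over tuples ranked by predicted probability of the
    positive class: move up for a true positive, right for a false positive."""
    p = sum(labels)
    n = len(labels) - p
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y == 1:
            tp += 1      # true positive: move up
        else:
            fp += 1      # false positive: move right
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# hypothetical probabilistic classifier output
scores = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.43, 0.42, 0.41, 0.30]
labels = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
print(round(auc(roc_points(scores, labels)), 2))
```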

Comparing Models (e.g., M1 vs M2):

 If M1’s ROC curve is above M2’s, then M1 is more accurate.

 Figure 8.20 shows M1 better than M2.

Techniques to Improve Classification Accuracy


1. Ensemble Methods
 Definition: An ensemble method is a technique that combines
multiple classifiers (models) to create a composite model that is
generally more accurate than any individual model.

 Purpose: Boost classification accuracy and model robustness.

 Mechanism: Each base classifier votes on the class label, and the
ensemble outputs the majority-voted label.

Why Ensembles Work

 Error Reduction: An ensemble reduces the chance of misclassification because it only misclassifies when the majority of base classifiers are wrong.

 Diversity: Ensembles are more effective when the base classifiers are diverse (i.e., they make different kinds of errors).

 Better Than Random Guessing: Each classifier should perform better than random chance.

 Parallelizability: Since each model can be trained independently, ensemble methods can be efficiently parallelized.

2. Key Ensemble Methods

a. Bagging (Bootstrap Aggregating) – Section

 Trains each classifier on a different bootstrap sample of the data.

 Reduces variance and helps avoid overfitting.

b. Boosting – Section

 Trains classifiers sequentially, where each new model focuses on the errors of the previous ones.

 Aims to reduce bias and improve performance on hard-to-classify examples.

c. Random Forests – Section

 A special type of bagging that uses decision trees.

 Adds extra randomness by choosing a random subset of features for each split in the tree.

 Very effective and widely used.

3. Class Imbalance Problem – Section


 Definition: Occurs when one class significantly outnumbers another
(e.g., fraud detection, rare disease).

 Challenge: Standard classifiers may be biased toward the majority class.

 Solutions:

o Data-level approaches (e.g., oversampling the minority class, undersampling the majority class)

o Algorithmic-level approaches (e.g., cost-sensitive learning, ensemble methods tailored for imbalance)

4. Example: Decision Boundaries

 A single decision tree struggles with representing linear decision boundaries.

 An ensemble of decision trees can approximate the linear boundary more accurately.

 This demonstrates how ensembles can generalize better.

1. Bagging (Bootstrap Aggregating)


Concept:

Bagging is an ensemble technique that improves classification (or regression) accuracy by reducing variance through averaging multiple models trained on different subsets of data.

Analogy:

 Just like seeking diagnoses from multiple doctors and deciding based on majority vote, bagging trains multiple models (classifiers) and predicts the class of an instance based on a majority vote from those models.

How Bagging Works:

1. Input:

o Dataset D with d tuples (instances).

o Number of models (iterations) k.

o A base learning algorithm (e.g., decision trees, naive Bayes).


2. Training Phase (Ensemble Creation):
For each i = 1 to k:

o Create a bootstrap sample Di by sampling d tuples from D with replacement.

o Train a model Mi using the learning algorithm on Di.

3. Prediction Phase:

o To classify a new instance X:

 Let each model Mi predict the class label of X.

 Use majority voting among all models to determine the final class label.

o For regression tasks, take the average of all predictions.
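
A minimal bagging sketch following the steps above (the `base_learner` callable and its `.predict` interface are assumptions):

```python
import random
from collections import Counter

def bagging_train(D, k, base_learner, seed=0):
    """Train k models, each on a bootstrap sample of D (sampling with replacement).
    base_learner(sample) is assumed to return a fitted model with a .predict(x) method."""
    rng = random.Random(seed)
    d = len(D)
    return [base_learner([D[rng.randrange(d)] for _ in range(d)]) for _ in range(k)]

def bagging_predict(models, x):
    """Classify x by an unweighted majority vote over all base models."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]
```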

Key Features:

 Sampling with Replacement:

o Some original tuples may be repeated in a Di, while others may be left out.

 Equal Weighting:

o All classifiers contribute equally to the final prediction.

Advantages of Bagging:

 Increased Accuracy:

o The ensemble model usually performs better than a single model trained on the full dataset.

 Reduces Variance:

o Makes the model less sensitive to fluctuations in the training data.

 Robust to Overfitting and Noise:

o Especially effective for high-variance models like decision trees.

2. Boosting and AdaBoost


Concept:
Boosting is an ensemble method that improves accuracy by combining
multiple weak learners. Unlike bagging (where all models are equally
weighted), boosting gives more weight to accurate classifiers and
focuses learning on previously misclassified data.

Intuition:

 Like consulting multiple doctors and weighing their opinions based on past accuracy.

 Classifiers are trained sequentially, and each new model focuses more on the difficult cases misclassified by the previous ones.

How Boosting Works:

1. Initialize Weights:

o All training tuples are initially given equal weight: 1/d for d
data points.

2. Iteratively Train k Classifiers (Rounds):


For each round i from 1 to k:

o Sample a training set Di from D with replacement, based on current tuple weights.

o Train a model Mi on Di.

o Evaluate error rate of Mi on Di using:


error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j)

where err(X_j) = 1 if X_j is misclassified, else 0.

o If error(Mi) > 0.5, discard Mi and retry the round.

o Update weights:

 Increase weights of misclassified tuples.

 Decrease weights of correctly classified tuples:

w_j \leftarrow w_j \cdot \frac{error(M_i)}{1 - error(M_i)}

o Normalize the weights so that they sum to 1.

3. Final Ensemble Prediction:


o Each model’s vote is weighted by its accuracy:

w_i = \log\left(\frac{1 - error(M_i)}{error(M_i)}\right)
o For a new instance X:

 Each model Mi predicts a class label.

 Add wi to that class’s total vote.

 Return the class with the highest total weighted vote.
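
A simplified AdaBoost-style sketch of the procedure above; as a shortcut it evaluates each round's error on the full training set and skips (rather than re-samples) degenerate rounds, and the `base_learner`/`.predict` interface is an assumption:

```python
import math
import random

def adaboost_train(D, k, base_learner, seed=0):
    """D is a list of (x, y); base_learner(sample) is assumed to return a model m
    with m.predict(x). Returns a list of (model, vote_weight) pairs."""
    rng = random.Random(seed)
    d = len(D)
    weights = [1.0 / d] * d                      # step 1: equal tuple weights
    ensemble = []
    for _ in range(k):
        # step 2: sample D_i with replacement according to the current weights
        sample = rng.choices(D, weights=weights, k=d)
        model = base_learner(sample)
        errs = [0.0 if model.predict(x) == y else 1.0 for x, y in D]
        error = sum(w * e for w, e in zip(weights, errs))
        if error > 0.5 or error == 0.0:          # skip degenerate rounds
            continue
        beta = error / (1.0 - error)
        # decrease weights of correctly classified tuples, then normalize
        weights = [w * (beta if e == 0.0 else 1.0) for w, e in zip(weights, errs)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((model, math.log(1.0 / beta)))   # vote weight log((1-err)/err)
    return ensemble

def adaboost_predict(ensemble, x, classes):
    """Return the class with the highest total weighted vote."""
    votes = {c: 0.0 for c in classes}
    for model, w in ensemble:
        votes[model.predict(x)] += w
    return max(votes, key=votes.get)
```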

3. Random Forests

✅ Definition

Random Forest is an ensemble learning technique that builds multiple decision trees and merges their predictions to produce more accurate and stable results. It is applicable to both classification and regression tasks.

🧠 Intuition Behind Random Forest

 A single decision tree is prone to overfitting and sensitive to data noise.

 Random Forest combines many decision trees, each trained slightly differently, to reduce variance without increasing bias.

 Like asking multiple doctors (trees) for a diagnosis, then trusting the
majority vote or average prediction.

⚙️Key Concepts and Working

1. Ensemble Learning

 Combines the predictions from multiple models.

 Random Forest is based on Bagging (Bootstrap Aggregating).

2. Bagging
 From the original dataset D with d samples, we generate k datasets D_1, D_2, ..., D_k by random sampling with replacement (bootstrap sampling).

 Some original samples may appear multiple times; others may not
be selected at all.

3. Random Feature Selection

 At each node in a decision tree, a random subset of features F is selected from the total feature set.

 The best split is chosen only among this subset, not from all
features.

4. Tree Building Process

 Each tree is trained on a different bootstrap sample.

 At each split in the tree:

o Choose F random features.

o Use the CART algorithm (Classification And Regression Trees) to determine the best split among those features.

 Trees are grown to full depth and not pruned.

5. Prediction

 For classification:

o Each tree gives a class vote.

o The class with the majority vote is the final prediction.

 For regression:

o Each tree predicts a numeric value.

o The average of all predictions is taken.
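
A simplified sketch of the training and voting loop (the `build_tree` callable is assumed; note that a real random forest samples a feature subset at every node, whereas this sketch samples one subset per tree for brevity):

```python
import random
from collections import Counter

def random_forest_train(D, feature_names, k, build_tree, num_features, seed=0):
    """Train k trees, each on a bootstrap sample, passing each tree-builder a
    random subset of num_features features to consider.
    build_tree(sample, features) is an assumed callable returning a tree(x) callable."""
    rng = random.Random(seed)
    d = len(D)
    forest = []
    for _ in range(k):
        sample = [D[rng.randrange(d)] for _ in range(d)]     # bootstrap sample
        features = rng.sample(feature_names, num_features)   # random feature subset
        forest.append(build_tree(sample, features))
    return forest

def random_forest_predict(forest, x):
    """Classification: majority vote across trees."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]
```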
