EBUS537 Theme4 Week 5

This document discusses classification techniques and practical issues in classification. It introduces decision tree classification and explains concepts like underfitting, overfitting, generalization error, and Occam's razor. Examples are provided to illustrate underfitting due to insufficient data or features, and overfitting due to noise. Methods for estimating generalization error from the training dataset are also presented.

22/10/2023

Classification: Practical Issues & Application (Theme 4-3)
Prof. Dongping Song
University of Liverpool Management School
Email: [email protected]

Partially based on Tan, P.N., Steinbach, M., and Kumar, V. (2014) Introduction to Data Mining, Pearson

Learning Outcomes
• Introduce classification techniques & applications
• Illustrate decision tree classification technique
• Appreciate the advantages and disadvantages of decision tree methods
• Explain Hunt's algorithm
• Determine how to split the records
– Specify the attribute test condition
– Impurity measures: Gini, Entropy, Misclassification Error
– Gain, Information Gain, Gain Ratio
• Determine when to stop splitting
• Understand feature importance & Gini importance
• Discuss practical issues of classification
• Random forest
• Measure classification models
• An application of classification predictive analytics at seaport


Practical Issues of Classification


• Underfitting
• Overfitting
• Generalization Error
• Occam’s Razor
• Pruning
• Data preparation: Noise, Missing values

Q. What is the training error of a classification model?

Q. What is the testing error of a classification model?

Underfitting & Overfitting in Decision Tree

Underfitting: both training and test errors are large.
Overfitting: the model loses some generalization capability (low training error, large test error).


Underfitting & Overfitting

[Figure: three fitted models illustrating overfitting, underfitting, and a good balance]

Underfitting due to Insufficient Data or Features

Example 1. [Figure: scatter plot where the available features cannot separate the classes]

Example 2. Predict a student’s college GPA simply from the student’s SAT score:

    College GPA = Θ * (SAT Score)


Overfitting Due to Noise

The decision boundary is distorted by a noise point.

Underfitting due to Lack of Samples: An Example Training Set

Name        Body Temp.  Gives Birth  4-legged  Hibernates  Class Label (Mammal)
Salamander  Cold-bld    No           Yes       Yes         No
Guppy       Cold-bld    Yes          No        No          No
Eagle       Warm-bld    No           No        No          No
Poorwill    Warm-bld    No           No        Yes         No
Platypus    Warm-bld    No           Yes       Yes         Yes


An Example Test Set to Classify Mammals

Name            Body Temp.  Gives Birth  Four-legged  Hibernates  Class Label (Mammal)
Human           Warm-bld    Yes          No           No          Yes
Pigeon          Warm-bld    No           No           No          No
Elephant        Warm-bld    Yes          Yes          No          Yes
Leopard shark   Cold-bld    Yes          No           No          No
Turtle          Cold-bld    No           Yes          No          No
Penguin         Cold-bld    No           No           No          No
Eel             Cold-bld    No           No           No          No
Dolphin         Warm-bld    Yes          No           No          Yes
Spiny anteater  Warm-bld    No           Yes          Yes         Yes
Gila monster    Cold-bld    No           Yes          Yes         No

(The class label is what the model has to determine.)

Underfitting due to Lack of Samples

Training error = 0%; test error = 30%.
Humans, elephants and dolphins are misclassified because there is only one training record that is warm-blooded and does not hibernate, namely the eagle (a non-mammal).


Overfitting Due to Noise: An Example Training Set to Classify Mammals

Name            Body Temp.  Gives Birth  4-legged  Hibernates  Class Label (Mammal)
Porcupine       Warm-bld    Yes          Yes       Yes         Yes
Cat             Warm-bld    Yes          Yes       No          Yes
Bat             Warm-bld    Yes          No        Yes         No*
Whale           Warm-bld    Yes          No        No          No*
Salamander      Cold-bld    No           Yes       Yes         No
Komodo dragon   Cold-bld    No           Yes       No          No
Python          Cold-bld    No           No        Yes         No
Salmon          Cold-bld    No           No        No          No
Eagle           Warm-bld    No           No        No          No
Guppy           Cold-bld    Yes          No        No          No

Note: Asterisks denote mislabelled records (noise)

Overfitting Due to Noise: An Example Test Set to Classify Mammals

(The test set is the same ten animals as in the underfitting example above.)


Overfitting Due to Noise

[Figure: two decision trees built from the noisy training set; the mislabelled bat and whale records drive the extra splits]

Model 1 (fully grown): test error = 30% (misclassifies humans, dolphins, and the spiny anteater).
Model 2 (simpler): training error = 20%; test error = 10% (misclassifies the spiny anteater).

Overfitting & Generalization Error

• Overfitting results in decision trees that are more complex than necessary.

• When there is noise, training error no longer provides a good estimate of how well the tree will perform on previously unseen records.

• Generalization error is a measure of how accurately a model is able to predict outcome values for unseen data.

• We need methods to estimate generalization error. How?


Estimating Generalization Error

• Estimate generalization error based on the training dataset:
  – Optimistic approach: e’(Model) = e(Model), i.e. the training error itself
  – Pessimistic approach: e’(Model) = e(Model) + a*Complexity(Model)
  – Pessimistic approach for decision trees:
    • Penalty for each leaf node: e’(t) = e(t) + 0.5
    • Total errors: e’(T) = e(T) + N_L × 0.5, where N_L = number of leaf nodes
    • GE in %: e’(Model) = (e(T) + N_L × 0.5) / S, where S = number of training samples

For a tree with 30 leaf nodes and 10 errors on a training dataset with 1000 instances:
  – Training error = 10/1000 = 1%
  – Generalization error = (10 + 30×0.5)/1000 = 2.5%

Tan, P.N., Steinbach, M., and Kumar, V. (2014) Introduction to Data Mining, Pearson
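The pessimistic estimate can be checked with a few lines of Python (a minimal sketch; the function and variable names are my own):

```python
def pessimistic_generalization_error(train_errors, n_leaves, n_samples, penalty=0.5):
    """Pessimistic estimate: e'(Model) = (e(T) + N_L * penalty) / S.

    train_errors: number of misclassified training records, e(T)
    n_leaves:     number of leaf nodes in the tree, N_L
    n_samples:    number of training records, S
    penalty:      per-leaf complexity penalty (0.5 in the slides)
    """
    return (train_errors + n_leaves * penalty) / n_samples

# The slide's example: 30 leaf nodes, 10 training errors, 1000 instances.
training_error = 10 / 1000                                  # optimistic: 1%
gen_error = pessimistic_generalization_error(10, 30, 1000)  # pessimistic: 2.5%
```

Note how the per-leaf penalty makes a bushy tree look worse than its raw training error suggests, which is exactly the point of the pessimistic approach.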

Occam’s Razor
• Occam's razor is the philosophical principle that, given two models with similar generalization errors, the simpler model is preferable to the more complex one.

• Einstein's razor: the principle of making scientific models as simple as possible, but not simpler.

"Everything should be made as simple as possible, but not simpler" -- Albert Einstein


From Theory to Practice

Let’s look at how to turn these model selection criteria into practice to avoid overfitting.

Decision Tree Pruning Methodologies


• Pre-pruning (top-down)
– Stopping criteria while growing the tree
• Post-pruning (bottom-up)
– Grow the tree, then prune
– More popular

How to Address Overfitting

• Pre-pruning (early termination rule)
  Enforce more restrictive conditions before the tree becomes fully grown, e.g.
  – Stop if the number of instances is less than some user-specified threshold
  – Stop if the class distributions of instances are independent of the available features (e.g., using the χ² test)
  – Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
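Two of these stopping criteria (the instance-count threshold and the impurity-gain threshold) can be sketched in Python using the Gini impurity; all names and thresholds below are my own choices for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum(p_i^2) over class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def should_stop(labels, split, min_instances=5, min_gain=0.01):
    """Pre-pruning: stop if the node is too small, or if the proposed
    split (a list of label partitions) barely reduces the Gini impurity."""
    if len(labels) < min_instances:
        return True
    n = len(labels)
    weighted = sum(len(part) / n * gini(part) for part in split)
    return gini(labels) - weighted < min_gain

node = ["Yes"] * 6 + ["No"] * 6
perfect_split = [["Yes"] * 6, ["No"] * 6]                              # removes all impurity
useless_split = [["Yes"] * 3 + ["No"] * 3, ["Yes"] * 3 + ["No"] * 3]   # no gain at all
```

With these thresholds, the perfect split is accepted (keep growing) while the useless split triggers early termination.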


How to Address Overfitting

• Post-pruning
  – Grow the decision tree to its entirety
  – Trim the nodes of the decision tree in a bottom-up fashion
  – If the generalization error improves after trimming, replace the sub-tree by a leaf node
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree (majority voting)
  – Trim the small leaf nodes
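The pruning decision can be sketched by comparing pessimistic error estimates before and after collapsing a sub-tree into a single leaf (a minimal illustration; the helper names are my own, and the 0.5 per-leaf penalty follows the earlier slide):

```python
def pessimistic_errors(train_errors, n_leaves, penalty=0.5):
    """Pessimistic error count for a (sub)tree: e(T) + N_L * penalty."""
    return train_errors + n_leaves * penalty

def should_prune(subtree_errors, subtree_leaves, leaf_errors):
    """Post-pruning: replace the sub-tree by a single leaf (labelled with
    the majority class) if the pessimistic estimate does not get worse."""
    return pessimistic_errors(leaf_errors, 1) <= pessimistic_errors(
        subtree_errors, subtree_leaves)

# A sub-tree with 4 leaves and 2 training errors vs. collapsing it to one
# leaf that would make 3 errors: 3 + 0.5 = 3.5 <= 2 + 4*0.5 = 4.0, so prune.
decision = should_prune(subtree_errors=2, subtree_leaves=4, leaf_errors=3)
```

This is why the pessimistic penalty matters: on raw training errors alone (3 > 2) the sub-tree would never be pruned.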

Ship Delay Example: Post-pruning

[Figure: a fully grown decision tree splitting on Carrier (G6/O3), GT (Low/Medium/High) and Speed (Low/Medium/High), and a pruned tree in which several sub-trees have been replaced by leaf nodes]

Q. What are the optimistic GE and pessimistic GE of the two trees?


Other Practical Issues in DT

• Data fragmentation → prune the decision tree
• Splitting condition → greedy strategy
• Interpreting large-sized trees → post-pruning
• Tree replication → create new attributes
• Discrete-valued functions → combine with regression
• Search strategy → optimise the search
  – Bottom-up methods
  – Bi-directional methods

Random Forest in Machine Learning

Random Forest is a robust machine learning algorithm. It consists of multiple decision trees, each of which produces its own prediction, like a committee.

Combining learning models improves the overall result.


Random Forest Classifier

• Random forest adds additional randomness to the model:
  – it searches for the best feature among a random subset of features
  – this results in a wide diversity of trees
• The hyperparameters in random forest:
  – the number of trees
  – the number of features to be considered for each tree
  – the split criterion (e.g. Gini)
  – the maximum depth of the trees
  – the minimum number of instances in leaf nodes
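The committee idea can be illustrated with a toy bagging ensemble of one-split "stumps" on a single feature. This is a pure-Python sketch of bootstrap sampling plus majority voting, not the actual random forest algorithm; all names, data, and the threshold rule are invented:

```python
import random
from collections import Counter

def stump(threshold):
    """A one-split 'tree': predicts class 1 if x >= threshold, else 0."""
    return lambda x: 1 if x >= threshold else 0

def train_forest(data, n_trees=9, seed=42):
    """Grow each 'tree' on a bootstrap sample of (x, label) pairs, placing
    its threshold at the midpoint between the two class means of the sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap: sample with replacement
        pos = [x for x, y in sample if y == 1] or [x for x, y in data if y == 1]
        neg = [x for x, y in sample if y == 0] or [x for x, y in data if y == 0]
        forest.append(stump((sum(pos) / len(pos) + sum(neg) / len(neg)) / 2))
    return forest

def predict(forest, x):
    """Majority vote across the committee of trees."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

data = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]
forest = train_forest(data)
```

A real random forest would additionally sample a random subset of features at every split, which is what the first hyperparameter bullet above controls.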

Measuring Classification Model

• The confusion matrix summarises the test results for a classification model:

             Predicted N       Predicted P
  Actual N   True negative     False positive
  Actual P   False negative    True positive

• Accuracy = (TP+TN) / Total
• Error rate = (FP+FN) / Total
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F-score = 2*Precision*Recall / (Precision + Recall)

Precision is true positives divided by the total cases labelled as the positive class.
Recall is true positives divided by the total cases that are actually the positive class.


Measuring Classification Model

Example 1: TP=30; TN=930; FP=30; FN=10.

             Predicted N   Predicted P
  Actual N   TN = 930      FP = 30
  Actual P   FN = 10       TP = 30

• Accuracy = (TP+TN)/1000 = 0.96
• Error rate = (FP+FN)/1000 = 0.04
• Precision = TP / (TP + FP) = 0.5
• Recall = TP / (TP + FN) = 0.75
• F-score = 2*0.5*0.75/(0.5+0.75) = 0.6

Example 2: TP=20; TN=940; FP=20; FN=20.

             Predicted N   Predicted P
  Actual N   TN = 940      FP = 20
  Actual P   FN = 20       TP = 20

• Accuracy = (TP+TN)/1000 = 0.96
• Error rate = (FP+FN)/1000 = 0.04
• Precision = TP / (TP + FP) = 0.5
• Recall = TP / (TP + FN) = 0.5
• F-score = 2*0.5*0.5/(0.5+0.5) = 0.5

Q: What are the implications of the two predictive models and their metrics?
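The two worked examples can be reproduced with a small helper (a sketch; the function name is my own):

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from a binary confusion matrix."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
    }

model_1 = classification_metrics(tp=30, tn=930, fp=30, fn=10)
model_2 = classification_metrics(tp=20, tn=940, fp=20, fn=20)
```

Both models have identical accuracy and error rate, yet model 1 has the higher recall and F-score, which is the point of the question above: accuracy alone hides the difference when classes are imbalanced.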

Learning Outcomes
• Introduce classification techniques & applications
• Illustrate decision tree classification technique
• Appreciate the advantages and disadvantages of decision tree methods
• Explain Hunt's algorithm
• Determine how to split the records
– Specify the attribute test condition
– Impurity measures: Gini, Entropy, Misclassification Error
– Gain, Information Gain, Gain Ratio
• Determine when to stop splitting
• Understand feature importance & Gini importance
• Discuss practical issues of classification
• Random forest
• Measure classification models
• An application of classification analytics at seaport


Apply classification predictive analytics to a real case
• Song, D.P. and Xie, Y. (2022). Digitalisation for operational efficiency and
GHG emission reduction at container ports, funded by EPSRC, UK.

Port of Felixstowe


Predicting out-terminals of import containers at seaports through data analytics

The CRISP-DM Process Model

CRISP=Cross-Industry Standard Process


Mariscal et al (2010). A survey of data mining and knowledge discovery process models and methodologies, Knowledge Engineering
Review, 25, 137-166.


CRISP-DM Phase 1: Business Understanding -- The Need & Objective

• Data from the TOS (terminal operating system)
• Two container yards
• Multiple out-terminals (road and rail out-terminals)
• Missing rail services is costly
• Where to store containers when they are unloaded from vessels
• Operational efficiency & GHG emissions

“Ports and terminals do not know what the next mode of transport is when containers are unloaded from vessels, which prevents the terminal operators from optimising yard operations” -- the secretary general of the Federation of Private Port Operators

CRISP-DM Phase 1: Business Understanding – Business-as-Usual

[Figure: the total import volume D arrives at the quayside and is split into randomly chosen sub-flows D00 (to Yard 0) and D01 (to Yard 1). Yard 0 serves haulier out-terminal H0 (Haulier 0) and rail out-terminal R0 (Rail 0); Yard 1 serves haulier out-terminal H1 (Haulier 1) and rail out-terminal R1 (Rail 1).]


CRISP-DM Phase 1: Business Understanding -- Context

• Rail containers must be collected from a rail out-terminal designated by the customers;
• Haulier containers are, in principle, collected from the haulier out-terminal near their storage yard, as determined by the terminal operator;
• There is a heavy penalty for containers missing rail services.

Cross-yard movements can lead to a higher probability of missing rail services and lower productivity of the rail-mounted cranes at the rail terminals.

CRISP-DM Phase 2: Data Understanding – Raw Data Samples


CRISP-DM Phase 2: Data Understanding -- Dataset

• The cleaned dataset D = 601,793 records
• 70% of containers moved from Quay → Yard 0 (D00 = 421,297)
• 30% moved from Quay → Yard 1 (D01 = 180,496)
• Modal split: 74% by road hauliers, 26% by rail
• For rail containers, 79% are collected from R1, 21% from R0

• The predictive task: classify the import containers into three classes, H (Haulier), R0 and R1.

Cleaning step: remove rows with missing, meaningless, or repeated data.

CRISP-DM Phase 2: Data Understanding – Predictive Scenario

[Figure: the same quayside/yard/out-terminal layout as the BAU diagram, now driven by predictions]

• Keep D00 and D01 the same as in the BAU scenario;
• Predicted R1 containers (PR1) go to Yard 1;
• Predicted R0 containers (PR0) go to Yard 0;
• Predicted H containers make up the remainder of D00/D01.


CRISP-DM Phase 2: Data Understanding -- Features

1. cntr_height_in_feet
2. cntr_length_in_feet
3. cntr_width_in_feet
4. full_empty_indr
5. owner
6. SOA
7. service_code
8. voyid_vessel_code
9. voyid_voyage_code
10. origin_port
11. load_port
12. gross_weight_documented
13. act_out_owner
14. gross_weight_measured_yc
15. gross_weight_measured_wb
16. general_cargo_content
17. reefer
18. out_of_gauge
19. dangerous
20. cms_no
21. SITC
22. cntr_status
23. bill_of_lading_no
24. cntr_id

CRISP-DM Phase 3: Data Preparation – Feature Selection

• Features that contain repeated or uninformative content are removed, e.g.
  o cntr_id and bill_of_lading_no are high-cardinality features under which each container is assigned a unique ID number;
  o cntr_status takes a single value for all the containers and so carries zero variance.
• Feature selection methods: Filter and Wrapper.


CRISP-DM Phase 3: Data Preparation – Feature Selection

• Filter method:
  – ANOVA and the chi-squared test were chosen to examine the association between the input features and the target variable.
• Wrapper method:
  – The Boruta algorithm is a wrapper built around a Random Forest classifier.
• Following the chi-squared test, the ANOVA test and the Boruta algorithm, 8 features are dropped: ‘full_empty_indr’, ‘SOA’, ‘load_port’, ‘cntr_width_in_feet’, ‘gross_weight_measured_wb’, ‘gross_weight_measured_yc’, ‘out_of_gauge’ and ‘cms_no’.
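The filter idea can be illustrated with a hand-rolled Pearson chi-squared statistic for a small contingency table of feature value against target class. The tables below are made up, and a real analysis would also compare the statistic against the χ² distribution to obtain a p-value:

```python
def chi_squared(table):
    """Pearson's chi-squared statistic for a contingency table
    (rows = feature values, columns = target classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# A hypothetical feature clearly associated with the target...
dependent = chi_squared([[10, 20], [20, 10]])
# ...and one that is perfectly independent of it (statistic = 0).
independent = chi_squared([[15, 15], [15, 15]])
```

Features whose statistic is near zero carry no association with the target and are candidates for dropping, which is what the filter step above does at scale.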

CRISP-DM Phase 3: Data Preparation – Feature Engineering

• The feature “general_cargo_content” is a text document feature, which forms a 1×98,271 vector.
• Feature engineering: transform the original high-dimensional features and create new low-dimensional features;
• The Standard International Trade Classification (SITC) is a product classification of the UN used for international trade.

Example values of general_cargo_content:
  CERAMIC TILES
  PLASTIC BAGS
  DIBASIC ESTER
  TRIMETHYL PHOSPHITE
  MALEIC ANHYDRIDE
  PROPYLENE GLYCOL USP GRADE
  TRIMETHYL PHOSPHITE
  PAPER, PAPERBOARD, PACKING MATERIAL


CRISP-DM Phase 3: Data Preparation – Container Content Classification

• Lack of labelled training documents
• Unsupervised text classification using GloVe word embeddings and cosine similarity
• Two annotators manually labelled 303 unique content types transported by 141,427 containers
• Classifications were updated based on the manually annotated content
• Classification performance was evaluated against the manually annotated content
• Accuracy score of 80%
• The content classification process took over two months to complete
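A minimal sketch of the embedding-plus-cosine-similarity idea follows. The 3-dimensional "embeddings" and class prototypes are invented for illustration only; the project used real GloVe vectors and SITC-based classes:

```python
import math

# Toy 3-d vectors standing in for GloVe word embeddings (invented).
embeddings = {
    "ceramic": [0.9, 0.1, 0.0], "tiles": [0.8, 0.2, 0.0],
    "plastic": [0.1, 0.9, 0.1], "bags":  [0.2, 0.8, 0.0],
    "paper":   [0.1, 0.1, 0.9],
}
# One anchor vector per (hypothetical) SITC-style class.
class_prototypes = {
    "mineral_manufactures": [1.0, 0.0, 0.0],
    "plastics":             [0.0, 1.0, 0.0],
    "paper_products":       [0.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(content):
    """Average the word vectors of the content description, then pick the
    class whose prototype has the highest cosine similarity."""
    words = [embeddings[w] for w in content.lower().split() if w in embeddings]
    doc = [sum(col) / len(words) for col in zip(*words)]
    return max(class_prototypes, key=lambda c: cosine(doc, class_prototypes[c]))
```

Because no labelled documents are needed (only word vectors and class anchors), this style of classification is unsupervised, which is why manual annotation was used afterwards only to evaluate and correct it.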

CRISP-DM Phase 4: Modelling – Tree-based Models

• The tree-based models XGBoost and Random Forest were selected to build classifiers.
• Random Forest is a bagging ensemble machine learning model that aims to reduce variance.
• XGBoost is a gradient boosting ensemble model that reduces both variance and bias.
• The dataset was split into training and test data at 75:25.
• A grid search with 5-fold cross-validation was applied in the training process to tune the hyperparameters of the two classifier models.
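The grid search with k-fold cross-validation can be written as a plain loop. Below is a skeleton with a toy one-threshold "classifier" standing in for the real models; the function names, the grid, and the data are all my own:

```python
from itertools import product

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search_cv(data, grid, fit_score, k=5):
    """For every combination in the grid, average the score over k held-out
    folds and return the best (params, mean_score) pair."""
    best = None
    for combo in product(*grid.values()):
        params = dict(zip(grid, combo))
        scores = []
        for fold in k_fold_indices(len(data), k):
            held_out = set(fold)
            train = [d for i, d in enumerate(data) if i not in held_out]
            test = [data[i] for i in fold]
            scores.append(fit_score(params, train, test))
        mean = sum(scores) / len(scores)
        if best is None or mean > best[1]:
            best = (params, mean)
    return best

def fit_score(params, train, test):
    """Stand-in 'model': predict 1 when x >= threshold (no training needed)."""
    correct = sum((x >= params["threshold"]) == bool(y) for x, y in test)
    return correct / len(test)

data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
best_params, best_score = grid_search_cv(data, {"threshold": [2, 10, 17]}, fit_score)
```

Swapping the toy `fit_score` for a function that fits Random Forest or XGBoost with `params` on `train` and scores it on `test` gives exactly the tuning loop described in the bullet above.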


CRISP-DM Phase 5: Evaluation – Precision, Recall and F1 Score

Feature set       Classifier      Class  Precision  Recall  F1 score  Weighted F1
With “SITC”       Random Forest   R1     0.63       0.44    0.52
                                  R0     0.57       0.31    0.40      0.77
                                  H      0.82       0.93    0.87
Without “SITC”    Random Forest   R1     0.44       0.37    0.40
                                  R0     0.32       0.23    0.26      0.70
                                  H      0.79       0.84    0.82
With “SITC”       XGBoost         R1     0.64       0.45    0.53
                                  R0     0.60       0.33    0.43      0.78
                                  H      0.83       0.92    0.87
Without “SITC”    XGBoost         R1     0.47       0.33    0.38
                                  R0     0.38       0.20    0.26      0.71
                                  H      0.79       0.88    0.83

Precision = how many retrieved items are relevant = TP/(predicted positive)
Recall = how many relevant items are retrieved = TP/(actual positive)

CRISP-DM Phase 5: Evaluation – Confusion Matrix

               Actual R1    Actual R0    Actual H
Predicted R1   a11=0.094    a12=0.006    a13=0.050
Predicted R0   a21=0.004    a22=0.016    a23=0.008
Predicted H    a31=0.114    a32=0.031    a33=0.676


CRISP-DM Phase 6: Deployment – Prediction Scenario Analysis

The total import container volume D is predicted into three classes:
• Predicted to Rail 1: PR1 = D*(a11+a12+a13);
• Predicted to Rail 0: PR0 = D*(a21+a22+a23);
• Predicted to Haul: PH = D*(a31+a32+a33).
Operational rules:
1. PR1 goes to Yard 1;
2. From PH, select (D01 – PR1) containers for Yard 1;
3. All other containers go to Yard 0;
4. Hauliers go to the correct yard for pick-up.

[Figure: predicted container flows from the quayside through Yard 0 (sub-flows Dh00 to Haulier 0 and Dr00 to Rail 0) and Yard 1 (sub-flow Dr01, serving Haulier 1 and Rail 1)]

CRISP-DM Phase 6: Deployment – Cost Analysis

Deploying the predictive model vs the business-as-usual scenario:

Cyr01 = Cyr10 (£)                   20       30       40         50         60
Cost reduction in Scenario 2 (£)    469,280  885,553  1,301,825  1,718,098  2,134,371
Cost reduction in Scenario 2 (%)    14.90    21.53    25.63      28.43      30.45

[Figure: bar chart of the percentage cost reduction at each cost level]


CRISP-DM Phase 6: Deployment – Sensitivity Analysis

Cost saving (%) achieved by the predictive model under a 20% increase in each of the six misclassification errors:

Cyr01 (£)  Pred (%)  a21 (%)  a31 (%)  a12 (%)  a32 (%)  a13 (%)  a23 (%)
20         14.9      14.68    12.33    14.6     14.78    14.7     14.91
30         21.53     21.19    17.59    21.07    21.33    21.22    21.54
40         25.63     25.22    20.86    25.07    25.4     25.25    25.64
50         28.43     27.97    23.08    27.8     28.16    28       28.44
60         30.45     29.96    24.68    29.78    30.17    30       30.47

CRISP-DM Phase 6: Deployment – Emission Analysis

Cyr01   BAU         Predictive   Reduction from BAU (%)
20      1,514,378   1,175,012    22.41
60      1,514,378   1,175,012    22.41

Parameters considered:
• Transport distances between yards and out-terminals
• Rail miss ratio within a yard
• Rail miss ratio across yards
• CO2 emission of an internal vehicle per km (kg CO2)
• CO2 emission of an external truck per trip (kg CO2)

Assume that missed rail containers are delivered by external trucks.


Summary
• Using the predictive classification model could significantly reduce costs & emissions compared to business-as-usual.
• When Cyr01 (Cyr10) varies from £20 to £60, the predictive model can reduce the cost by 14.9%–30.45% and emissions by 22.41%.
• Misclassification errors are inevitable with virtually any classification model, and should be measured.
• The results and the model are robust to the misclassification errors.
• The predictive model is easy to implement with low risk.

Further Reading
• Safavian, S.R., and Landgrebe, D. (1991). A survey of decision tree classifier
methodology, IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660-674.
• Sharma, H. and Kumar, S. (2016). A Survey on Decision Tree Algorithms of
Classification in Data Mining, International Journal of Science and Research, 5(4),
2094-2097.
• Lin, L.H., Chen, K.K. and Chiu, R.H. (2017). Predicting customer retention likelihood
in the container shipping industry through the decision tree approach, Journal of
Marine Science and Technology, 25(1), 23-33.
• Kulkarni, V.Y., Sinha, P.K. & Petare, M.C. (2016). Weighted hybrid decision tree
model for random forest classifier. J. Inst. Eng. India Ser. B. 97, 209–217.
• Pani, C., Fadda, P., Fancello, G., Frigau, L. and Mola, F. (2014). A data mining
approach to forecast late arrivals in a transhipment container terminal, Transport,
29, 175-184.
• Jung, K., Kashyap, S., Avati, A., et al. (2021). A framework for making predictive
models useful in practice, Journal of the American Medical Informatics Association,
28(6), 1149–1158.


Summary of Classification Modelling

1. What is the purpose of classification?
2. Where can it be applied?
3. What is a decision tree?
4. How to build a decision tree?
5. How to build a good decision tree?
6. How to measure feature importance & Gini importance?
7. What are the practical issues of classification?
8. How to make decision tree classification more robust?
9. How to measure whether a classification model is good?
10. How to apply classification predictive analytics to a real case?

Thank you for your attention!

Questions?

