EBUS537 Theme4 Week 5
Partially based on Tan, P.N., Steinbach, M., and Kumar, V. (2014) Introduction to Data Mining, Pearson
Learning Outcomes
• Introduce classification techniques & applications
• Illustrate decision tree classification technique
• Appreciate advantages and disadvantages of decision tree methods
• Explain Hunt's algorithm
• Determine how to split the records
– Specify the attribute test condition
– Impurity measures: Gini, Entropy, Misclassification Error
– Gain, Information Gain, Gain Ratio
• Determine when to stop splitting
• Understand feature importance & Gini importance
• Discuss practical issues of classification
• Random forest
• Measure classification models
• An application of classification predictive analytics at a seaport
[Table: training records; the eagle is the only warm-blooded training record that does not hibernate]
Training error = 0%
Test error = 30%
There is only one warm-blooded training record that does not hibernate: the eagle. Warm-blooded, non-hibernating test instances such as humans, elephants, and dolphins are therefore misclassified.
For a tree with 30 leaf nodes and 10 errors on a training dataset of 1000 instances:
Training error = 10/1000 = 1%
Pessimistic generalization error = (10 + 30 × 0.5)/1000 = 2.5%, using a penalty of 0.5 per leaf node
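A minimal sketch of these estimates in Python; the function names are illustrative, and the 0.5 penalty per leaf node follows the example above:

```python
def optimistic_error(train_errors, n_instances):
    # Optimistic estimate: assume generalization error equals the training error.
    return train_errors / n_instances

def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    # Pessimistic estimate: add a complexity penalty for each leaf node.
    return (train_errors + penalty * n_leaves) / n_instances

# The example above: 10 training errors, 30 leaf nodes, 1000 training instances.
print(optimistic_error(10, 1000))        # 0.01  -> 1% training error
print(pessimistic_error(10, 30, 1000))   # 0.025 -> 2.5% pessimistic estimate
```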
Tan, P.N., Steinbach, M., and Kumar, V. (2014) Introduction to Data Mining, Pearson
Occam’s Razor
• Occam's razor is the philosophical principle that, given two models with similar generalization errors, the simpler model is preferable to the more complex one.
[Figure: two candidate decision trees built from the attributes Carrier (G6/O3), GT (High/Medium/Low), and Speed (High/Medium/Low), with Yes/No leaf labels]
Q. What are the optimistic and pessimistic generalization errors (GE) of the two trees?
Precision is the number of true positives divided by the total number of cases predicted as the positive class.
Recall is the number of true positives divided by the total number of cases that actually belong to the positive class.
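A minimal sketch of both measures in Python; the confusion-matrix counts are hypothetical and only illustrate the formulas:

```python
def precision_recall(tp, fp, fn):
    # Precision: true positives over everything the model labels as positive (tp + fp).
    precision = tp / (tp + fp)
    # Recall: true positives over everything that is actually positive (tp + fn).
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
p, r = precision_recall(tp=40, fp=10, fn=20)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.80, recall=0.67
```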
An application of classification predictive analytics at a seaport
Port of Felixstowe
“Ports and terminals do not know what the next mode of transport is
when containers are unloaded from vessels, which prevents the terminal
operators from optimising yard operations” -- The secretary general of the
Federation of Private Port Operators
[Figure: predictive scenario layout with rail out-terminals R0 and R1]
• Keep D00 and D01 the same as in the BAU scenario;
• Route predicted R1 containers (PR1) to Yard 1;
• Route predicted R0 containers (PR0) to Yard 0;
• Route predicted H containers to meet D00/D01 (see the sketch below).
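A minimal sketch of such a yard-allocation rule in Python; the class labels R0, R1, and H follow the slide, but the function and routing strings are hypothetical, not the model used in the study:

```python
def allocate_yard(predicted_class):
    # Route a container to a yard according to its predicted next mode of transport.
    if predicted_class == "R1":   # predicted Rail 1 container (PR1)
        return "Yard 1"
    if predicted_class == "R0":   # predicted Rail 0 container (PR0)
        return "Yard 0"
    # Predicted haulier (H) containers are placed to meet demands D00/D01,
    # which are kept the same as in the BAU scenario.
    return "Yard 0 or Yard 1 (to meet D00/D01)"

for label in ["R1", "R0", "H"]:
    print(label, "->", allocate_yard(label))
```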
[Figure: terminal layout showing container flows from the quayside to Yard 0 and Yard 1, and then out via Rail 0/Rail 1 and Haulier 0/Haulier 1 (demands D00, D01; flows Dh00, Dr00, Dr01)]
Cyr01 = Cyr10 (£)                     20         30         40         50         60
Cost reduction in Scenario 2 (£)      469,280    885,553    1,301,825  1,718,098  2,134,371
Cost reduction in Scenario 2 (%)      14.9       21.53      25.63      28.43      30.45
[Table: sensitivity analysis over Cyr01 (£), prediction accuracy Pred (%), and misclassification rates a21 (%), a31 (%), a12 (%), a32 (%), a13 (%), a23 (%)]
Assume that the missed rail containers will be delivered by external trucks
Summary
• Using the predictive classification model can significantly reduce costs and emissions compared with the business-as-usual (BAU) scenario.
• When Cyr01 (Cyr10) varies from £20 to £60, the predictive model can reduce costs by 14.9% to 30.45% and emissions by 22.41%.
• Misclassification errors are inevitable with virtually any classification model and should be measured.
• The results and the model are robust to misclassification errors.
• The predictive model is easy to implement with low risk.
Further Reading
• Safavian, S.R., and Landgrebe, D. (1991). A survey of decision tree classifier
methodology, IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660-674.
• Sharma, H. and Kumar, S. (2016). A Survey on Decision Tree Algorithms of
Classification in Data Mining, International Journal of Science and Research, 5(4),
2094-2097.
• Lin, L.H., Chen, K.K. and Chiu, R.H. (2017). Predicting customer retention likelihood
in the container shipping industry through the decision tree approach, Journal of
Marine Science and Technology, 25(1), 23-33.
• Kulkarni, V.Y., Sinha, P.K. and Petare, M.C. (2016). Weighted hybrid decision tree model for random forest classifier, J. Inst. Eng. India Ser. B, 97, 209-217.
• Pani, C, Fadda, P., Fancello, G., Frigau, L. and Mola, F. (2014). A data mining
approach to forecast late arrivals in a transhipment container terminal, Transport,
29, 175-184.
• Jung, K., Kashyap, S., Avati, A., et al. (2021). A framework for making predictive
models useful in practice, Journal of the American Medical Informatics Association,
28(6), 1149–1158.