Decision Tree
A decision tree is a hierarchical decision support model that uses a tree-like model of
decisions and their possible consequences, including chance event outcomes, resource
costs, and utility. It is one way to display an algorithm that only contains conditional
control statements.
Overview
A decision tree is a flowchart-like structure in which each internal node represents a
"test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch
represents the outcome of the test, and each leaf node represents a class label (the
decision taken after computing all attributes). The paths from root to leaf represent
classification rules.
In decision analysis, a decision tree and the closely related influence diagram are used
as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are
calculated.
Decision trees are commonly used in operations research and operations management. If, in practice, decisions have to be
taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best
choice model or online selection model algorithm. Another use of decision trees is as a descriptive means for calculating
conditional probabilities.
Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to
undergraduate students in schools of business, health economics, and public health, and are examples of operations research
or management science methods.
Decision-tree elements
Drawn from left to right, a decision tree has only burst nodes (splitting paths) but no sink nodes (converging paths). Used
manually, decision trees can therefore grow very big and are then often hard to draw fully by hand. Traditionally, decision
trees have been created manually, as the aside example shows, although increasingly, specialized software is employed.
Decision rules
The decision tree can be linearized into decision rules,[3] where the outcome is the contents of the leaf node, and the
conditions along the path form a conjunction in the if clause. In general, the rules have the form:
if condition1 and condition2 and condition3 then outcome.
Decision rules can be generated by constructing association rules with the target variable on the right. They can also denote
temporal or causal relations.[4]
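To make this form concrete, here is a minimal sketch in Python; the attributes (outlook, humidity, windy) and class labels are hypothetical and only illustrate how each root-to-leaf path becomes one rule whose if clause is the conjunction of the tests along that path.

# A hypothetical decision tree linearized into decision rules: each "if" clause
# is the conjunction of the tests along one root-to-leaf path, and the returned
# value is the class label stored in that leaf.
def classify(outlook: str, humidity: int, windy: bool) -> str:
    if outlook == "sunny" and humidity <= 70:
        return "play"
    if outlook == "sunny" and humidity > 70:
        return "don't play"
    if outlook == "overcast":
        return "play"
    if outlook == "rainy" and not windy:
        return "play"
    return "don't play"  # remaining path: rainy and windy

print(classify("sunny", 65, windy=False))  # -> play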
Commonly a decision tree is drawn using flowchart symbols as it is easier for many to read and understand. Note there is a
conceptual error in the "Proceed" calculation of the tree shown below; the error relates to the calculation of "costs" awarded
in a legal action.
Analysis example
Analysis can take into account the decision maker's (e.g., the company's) preference or utility function, for example:
The basic interpretation in this situation is that the company prefers B's risk and payoffs under realistic risk preference
coefficients (greater than $400K—in that range of risk aversion, the company would need to model a third strategy, "Neither
A nor B").
Another example, commonly used in operations research courses, is the distribution of lifeguards on beaches (a.k.a. the
"Life's a Beach" example).[5] The example describes two beaches with lifeguards to be distributed on each beach. There is a
maximum budget B that can be distributed between the two beaches (in total), and using a marginal returns table, analysts can
decide how many lifeguards to allocate to each beach.
Lifeguard added to a beach Additional drownings prevented, beach #1 Additional drownings prevented, beach #2
First lifeguard 3 1
Second lifeguard 0 4
In this example, a decision tree can be drawn to illustrate the principles of diminishing returns on beach #1.
The decision tree illustrates that when sequentially distributing lifeguards, placing a first lifeguard on beach #1 would be
optimal if there is only the budget for 1 lifeguard. But if there is a budget for two guards, then placing both on beach #2
would prevent more overall drownings.
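A minimal sketch, assuming the table gives the additional drownings prevented by each successive lifeguard on a beach, that enumerates the possible allocations for budgets of one and two lifeguards:

# Marginal returns assumed from the table: the k-th lifeguard placed on a beach
# prevents marginal[beach][k-1] additional drownings.
marginal = {1: [3, 0], 2: [1, 4]}

def drownings_prevented(allocation):
    """allocation maps beach -> number of lifeguards placed there."""
    return sum(sum(marginal[beach][:n]) for beach, n in allocation.items())

for budget in (1, 2):
    # Enumerate every way to split the budget between the two beaches.
    options = [{1: n, 2: budget - n} for n in range(budget + 1)]
    best = max(options, key=drownings_prevented)
    print(budget, best, drownings_prevented(best))
# budget 1 -> {1: 1, 2: 0}, preventing 3; budget 2 -> {1: 0, 2: 2}, preventing 5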
Influence diagram
Much of the information in a decision tree can be represented more compactly as an influence diagram, focusing attention on
the issues and relationships between events.
The rectangle on the left represents a decision, the ovals
represent actions, and the diamond represents results.
Advantages and disadvantages
Among decision support tools, decision trees (and influence diagrams) have several advantages. Decision trees:
Are simple to understand and interpret. People are able to understand decision tree models after a brief
explanation.
Have value even with little hard data. Important insights can be generated based on experts describing a
situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
Help determine worst, best, and expected values for different scenarios.
Use a white box model. If a given result is provided by a model, the explanation for the result is easily
replicated by simple math.
Can be combined with other decision techniques.
Allow the actions of more than one decision-maker to be considered.
Disadvantages of decision trees include:
They are unstable, meaning that a small change in the data can lead to a large change in the structure of the
optimal decision tree.
They are often relatively inaccurate. Many other predictors perform better with similar data. This can be
remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is
not as easy to interpret as a single decision tree.
For data including categorical variables with different numbers of levels, information gain in decision trees is
biased in favor of those attributes with more levels.[8]
Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are
linked.
Optimizing a decision tree
Increasing the depth of the decision tree can cause:
Runtime issues
A decrease in accuracy in general
Issues from pure node splits while going deeper
The ability to test the differences in classification results when changing D, the depth of the tree, is therefore imperative. We
must be able to easily change and test the variables that could affect the accuracy and reliability of the decision-tree model.
The node splitting function used can have an impact on improving the accuracy of the decision tree. For example, using the
information-gain function may yield better results than using the phi function. The phi function is known as a measure of
“goodness” of a candidate split at a node in the decision tree. The information gain function is known as a measure of the
“reduction in entropy”. In the following, we will build two decision trees. One decision tree will be built using the phi
function to split the nodes and one decision tree will be built using the information gain function to split the nodes.
The main advantages and disadvantages of information gain and phi function
One major drawback of information gain is that it is biased toward features with many unique values: the
feature chosen as the next node in the tree tends to be the one with the most unique values.[11]
An advantage of information gain is that it tends to choose the most impactful features that are close to the
root of the tree. It is a very good measure for deciding the relevance of some features.
The phi function is also a good measure for deciding the relevance of some features based on "goodness".
The information gain function gives the information gain of a candidate split as the entropy of a node of the decision tree
minus the entropy of the candidate split at node t of the decision tree.
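A common way to write this, for a binary candidate split s at node t that sends fractions P_L and P_R of the samples to the left and right children t_L and t_R (the notation here is an assumed reconstruction, not necessarily the article's original symbols), is

$$\mathrm{Gain}(s, t) = H(t) - \bigl(P_L\,H(t_L) + P_R\,H(t_R)\bigr), \qquad H(t) = -\sum_{j} P(j \mid t)\,\log_2 P(j \mid t),$$

where j ranges over the class labels (here C and NC) and P(j | t) is the fraction of samples at node t with label j.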
The phi function measures the "goodness" of a candidate split: it is maximized when the chosen feature splits the samples in a
way that produces homogeneous splits that have around the same number of samples in each split.
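One standard form of this goodness-of-split measure, under the same assumed notation as above, is

$$\Phi(s, t) = 2\,P_L\,P_R \sum_{j} \bigl|\,P(j \mid t_L) - P(j \mid t_R)\,\bigr|,$$

which is largest when the split is balanced (P_L close to P_R) and the class distributions of the two children differ as much as possible.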
We will set D, which is the depth of the decision tree we are building, to three (D = 3). We also have the following data set of
cancer and non-cancer samples and the mutation features that the samples either have or do not have. If a sample has a feature
mutation then the sample is positive for that mutation, and it will be represented by one. If a sample does not have a feature
mutation then the sample is negative for that mutation, and it will be represented by zero.
To summarize, C stands for cancer and NC stands for non-cancer. The letter M stands for mutation, and if a sample has a
particular mutation it will show up in the table as a one and otherwise zero.
The sample data
Sample M1 M2 M3 M4 M5
C1 0 1 0 1 1
NC1 0 0 0 0 0
NC2 0 0 1 1 0
NC3 0 0 0 0 0
C2 1 1 1 1 1
NC4 0 0 0 1 0
Now, we can use the formulas to calculate the phi function values and information gain values for each M in the dataset.
Once all the values are calculated the tree can be produced. The first thing to be done is to select the root node. In information
gain and the phi function we consider the optimal split to be the mutation that produces the highest value for information gain
or the phi function. Now assume that M1 has the highest phi function value and M4 has the highest information gain value.
The M1 mutation will be the root of our phi function tree and M4 will be the root of our information gain tree. You can
observe the root nodes below
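As a minimal sketch (not code from the article), the following Python computes both scores for every mutation in the sample data table using the formulas above; the mutation vectors and class labels are taken directly from the table.

from math import log2

# Sample data table: mutation vectors (M1..M5) and class labels (C / NC).
samples = {
    "C1":  [0, 1, 0, 1, 1], "NC1": [0, 0, 0, 0, 0], "NC2": [0, 0, 1, 1, 0],
    "NC3": [0, 0, 0, 0, 0], "C2":  [1, 1, 1, 1, 1], "NC4": [0, 0, 0, 1, 0],
}
labels = {name: ("C" if name.startswith("C") else "NC") for name in samples}

def entropy(names):
    # Entropy of the class labels among the given samples.
    probs = [sum(labels[x] == c for x in names) / len(names) for c in ("C", "NC")]
    return -sum(p * log2(p) for p in probs if p > 0)

def scores(m):
    # Information gain and phi value of splitting on mutation index m (0-based).
    left = [x for x in samples if samples[x][m] == 1]   # samples with the mutation
    right = [x for x in samples if samples[x][m] == 0]  # samples without it
    p_l, p_r = len(left) / len(samples), len(right) / len(samples)
    gain = entropy(list(samples)) - (p_l * entropy(left) + p_r * entropy(right))
    q = sum(abs(sum(labels[x] == c for x in left) / len(left)
                - sum(labels[x] == c for x in right) / len(right))
            for c in ("C", "NC"))
    return gain, 2 * p_l * p_r * q

for m in range(5):
    gain, phi = scores(m)
    print(f"M{m + 1}: information gain = {gain:.3f}, phi = {phi:.3f}")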
Now, once we have chosen the root node we can split the samples into two groups based on whether a sample is positive or
negative for the root node mutation. The groups will be called group A and group B. For example, if we use M1 to split the
samples in the root node we get NC2 and C2 samples in group A and the rest of the samples NC4, NC3, NC1, C1 in group
B.
Disregarding the mutation chosen for the root node, proceed to place the next best features that have the highest values for
information gain or the phi function in the left or right child nodes of the decision tree. Once we choose the root node and the
two child nodes for the tree of depth = 3 we can just add the leaves. The leaves will represent the final classification decision
the model has produced based on the mutations a sample either has or does not have. The left tree is the decision tree we
obtain from using information gain to split the nodes and the right tree is what we obtain from using the phi function to split
the nodes.
Now assume the classification results from both trees are given using a confusion matrix.
Confusion matrix of the information gain tree:
Predicted: C Predicted: NC
Actual: C 1 1
Actual: NC 0 4
Confusion matrix of the phi function tree:
Predicted: C Predicted: NC
Actual: C 2 0
Actual: NC 1 3
When calculating accuracy, the tree built using information gain and the tree built using the phi function give the same result:
each classifies five of the six samples correctly, an accuracy of 5/6 (about 83%). When we classify the samples based on the
model using information gain we get one true positive, zero false positives, one false negative, and four true negatives. For
the model using the phi function we get two true positives, one false positive, zero false negatives, and three true negatives.
The next step is to evaluate the effectiveness of the decision tree using some key metrics that will be discussed in the
evaluating a decision tree section below. The metrics that will be discussed below can help determine the next steps to be
taken when optimizing the decision tree.
Other techniques
Building and optimizing a decision tree does not end there. There are many techniques for improving the decision tree
classification models we build. One of these techniques is building the decision tree model from a bootstrapped dataset. A
bootstrapped dataset helps remove the bias that occurs when building a decision tree model with the same data the model is
tested with. Random forests can also significantly improve the overall accuracy of the model being built. This method
generates many decisions from many decision trees and tallies up the votes from each decision tree to make the final
classification. There are many techniques, but the main objective is to test building your decision tree model in different
ways to make sure it reaches the highest performance level possible.
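A minimal sketch, assuming scikit-learn is available (the library, parameters, and the reuse of the tiny mutation table are illustrative choices, not prescribed by the article), of a single decision tree next to a random forest whose trees are fit on bootstrapped resamples and vote on the final classification:

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Rows are M1..M5 for C1, NC1, NC2, NC3, C2, NC4 (from the sample data table).
X = [
    [0, 1, 0, 1, 1],  # C1
    [0, 0, 0, 0, 0],  # NC1
    [0, 0, 1, 1, 0],  # NC2
    [0, 0, 0, 0, 0],  # NC3
    [1, 1, 1, 1, 1],  # C2
    [0, 0, 0, 1, 0],  # NC4
]
y = ["C", "NC", "NC", "NC", "C", "NC"]

# A single depth-limited decision tree.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A random forest: each tree is fit on a bootstrapped resample of the data and
# the final classification is the majority vote of the individual trees.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                random_state=0).fit(X, y)

print(tree.predict([[0, 0, 1, 1, 0]]))    # classify a new sample with one tree
print(forest.predict([[0, 0, 1, 1, 0]]))  # majority vote over the forest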
Evaluating a decision tree
Let us take the confusion matrix below. It shows that the decision tree classifier we built gave 11 true positives, 1 false
positive, 45 false negatives, and 105 true negatives.
Predicted: C Predicted: NC
Actual: C 11 45
Actual: NC 1 105
We will now calculate the values accuracy, sensitivity, specificity, precision, miss rate, false discovery rate, and false omission
rate.
Accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{11 + 105}{11 + 105 + 1 + 45} \approx 71.60\%$
Sensitivity (TPR, true positive rate):[12] $\text{TPR} = \frac{TP}{TP + FN} = \frac{11}{11 + 45} \approx 19.64\%$
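Using the standard definitions of the remaining metrics with the counts from the matrix above (TP = 11, FP = 1, FN = 45, TN = 105):

$$\text{Specificity (TNR)} = \frac{TN}{TN + FP} = \frac{105}{106} \approx 99.06\%, \qquad \text{Precision (PPV)} = \frac{TP}{TP + FP} = \frac{11}{12} \approx 91.67\%,$$

$$\text{Miss rate (FNR)} = \frac{FN}{FN + TP} = \frac{45}{56} \approx 80.36\%, \qquad \text{FDR} = \frac{FP}{FP + TP} = \frac{1}{12} \approx 8.33\%, \qquad \text{FOR} = \frac{FN}{FN + TN} = \frac{45}{150} = 30\%.$$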
Once we have calculated the key metrics we can draw some initial conclusions about the performance of the decision tree
model built. The accuracy that we calculated was 71.60%. This accuracy is a good start, but we would like to get our models
as accurate as possible while maintaining the overall performance. The sensitivity value of 19.64% means that only 19.64%
of the samples that were actually positive for cancer tested positive. The specificity value of 99.06% means that 99.06% of
the samples that were negative for cancer actually tested negative. When it comes to sensitivity and specificity it is important
to have a balance between the two values, so if we can decrease our specificity to increase the sensitivity that would prove to
be beneficial.[13] These are just a few examples of how to use these values and the meanings behind them to evaluate the
decision tree model and improve upon the next iteration.
See also
Behavior tree (artificial intelligence, robotics and control) – control method
Boosting (machine learning) – Method in machine learning
Decision cycle – Sequence of steps for decision-making
Decision list
Decision matrix
Decision table – concise visual representation for specifying which actions to perform depending on given
conditions
Decision tree model – Model of computational complexity of computation
Design rationale – explicit documentation of the reasons behind decisions made when designing a system
or artifact
DRAKON – Algorithm mapping tool
Markov chain – Random process independent of past history
Random forest – Tree-based ensemble machine learning method
Ordinal priority approach – Multiple-criteria decision analysis method
Odds algorithm – Method of computing optimal strategies for last-success problems
Topological combinatorics
Truth table – Mathematical table used in logic
References
1. von Winterfeldt, Detlof; Edwards, Ward (1986). "Decision trees". Decision Analysis and Behavioral Research. Cambridge University Press. pp. 63–89. ISBN 0-521-27304-8.
2. Kamiński, B.; Jakubczyk, M.; Szufel, P. (2017). "A framework for sensitivity analysis of decision trees". Central European Journal of Operations Research. 26 (1): 135–159. doi:10.1007/s10100-017-0479-6. PMC 5767274. PMID 29375266. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5767274
3. Quinlan, J. R. (1987). "Simplifying decision trees". International Journal of Man-Machine Studies. 27 (3): 221–234. CiteSeerX 10.1.1.18.4267. doi:10.1016/S0020-7373(87)80053-6.
4. Karimi, K.; Hamilton, H. J. (2011). "Generation and Interpretation of Temporal Decision Rules". International Journal of Computer Information Systems and Industrial Management Applications. Volume 3. https://arxiv.org/abs/1004.3334
5. Wagner, Harvey M. (1 September 1975). Principles of Operations Research: With Applications to Managerial Decisions (2nd ed.). Englewood Cliffs, NJ: Prentice Hall. ISBN 9780137095926. https://archive.org/details/principlesofoper00wagn
6. Quinlan, R. (1983). "Learning efficient classification procedures". In Michalski, Carbonell & Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann. pp. 463–482. doi:10.1007/978-3-662-12405-5_15. https://link.springer.com/chapter/10.1007%2F978-3-662-12405-5_15#page-1
7. Utgoff, P. E. (1989). "Incremental induction of decision trees". Machine Learning. 4 (2): 161–186. doi:10.1023/A:1022699900025
8. Deng, H.; Runger, G.; Tuv, E. (2011). "Bias of importance measures for multi-valued attributes and solutions". Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN). https://www.researchgate.net/publication/221079908
9. Larose, Daniel T.; Larose, Chantal D. (2014). Discovering Knowledge in Data. Hoboken, NJ: John Wiley & Sons. p. 167. ISBN 9780470908747.
10. Plapinger, Thomas (29 July 2017). "What is a Decision Tree?". Towards Data Science. https://towardsdatascience.com/what-is-a-decision-tree-22975f00f3e1. Archived (https://web.archive.org/web/20211210231954/https://towardsdatascience.com/what-is-a-decision-tree-22975f00f3e1) from the original on 10 December 2021. Retrieved 5 December 2021.
11. Tao, Christopher (6 September 2020). "Do Not Use Decision Tree Like This". Towards Data Science. https://towardsdatascience.com/do-not-use-decision-tree-like-this-369769d6104d. Archived (https://web.archive.org/web/20211210231951/https://towardsdatascience.com/do-not-use-decision-tree-like-this-369769d6104d) from the original on 10 December 2021. Retrieved 10 December 2021.
12. "False Positive Rate | Split Glossary". Split. https://www.split.io/glossary/false-positive-rate/. Retrieved 10 December 2021.
13. "Sensitivity vs Specificity". Analysis & Separations from Technology Networks. https://www.technologynetworks.com/analysis/articles/sensitivity-vs-specificity-318222. Retrieved 10 December 2021.
External links
Extensive Decision Tree tutorials and examples (http://www.public.asu.edu/~kirkwood/DAStuff/refs/decisiontrees/index.html)
Gallery of example decision trees (https://github.com/SilverDecisions/SilverDecisions/wiki/Gallery)
Gradient Boosted Decision Trees (https://blog.datarobot.com/gradient-boosted-regression-trees)