3. Decision Tree -1.Pptx

The document discusses supervised learning, particularly in applications such as predicting high-risk patients in emergency rooms and approving credit card applications. It explains the process of learning from labeled data to create classification models, emphasizing the importance of decision trees and information theory in this context. The document also highlights the distinction between supervised and unsupervised learning, detailing how decision trees are built and evaluated using measures like information gain and entropy.


SUPERVISED LEARNING

An example application

◻ An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.
◻ A decision is needed: whether to put a new patient in an intensive-care unit.
◻ Due to the high cost of the ICU, patients who may survive less than a month are given higher priority.
◻ Problem: predict high-risk patients and discriminate them from low-risk patients.
Another application

◻ A credit card company receives thousands of applications for new cards. Each application contains information about the applicant:
  age
  marital status
  annual salary
  outstanding debts
  credit rating
  etc.
◻ Problem: decide whether an application should be approved, or, equivalently, classify applications into two categories, approved and not approved.
Machine learning

◻ Like human learning from past experiences.
◻ A computer does not have “experiences”.
◻ A computer system learns from data, which represent some “past experiences” of an application domain.
◻ Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, and high-risk or low-risk.
◻ This task is commonly called supervised learning, classification, or inductive learning.
The data and the goal

◻ Data: a set of data records (also called examples, instances, or cases) described by
  k attributes: A1, A2, …, Ak, and
  a class: each example is labelled with a pre-defined class.
◻ Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
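
One way to picture such records is the minimal Python sketch below (not from the slides; the attribute names and values are hypothetical, loosely following the loan example used later). The same record format is reused in the later sketches in this document.

```python
# Each record: k attribute values plus a pre-defined "class" label.
# Attribute names and values here are hypothetical illustrations.
training_data = [
    {"age": "young",  "has_job": False, "own_house": False, "credit": "fair", "class": "No"},
    {"age": "middle", "has_job": True,  "own_house": True,  "credit": "good", "class": "Yes"},
    {"age": "old",    "has_job": False, "own_house": True,  "credit": "good", "class": "Yes"},
]

attributes = ["age", "has_job", "own_house", "credit"]   # A1 ... Ak
classes = {r["class"] for r in training_data}            # the pre-defined classes
```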
An example: data (loan application)

(The slide shows a table of 15 loan-application records, each labelled with the class: approved or not.)
An example: the learning task

◻ Learn a classification model from the data.
◻ Use the model to classify future loan applications into
  Yes (approved) and
  No (not approved).
◻ What is the class for the following case/instance? (A new, unlabelled record is shown on the slide.)
Supervised vs. unsupervised learning

◻ Supervised learning: classification is seen as supervised learning from examples.
  Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes, as if a “teacher” gives the classes (supervision).
  Test data are classified into these classes too.
◻ Unsupervised learning (clustering):
  Class labels of the data are unknown.
  Given a set of data, the task is to establish the existence of classes or clusters in the data.
Supervised learning process

■ Learning (training): learn a model using the training data.
■ Testing: test the model using unseen test data to assess the model’s accuracy.
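
As an illustration of this two-phase process (not part of the slides), here is a minimal scikit-learn sketch; the bundled iris data stands in for the loan table:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out unseen test data before any learning takes place.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning (training): fit a model on the training data only.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Testing: assess accuracy on the unseen test data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```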
What do we mean by learning?

◻ Given
  a data set D,
  a task T, and
  a performance measure M,
  a computer system is said to learn from D to perform the task T if, after learning, the system’s performance on T improves as measured by M.
◻ In other words, the learned model helps the system to perform T better compared to no learning.
An example

◻ Data: loan application data.
◻ Task: predict whether a loan should be approved or not.
◻ Performance measure: accuracy.

◻ No learning: classify all future applications (test data) to the majority class (i.e., Yes):
  Accuracy = 9/15 = 60%.
◻ We can do better than 60% with learning.
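
The 60% figure is simply the frequency of the majority class among the 15 training examples (9 Yes, 6 No). A quick sketch of this no-learning baseline:

```python
from collections import Counter

# Class labels of the 15 training examples: 9 Yes, 6 No.
labels = ["Yes"] * 9 + ["No"] * 6

# "No learning": always predict the majority class.
majority_class, count = Counter(labels).most_common(1)[0]
print(majority_class, count / len(labels))  # -> Yes 0.6
```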
Fundamental assumption of learning

Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).

◻ In practice, this assumption is often violated to a certain degree.
◻ Strong violations will clearly result in poor classification accuracy.
◻ To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
Decision Tree

◻ Decision tree learning is one of the most widely used techniques for classification.
  Its classification accuracy is competitive with other methods, and
  it is very efficient.
◻ The classification model is a tree, called a decision tree.
◻ C4.5 by Ross Quinlan is perhaps the best known system. It can be downloaded from the Web.
The loan data (reproduced)

(The loan-application table, with the approved-or-not class column, is shown again on the slide.)
A decision tree from the loan data

■ The tree consists of decision nodes and leaf nodes (classes). (The tree itself appears as a figure on the slide.)
Use the decision tree

(The slide traces a new case down the tree; the predicted class is No.)
Is the decision tree unique?

■ No. The slide shows a simpler tree.
■ We want a tree that is both small and accurate: easier to understand and likely to perform better.

■ Finding the best tree is NP-hard.
■ All current tree-building algorithms are heuristic algorithms.
From a decision tree to a set of rules

■ A decision tree can be converted to a set of rules (see the sketch below).
■ Each path from the root to a leaf is a rule.
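
A minimal sketch of this conversion (not from the slides). The tree representation is an assumption for illustration: a leaf is a class string, and an internal node is a pair (attribute, {value: subtree, …}):

```python
# Extract one rule per root-to-leaf path. The small tree below is a
# hypothetical example in the spirit of the loan data.
tree = ("own_house", {
    "yes": "Yes",
    "no": ("has_job", {"yes": "Yes", "no": "No"}),
})

def tree_to_rules(node, conditions=()):
    if isinstance(node, str):                      # leaf: emit a rule
        yield (list(conditions), node)
        return
    attribute, branches = node
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for conds, cls in tree_to_rules(tree):
    body = " AND ".join(f"{a} = {v}" for a, v in conds)
    print(f"IF {body} THEN class = {cls}")

# Prints:
#   IF own_house = yes THEN class = Yes
#   IF own_house = no AND has_job = yes THEN class = Yes
#   IF own_house = no AND has_job = no THEN class = No
```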
Algorithm for decision tree learning

◻ Basic algorithm (a greedy divide-and-conquer algorithm):
  Assume attributes are categorical for now (continuous attributes can be handled too).
  The tree is constructed in a top-down recursive manner.
  At the start, all the training examples are at the root.
  Examples are partitioned recursively based on selected attributes.
  Attributes are selected on the basis of an impurity function (e.g., information gain).
◻ Conditions for stopping partitioning:
  All examples for a given node belong to the same class.
  There are no remaining attributes for further partitioning – the majority class becomes the leaf.
  There are no examples left.
Decision tree learning algorithm

(The slide lists the recursive algorithm in pseudocode; a sketch follows below.)
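
A minimal Python sketch of the greedy algorithm just described, under the stated assumptions (categorical attributes, records as dicts with a "class" key, and the same nested-tuple tree representation as the rule-extraction sketch above). It is an illustration, not the slides' exact pseudocode; best_attribute is the gain-based selection defined later:

```python
from collections import Counter

def majority_class(examples):
    return Counter(r["class"] for r in examples).most_common(1)[0][0]

def build_tree(examples, attributes, best_attribute):
    classes = {r["class"] for r in examples}
    if len(classes) == 1:                 # all examples in the same class
        return classes.pop()
    if not attributes:                    # no attributes left: majority leaf
        return majority_class(examples)

    # Greedy step: pick the attribute that most reduces impurity.
    a = best_attribute(examples, attributes)
    remaining = [x for x in attributes if x != a]

    # Partition on the chosen attribute; branches are built only for
    # values that actually occur, so empty subsets do not arise here.
    branches = {}
    for value in {r[a] for r in examples}:
        subset = [r for r in examples if r[a] == value]
        branches[value] = build_tree(subset, remaining, best_attribute)
    return (a, branches)
```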
Choose an attribute to partition data

◻ The key to building a decision tree: which attribute to choose in order to branch.
◻ The objective is to reduce the impurity or uncertainty in the data as much as possible.
  A subset of the data is pure if all instances belong to the same class.
◻ The heuristic in C4.5 is to choose the attribute with the maximum information gain or gain ratio, based on information theory.
The loan data (reproduced)

(The loan-application table is shown again on the slide.)
Two possible roots, which is better?

■ The slide compares two candidate trees, each rooted at a different attribute. Fig. (B) seems to be better.
Information theory

◻ Information theory provides a mathematical basis for measuring information content.
◻ To understand the notion of information, think of it as providing the answer to a question, for example, whether a coin will come up heads.
  If one already has a good guess about the answer, then the actual answer is less informative.
  If one already knows that the coin is rigged so that it comes up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin (50-50).
Information theory (cont …)

◻ For a fair (honest) coin, you have no information, and you are willing to pay more (say, in dollars) for advance information: the less you know, the more valuable the information.
◻ Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits.
◻ One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.
Information theory: Entropy measure

◻ The entropy formula:

  entropy(D) = − Σ_{j=1}^{|C|} Pr(c_j) · log₂ Pr(c_j)

◻ Pr(c_j) is the probability of class c_j in data set D, and |C| is the number of classes.
◻ We use entropy as a measure of the impurity or disorder of data set D (or, a measure of the information in a tree).
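
A small sketch of this formula in Python (not from the slides), computing the entropy of a data set from its class counts:

```python
import math

def entropy(class_counts):
    """Entropy of a data set given the count of examples in each class."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]  # Pr(c_j); 0·log 0 = 0
    return -sum(p * math.log2(p) for p in probs)

print(entropy([9, 6]))   # the loan data (9 Yes, 6 No) -> about 0.971
print(entropy([5, 5]))   # maximally impure two-class set -> 1.0
print(entropy([10, 0]))  # pure set -> 0.0
```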
Entropy measure: let us get a feeling

■ As the data become purer and purer, the entropy value becomes smaller and smaller. This is useful to us!
Information gain

◻ Given a set of examples D, we first compute its entropy:

  entropy(D) = − Σ_{j=1}^{|C|} Pr(c_j) · log₂ Pr(c_j)

◻ If we make attribute Ai, with v values, the root of the current tree, this partitions D into v subsets D1, D2, …, Dv. The expected entropy if Ai is used as the current root is:

  entropy_Ai(D) = Σ_{k=1}^{v} (|Dk| / |D|) · entropy(Dk)
Information gain (cont …)

◻ The information gained by selecting attribute Ai to branch or to partition the data is:

  gain(D, Ai) = entropy(D) − entropy_Ai(D)

◻ We choose the attribute with the highest gain to branch/split the current tree.
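
Continuing the earlier sketches (the record format from the data section and the entropy function above), expected entropy and information gain in Python, including the best_attribute helper used by build_tree:

```python
from collections import Counter

def data_entropy(examples):
    # Reuses entropy() from the sketch above.
    return entropy(list(Counter(r["class"] for r in examples).values()))

def expected_entropy(examples, attribute):
    """entropy_Ai(D): weighted entropy of the subsets D1 ... Dv."""
    total = len(examples)
    values = {r[attribute] for r in examples}
    return sum(
        len(subset) / total * data_entropy(subset)
        for subset in ([r for r in examples if r[attribute] == v] for v in values)
    )

def gain(examples, attribute):
    """gain(D, Ai) = entropy(D) - entropy_Ai(D)."""
    return data_entropy(examples) - expected_entropy(examples, attribute)

def best_attribute(examples, attributes):
    """Greedy choice used by build_tree: the attribute with the highest gain."""
    return max(attributes, key=lambda a: gain(examples, a))
```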
An example

■ Own_house is the best choice for the root. (The slide computes the information gain of each attribute on the loan data.)
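
To illustrate how such a comparison comes out (the slide's exact per-attribute table is not reproduced here; the split counts below are an assumption for illustration), suppose a binary attribute splits the 15 loan examples into a pure subset of 6 and a subset of 9 containing 3 Yes and 6 No:

  entropy(D) = entropy(9/15, 6/15) ≈ 0.971
  entropy_Ai(D) = (6/15) · 0 + (9/15) · entropy(3/9, 6/9) ≈ 0.6 · 0.918 ≈ 0.551
  gain(D, Ai) ≈ 0.971 − 0.551 = 0.420

The attribute whose gain is highest among all candidates, Own_house here, becomes the root.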
We build the final tree

■ We can use the information gain ratio to evaluate the impurity as well.
