CH 8 Data Mining

This chapter discusses classification and prediction in data mining, highlighting their applications in fields such as banking, marketing, and medicine. It explains the two-step process of classification (model construction and model usage) and the concepts of supervised and unsupervised learning. It also covers decision tree induction, attribute selection measures, and a comparison of attribute selection methods such as information gain and the Gini index.


Data Mining:

— Chapter 8 —
What is Classification?
◼ A bank loans officer needs analysis of her data to
learn which loan applicants are “safe” and which are
“risky” for the bank.
◼ A marketing manager at AllElectronics needs data
analysis to help guess whether a customer with a
given profile will buy a new computer. (Yes/No)
◼ A medical researcher wants to analyze breast cancer
data to predict which one of three specific
treatments a patient should receive. (A/B/C)
◼ In each of these examples, the data analysis task is
classification, where a model or classifier is
constructed to predict class (categorical) labels.
What is Prediction?
◼ Suppose that the marketing manager wants to
predict how much a given customer will spend during
a sale at AllElectronics.
◼ This data analysis task is an example of numeric
prediction, where the model constructed predicts a
continuous-valued function, or ordered value, as
opposed to a class label.
◼ This model is a predictor. Regression analysis is a
statistical methodology that is most often used for
numeric prediction.
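◼ A minimal sketch of such a predictor (an assumption: the slides name no library or data, so scikit-learn and a made-up customer table are used here purely for illustration):

  # Regression-based numeric prediction (hypothetical features and spend values)
  import numpy as np
  from sklearn.linear_model import LinearRegression

  # Made-up training data: [age, income in $1000s] -> amount spent at the sale
  X_train = np.array([[25, 30], [35, 60], [45, 80], [50, 120], [23, 25]])
  y_train = np.array([120.0, 340.0, 510.0, 800.0, 95.0])

  predictor = LinearRegression().fit(X_train, y_train)

  # Predict a continuous-valued amount (not a class label) for a new profile
  print(predictor.predict(np.array([[40, 70]])))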
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
  ◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  ◼ The set of tuples used for model construction is the training set
  ◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects
  ◼ Estimate the accuracy of the model
    ◼ The known label of each test sample is compared with the classified result from the model
    ◼ The accuracy rate is the percentage of test set samples that are correctly classified by the model
    ◼ The test set is independent of the training set (otherwise overfitting)
  ◼ If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
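◼ A minimal sketch of the two-step process, assuming scikit-learn and its bundled Iris data as a stand-in for the examples above:

  # Step 1: construct the model on a training set; Step 2: estimate accuracy on
  # an independent test set, then use the model on unlabeled data.
  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score

  X, y = load_iris(return_X_y=True)

  # Keep the test set separate from the training set (otherwise the accuracy
  # estimate is optimistic -- overfitting).
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

  model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)  # model construction
  acc = accuracy_score(y_test, model.predict(X_test))                    # model usage: accuracy
  print(f"accuracy on the independent test set: {acc:.2f}")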
Learning and model construction
Terminology
◼ Training dataset
◼ Attribute vector
◼ Class label attribute
◼ Training sample/example/instance/object
Test and Classification

◼ Classification: test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Terminology
◼ Test dataset
◼ Test samples
◼ Accuracy of the model
◼ Overfit (optimistic estimation of accuracy)
Process (1): Model Construction

Training Data → Classification Algorithm → Classifier (Model)

Training Data:
  NAME   RANK            YEARS   TENURED
  Mike   Assistant Prof  3       no
  Mary   Assistant Prof  7       yes
  Bill   Professor       2       yes
  Jim    Associate Prof  7       yes
  Dave   Assistant Prof  6       no
  Anne   Associate Prof  3       no

Resulting classifier (model):
  IF rank = ‘professor’ OR years > 6
  THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier from Process (1) is applied first to the testing data (to estimate accuracy) and then to unseen data.

Testing Data:
  NAME     RANK            YEARS   TENURED
  Tom      Assistant Prof  2       no
  Merlisa  Associate Prof  7       no
  George   Professor       5       yes
  Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
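◼ A small sketch (plain Python) that applies the classifier learned in Process (1) to the testing data and the unseen tuple above:

  # The classifier from Process (1): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
  def predict_tenured(rank, years):
      return "yes" if rank.lower() == "professor" or years > 6 else "no"

  # Testing data from the slide (known labels), used to estimate accuracy
  test_data = [("Tom", "Assistant Prof", 2, "no"),
               ("Merlisa", "Associate Prof", 7, "no"),
               ("George", "Professor", 5, "yes"),
               ("Joseph", "Assistant Prof", 7, "yes")]
  correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in test_data)
  print(f"accuracy on testing data: {correct}/{len(test_data)}")

  # Unseen data: (Jeff, Professor, 4) -> Tenured?
  print(predict_tenured("Professor", 4))   # "yes", matched by the rank test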
Supervised vs. Unsupervised Learning

◼ Supervised learning (classification)
  ◼ Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  ◼ New data is classified based on the training set
◼ Unsupervised learning (clustering)
  ◼ The class labels of the training data are unknown
  ◼ Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Decision Tree
Terminology
◼ Decision tree induction is the learning of decision
trees from class-labeled training tuples.
◼ A decision tree is a flowchart-like tree structure, where
  ◼ each internal node (nonleaf node) denotes a test on an attribute,
  ◼ each branch represents an outcome of the test,
  ◼ and each leaf node (or terminal node) holds a class label.
◼ The topmost node in a tree is the root node.
Decision Tree Induction: An Example
❑ Training data set: buys_computer
❑ The data set follows an example of Quinlan’s ID3 (Playing Tennis)

  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31…40    high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31…40    low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31…40    medium   no        excellent       yes
  31…40    high     yes       fair            yes
  >40      medium   no        excellent       no

❑ Resulting tree:

  age?
    <=30  → student?
              no  → no
              yes → yes
    31…40 → yes
    >40   → credit_rating?
              excellent → no
              fair      → yes
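❑ A sketch of inducing a tree from this table, assuming pandas and scikit-learn; scikit-learn builds binary splits, so the learned tree may not match the multiway ID3 tree above exactly:

  # Decision tree induction on the buys_computer table (one-hot encoding the
  # categorical attributes before fitting an entropy-based tree).
  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier, export_text

  rows = [("<=30","high","no","fair","no"),      ("<=30","high","no","excellent","no"),
          ("31...40","high","no","fair","yes"),  (">40","medium","no","fair","yes"),
          (">40","low","yes","fair","yes"),      (">40","low","yes","excellent","no"),
          ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
          ("<=30","low","yes","fair","yes"),     (">40","medium","yes","fair","yes"),
          ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
          ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]
  df = pd.DataFrame(rows, columns=["age", "income", "student", "credit_rating", "buys_computer"])

  X = pd.get_dummies(df.drop(columns="buys_computer"))   # categorical -> indicator columns
  tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, df["buys_computer"])
  print(export_text(tree, feature_names=list(X.columns)))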
Why decision trees?
◼ The construction of decision tree classifiers does not require
any domain knowledge or parameter setting, and therefore is
appropriate for exploratory knowledge discovery.
◼ Decision trees can handle multidimensional data. Their
representation of acquired knowledge in tree form is intuitive
and generally easy to assimilate by humans.
◼ The learning and classification steps of decision tree induction
are simple and fast. In general, decision tree classifiers have
good accuracy. However, successful use may depend on the
data at hand. Decision tree induction algorithms have been
used for classification in many application areas such as
medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology. Decision trees are the basis
of several commercial rule induction systems.
Concepts in learning decision trees
◼ Attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes.
◼ When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data (see the sketch after this list).
◼ Scalability is a big issue for the induction of decision trees from large databases.
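◼ One common pruning approach is cost-complexity pruning; the slides do not name a method, so this sketch assumes scikit-learn (≥ 0.22), which exposes it via the ccp_alpha parameter:

  # Cost-complexity pruning: a larger ccp_alpha removes more branches, trading
  # training-set fit for a simpler tree that tends to generalize better.
  from sklearn.datasets import load_breast_cancer
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)
  full_tree   = DecisionTreeClassifier(random_state=0).fit(X, y)
  pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

  print("leaves before pruning:", full_tree.get_n_leaves())
  print("leaves after pruning: ", pruned_tree.get_n_leaves())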
Tree algorithms

◼ ID3 (Iterative Dichotomiser): a decision tree algorithm developed by J. Ross Quinlan, a researcher in machine learning
◼ C4.5 (a successor of ID3)
◼ CART (Classification and Regression Trees)
Attribute Selection Measure:
Information Gain (ID3/C4.5)
◼ Select the attribute with the highest information gain
◼ Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
◼ Expected information (entropy) needed to classify a tuple in D:

  Info(D) = − Σ_{i=1}^{m} p_i log₂(p_i)

◼ Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

◼ Information gained by branching on attribute A:

  Gain(A) = Info(D) − Info_A(D)
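◼ The three formulas translate directly into code; a minimal sketch in plain Python:

  # Direct translation of Info(D), Info_A(D), and Gain(A) (no libraries)
  from math import log2

  def info(counts):
      # Info(D) = -sum_i p_i * log2(p_i); counts are the class frequencies in D
      total = sum(counts)
      return -sum((c / total) * log2(c / total) for c in counts if c > 0)

  def info_a(partitions):
      # Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|
      total = sum(sum(p) for p in partitions)
      return sum((sum(p) / total) * info(p) for p in partitions)

  def gain(counts, partitions):
      # Gain(A) = Info(D) - Info_A(D)
      return info(counts) - info_a(partitions)

  print(round(info([9, 5]), 3))   # 0.940, the Info(D) used on the next slide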
Attribute Selection: Information Gain
◼ Class P: buys_computer = “yes” (9 tuples)
◼ Class N: buys_computer = “no” (5 tuples)
◼ Training data: the buys_computer table shown earlier

  Info(D) = I(9,5) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940

◼ Splitting on age gives the partitions:

  age      p_i   n_i   I(p_i, n_i)
  <=30     2     3     0.971
  31…40    4     0     0
  >40      3     2     0.971

◼ (5/14) I(2,3) means that “age <=30” covers 5 of the 14 samples, with 2 yes’es and 3 no’s.

  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  Gain(age) = Info(D) − Info_age(D) = 0.246

◼ Similarly:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

◼ Because age has the highest information gain, it is selected as the splitting attribute.
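◼ A sketch that reproduces these numbers from the per-value (yes, no) class counts of the training table (plain Python):

  from math import log2

  def I(p, n):
      # I(p, n): entropy of a partition with p "yes" and n "no" tuples
      total = p + n
      return -sum((c / total) * log2(c / total) for c in (p, n) if c > 0)

  info_d = I(9, 5)                                     # 0.940
  splits = {"age":           [(2, 3), (4, 0), (3, 2)], # <=30, 31...40, >40
            "income":        [(2, 2), (4, 2), (3, 1)], # high, medium, low
            "student":       [(6, 1), (3, 4)],         # yes, no
            "credit_rating": [(6, 2), (3, 3)]}         # fair, excellent
  for attr, counts in splits.items():
      n_total = sum(p + n for p, n in counts)          # 14
      info_attr = sum(((p + n) / n_total) * I(p, n) for p, n in counts)
      print(f"Gain({attr}) = {info_d - info_attr:.3f}")
  # Gain(age)=0.246, Gain(income)=0.029, Gain(student)=0.151, Gain(credit_rating)=0.048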


◼ Conditions for stopping partitioning:
  ◼ All samples for a given node belong to the same class
  ◼ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  ◼ There are no samples left


Gini Index (CART, IBM IntelligentMiner)
◼ If a data set D contains examples from n classes, the gini index gini(D) is defined as

  gini(D) = 1 − Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D
◼ If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

◼ Reduction in impurity:

  Δgini(A) = gini(D) − gini_A(D)

◼ The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
◼ Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:

  gini(D) = 1 − (9/14)² − (5/14)² = 0.459

◼ Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

  gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

◼ Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index.
◼ All attributes are assumed continuous-valued
◼ May need other tools, e.g., clustering, to get the possible split values
◼ Can be modified for categorical attributes
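◼ A sketch reproducing this computation in plain Python; the (yes, no) counts for the income split are read off the training table: D1 = {low, medium} has 7 yes / 3 no, D2 = {high} has 2 yes / 2 no:

  def gini(counts):
      # gini(D) = 1 - sum_j p_j^2; counts are the class frequencies in D
      total = sum(counts)
      return 1 - sum((c / total) ** 2 for c in counts)

  print(round(gini([9, 5]), 3))                        # gini(D) = 0.459

  d1, d2 = [7, 3], [2, 2]
  n = sum(d1) + sum(d2)                                # 14
  gini_split = (sum(d1) / n) * gini(d1) + (sum(d2) / n) * gini(d2)
  print(round(gini_split, 3))                          # gini_income{low,medium}(D) = 0.443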
Comparing Attribute Selection Measures

◼ The three measures, in general, return good results, but:
  ◼ Information gain: biased towards multivalued attributes
  ◼ Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  ◼ Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
Rainforest: Training Set and Its AVC Sets

Training examples: the buys_computer table shown earlier. Each AVC-set records, for one attribute, the Buy_Computer class counts per attribute value.

AVC-set on age:
  age      yes   no
  <=30     2     3
  31…40    4     0
  >40      3     2

AVC-set on income:
  income   yes   no
  high     2     2
  medium   4     2
  low      3     1

AVC-set on student:
  student   yes   no
  yes       6     1
  no        3     4

AVC-set on credit_rating:
  credit_rating   yes   no
  fair            6     2
  excellent       3     3
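A sketch of building AVC-sets as (attribute value × class label) count tables, assuming pandas; shown for age and student only, the other attributes are analogous. (RainForest itself is a scalable, disk-based algorithm; this only illustrates what the counts are.)

  import pandas as pd

  df = pd.DataFrame({
      "age":     ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
                  "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"],
      "student": ["no", "no", "no", "no", "yes", "yes", "yes",
                  "no", "yes", "yes", "yes", "no", "yes", "no"],
      "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                        "no", "yes", "yes", "yes", "yes", "yes", "no"],
  })

  print(pd.crosstab(df["age"], df["buys_computer"]))      # AVC-set on age
  print(pd.crosstab(df["student"], df["buys_computer"]))  # AVC-set on student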
Rule Extraction from a Decision Tree
  age?
    <=30  → student?  (no → no, yes → yes)
    31…40 → yes
    >40   → credit_rating?  (excellent → no, fair → yes)

◼ Rules are easier to understand than large trees
◼ One rule is created for each path from the root to a leaf
◼ Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
◼ Rules are mutually exclusive and exhaustive
◼ Example: rule extraction from our buys_computer decision tree:
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
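◼ The five rules as a plain Python function (young / mid-age / old correspond to age <=30 / 31…40 / >40 in the tree):

  # Exactly one rule fires for each combination of age group, student, and
  # credit_rating (the rules are mutually exclusive and exhaustive).
  def buys_computer(age, student, credit_rating):
      if age == "young" and student == "no":
          return "no"
      if age == "young" and student == "yes":
          return "yes"
      if age == "mid-age":
          return "yes"
      if age == "old" and credit_rating == "excellent":
          return "no"
      if age == "old" and credit_rating == "fair":
          return "yes"

  print(buys_computer("old", "no", "fair"))   # -> "yes"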
