CH 8 Data Mining
What is Classification
◼ A bank loans officer needs analysis of her data to
learn which loan applicants are “safe” and which are
“risky” for the bank.
◼ A marketing manager at AllElectronics needs data
analysis to help guess whether a customer with a
given profile will buy a new computer. (Yes/No)
◼ A medical researcher wants to analyze breast cancer
data to predict which one of three specific
treatments a patient should receive. (A/B/C)
◼ In each of these examples, the data analysis task is
classification, where a model or classifier is
constructed to predict class (categorical) labels.
What is Prediction
◼ Suppose that the marketing manager wants to
predict how much a given customer will spend during
a sale at AllElectronics.
◼ This data analysis task is an example of numeric
prediction, where the model constructed predicts a
continuous-valued function, or ordered value, as
opposed to a class label.
◼ This model is a predictor. Regression analysis is a
statistical methodology that is most often used for
numeric prediction.
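As a minimal sketch of numeric prediction by regression analysis, the following fits a simple least-squares line to hypothetical customer data (the income and spending figures are illustrative, not from the text):

```python
# Numeric prediction via simple linear (least-squares) regression:
# the predictor outputs a continuous value, not a class label.

def fit_simple_regression(xs, ys):
    """Fit y = w*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Hypothetical training data: customer income (in $1000s) vs. amount spent.
incomes = [30, 45, 60, 75, 90]
spent = [200, 320, 410, 540, 650]

w, b = fit_simple_regression(incomes, spent)
print(f"predicted spend for income 50k: {w * 50 + b:.2f}")
```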
Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
◼ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is the training set
[Figure: the Training Data are fed to a Classification Algorithm, which outputs the Classifier (Model).]

NAME    RANK            YEARS   TENURED
Mike    Assistant Prof  3       no
Mary    Assistant Prof  7       yes
Bill    Professor       2       yes
Jim     Associate Prof  7       yes
Dave    Assistant Prof  6       no
Anne    Associate Prof  3       no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
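Step 1 can be sketched in code: the training set from the table above, and the rule a classification algorithm would derive from it, represented here as a plain Python function:

```python
# Step 1 (model construction): training set and the learned rule.

training_set = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classifier(rank, years):
    """The learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The rule is consistent with every training tuple:
for name, rank, years, tenured in training_set:
    assert classifier(rank, years) == tenured
print("rule fits all training tuples")
```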
Process (2): Using the Model in Prediction
[Figure: the Classifier is applied first to the Testing Data, to estimate accuracy, and then to Unseen Data.]

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
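Continuing the sketch, step 2 applies the learned rule to the test set to estimate accuracy, and then to the unseen tuple (Jeff, Professor, 4):

```python
# Step 2 (using the model): estimate accuracy on the test set,
# then classify the unseen tuple.

def classifier(rank, years):
    # The model from step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classifier(r, y) == t for _, r, y, t in test_set)
print(f"test accuracy: {correct}/{len(test_set)}")  # 3/4: Merlisa is misclassified

print("Jeff tenured?", classifier("Professor", 4))  # -> yes
```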
Supervised vs. Unsupervised Learning
◼ Supervised learning (classification): the training data are
accompanied by labels indicating the class of each observation,
and new data are classified based on the training set.
◼ Unsupervised learning (clustering): the class labels of the
training data are unknown; the aim is to establish the existence
of classes or clusters in the data.
Why decision tree
◼ The construction of decision tree classifiers does not require
any domain knowledge or parameter setting, and therefore is
appropriate for exploratory knowledge discovery.
◼ Decision trees can handle multidimensional data. Their
representation of acquired knowledge in tree form is intuitive
and generally easy to assimilate by humans.
◼ The learning and classification steps of decision tree induction
are simple and fast, and decision tree classifiers generally have
good accuracy, although successful use may depend on the
data at hand.
◼ Decision tree induction algorithms have been used for
classification in many application areas, such as medicine,
manufacturing and production, financial analysis, astronomy,
and molecular biology. Decision trees are the basis of several
commercial rule induction systems.
Concepts in learning decision trees
◼ Attribute selection measures are used to select the attribute
that best partitions the tuples into distinct classes.
◼ When decision trees are built, many of the branches may reflect
noise or outliers in the training data. Tree pruning attempts to
identify and remove such branches, with the goal of improving
classification accuracy on unseen data.
◼ Scalability is a key issue for the induction of decision trees
from large databases
Tree algorithms
◼ Expected information (entropy) needed to classify a tuple in D
after partitioning on attribute A:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
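A small sketch of this computation, with Info(D) taken as the usual entropy of the class distribution, −Σ pᵢ log₂ pᵢ. The 9-yes/5-no distribution matches the 14-tuple example used below for the Gini index; the 3-way partition is a hypothetical attribute split:

```python
# Info(D) = -sum(p_i * log2(p_i)); Info_A(D) is the weighted average
# over the v partitions D_1..D_v induced by attribute A.
import math

def info(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_attr(partitions):
    """Info_A(D): weighted entropy over the partitions D_1..D_v."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

# 14 tuples: 9 "yes", 5 "no".
print(f"{info([9, 5]):.3f}")  # 0.940

# Hypothetical 3-way partition of the same 14 tuples by some attribute A:
print(f"{info_attr([[2, 3], [4, 0], [3, 2]]):.3f}")  # 0.694
```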
◼ Reduction in impurity:

$$\Delta\mathrm{gini}(A) = \mathrm{gini}(D) - \mathrm{gini}_A(D)$$

◼ The attribute that provides the smallest gini_split(D) (or,
equivalently, the largest reduction in impurity) is chosen to split
the node (all possible splitting points must be enumerated for
each attribute)
Computation of Gini Index
◼ Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:

$$\mathrm{gini}(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
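The arithmetic can be checked directly:

```python
# Gini index of D: gini(D) = 1 - sum(p_i^2) over the class proportions.

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# 9 "yes" and 5 "no" out of 14 tuples:
print(f"{gini([9, 5]):.3f}")  # 0.459
```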
◼ Suppose the attribute income partitions D into 10 tuples in D1:
{low, medium} and 4 in D2:

$$\mathrm{gini}_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\,\mathrm{Gini}(D_1) + \frac{4}{14}\,\mathrm{Gini}(D_2)$$
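The weighted Gini of this split can be sketched as follows. The class breakdown inside each partition is an assumption for illustration (7 yes / 3 no in D1, 2 yes / 2 no in D2); the 10/14 and 4/14 weights come from the partition sizes above:

```python
# gini_income(D) = (10/14) * Gini(D1) + (4/14) * Gini(D2)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index over the partitions induced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Assumed class distributions: D1 = 7 yes / 3 no, D2 = 2 yes / 2 no.
d1, d2 = [7, 3], [2, 2]
print(f"{gini_split([d1, d2]):.3f}")  # 0.443
```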