Data Mining: Concepts and Techniques - Chapter 6
Jianlin Cheng
Department of Computer Science
University of Missouri
Slides Adapted from
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Target marketing
Medical diagnosis
Fraud detection
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
Process (1): Model Construction
[Diagram: Training Data + Classification Algorithm → Classifier; the Classifier is then applied to Testing Data and to Unseen Data]
Unseen data: (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Supervised vs. Unsupervised Learning
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
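A minimal preprocessing sketch of these three steps with scikit-learn, assuming a numeric pandas DataFrame df with a class column named "label" (both names are hypothetical):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

def prepare(df: pd.DataFrame, label: str = "label", k: int = 5):
    X, y = df.drop(columns=[label]), df[label]
    # Data cleaning: fill missing numeric values with the column mean
    X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                     columns=X.columns)
    # Relevance analysis: keep the k most informative attributes
    selector = SelectKBest(mutual_info_classif, k=min(k, X.shape[1]))
    X_sel = selector.fit_transform(X, y)
    # Data transformation: normalize each attribute to zero mean, unit variance
    X_norm = StandardScaler().fit_transform(X_sel)
    return X_norm, y
```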
Accuracy
classifier accuracy: predicting class label
age?
<=30 overcast
31..40 >40
no yes yes
Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Decision Boundary?
[Figure, repeated across three slides: training examples from Class A, Class B, and Class C scattered in the X-Y plane; what decision boundary separates the classes?]
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point yielding maximum information gain for A is
selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
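A small sketch of this procedure (helper names are mine): sort the values, evaluate each adjacent midpoint as a candidate split, and keep the one with maximum information gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, info_gain) for a continuous attribute A."""
    pairs = sorted(zip(values, labels))            # sort the values of A
    base = entropy([y for _, y in pairs])
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        split = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint (a_i + a_{i+1}) / 2
        d1 = [y for v, y in pairs if v <= split]      # D1: tuples with A <= split-point
        d2 = [y for v, y in pairs if v > split]       # D2: tuples with A > split-point
        gain = base - (len(d1) / len(pairs)) * entropy(d1) \
                    - (len(d2) / len(pairs)) * entropy(d2)
        if gain > best[1]:
            best = (split, gain)
    return best
```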
Gain Ratio for Attribute Selection (C4.5)
GainRatio(A) = Gain(A)/SplitEntropy(A)
Ex. SplitEntropy_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557
gain_ratio(income) = 0.029 / 1.557 ≈ 0.019
The attribute with the maximum gain ratio is selected as
the splitting attribute
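A tiny sketch of the ratio itself (function name is mine); income splitting the 14 tuples into groups of 4, 6, and 4 gives the number used above:

```python
import math

def gain_ratio(gain, subset_sizes):
    """GainRatio(A) = Gain(A) / SplitEntropy_A(D); SplitEntropy is the entropy
    of the partition sizes produced by splitting on A."""
    n = sum(subset_sizes)
    split_entropy = -sum((s / n) * math.log2(s / n) for s in subset_sizes if s)
    return gain / split_entropy

print(gain_ratio(0.029, [4, 6, 4]))   # ≈ 0.019
```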
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 − Σ_{j=1}^{n} p_j², where p_j is the relative frequency (proportion) of class j in D
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|)·gini(D1) + (|D2| / |D|)·gini(D2)
Reduction in impurity: Δgini(A) = gini(D) − gini_A(D)
The attribute that provides the largest reduction in impurity is chosen to split the node (need to enumerate all possible split points for each attribute)
Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
gini(D) = 1 − (9/14)² − (5/14)² = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
gini_income∈{low,medium}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2) = 0.443
Likewise, gini_income∈{medium,high}(D) = 0.450 and gini_income∈{low,high}(D) = 0.458, so the split on {low, medium} and {high} gives the minimum Gini index and is selected
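A short sketch that reproduces these numbers; the per-income class labels below are read directly off the 14-tuple buys_computer training table, and the variable names are mine:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(d1, d2):
    n = len(d1) + len(d2)
    return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)

# buys_computer labels grouped by income value in the training data
low    = ["yes", "no", "yes", "yes"]
medium = ["yes", "no", "yes", "yes", "yes", "no"]
high   = ["no", "no", "yes", "yes"]

print(gini(low + medium + high))        # gini(D)              ≈ 0.459
print(gini_split(low + medium, high))   # {low,medium}|{high}  ≈ 0.443
print(gini_split(medium + high, low))   # {medium,high}|{low}  ≈ 0.450
print(gini_split(low + high, medium))   # {low,high}|{medium}  ≈ 0.458
```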
• Classification
• Clustering
• Regression
• Feature Selection
• Outlier detection
Review paper:
http://www.sciencedirect.com/science/article/pii/S0031320310003973
An Example
Wikipedia
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μCi and standard deviation σCi:
g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x − μ)² / (2σ²)), and P(xk | Ci) = g(xk, μCi, σCi)
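A minimal sketch of that density (the example numbers at the end are made up):

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): normal density used for P(x_k | C_i) when attribute A_k is
    continuous; mu and sigma are estimated from the class-C_i training tuples."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g., age is continuous and class "yes" has mean 38, std 12 (made-up values)
p_age_given_yes = gaussian(35, 38, 12)
```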
Naïve Bayesian Classifier: Training Dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
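A sketch that carries this example through to the class-conditional products by simple counting over the 14 training tuples above (variable names are mine):

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) from the training table
data = [
    ("<=30", "high", "no", "fair", "no"),      ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),   (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),      (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),     (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),  (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")          # the unseen sample

classes = Counter(row[-1] for row in data)     # 9 "yes", 5 "no"
scores = {}
for c, n_c in classes.items():
    rows = [r for r in data if r[-1] == c]
    p = n_c / len(data)                        # prior P(Ci)
    for k, value in enumerate(X):              # product of P(xk | Ci)
        p *= sum(1 for r in rows if r[k] == value) / n_c
    scores[c] = p

print(scores)   # P(X|yes)P(yes) ≈ 0.028, P(X|no)P(no) ≈ 0.007 → predict "yes"
```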
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables, and such dependencies cannot be modeled by a naïve Bayesian classifier
How to deal with these dependencies?
Bayesian Belief Networks
Weka Demo
Vote Classification
http://people.csail.mit.edu/kersting/profile/PROFILE_nb.html
Classification:
predicts categorical class labels
x1: # of occurrences of the word "homepage"
x2: # of occurrences of the word "welcome"
Mathematically
x ∈ X = ℝⁿ, y ∈ Y = {+1, −1}
We want a function f: X → Y
• Perceptron: update W
additively
https://www.youtube.com/watch?v=vGwemZhPlsA
Images.google.com
A Neuron (= a perceptron)
[Diagram: inputs x1 … xn with weights w1 … wn and a bias weight w0 feed an activation function f, producing output y]
For example: y = sign(Σ_{i=1}^{n} w_i·x_i + w_0), where x is the input vector, w the weight vector, the sum is the weighted sum, and sign is the activation function
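A minimal perceptron sketch matching this picture: a weighted sum plus bias, a sign activation, and the additive weight update mentioned earlier (the learning rate eta and other names are assumptions):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """X: (m, n) input vectors, y: labels in {-1, +1}. Returns (w, w0)."""
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = np.sign(np.dot(w, xi) + w0) or 1.0   # y = sign(sum w_i*x_i + w0)
            if pred != yi:                              # additive update on mistakes
                w += eta * yi * xi
                w0 += eta * yi
    return w, w0
```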
[Diagram: multilayer feed-forward network with an input layer (input vector X), a hidden layer, and an output layer (output vector), connected by weights wij]
Activation / Transfer Function
Sigmoid function: σ(x) = 1 / (1 + e^(−x))
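A small sketch of the sigmoid and of one feed-forward pass through a hidden layer (the weight shapes and names are assumptions):

```python
import numpy as np

def sigmoid(x):
    # squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, W_hidden, b_hidden, W_out, b_out):
    """One pass: input vector -> hidden layer -> output layer, sigmoid at each layer."""
    h = sigmoid(W_hidden @ x + b_hidden)
    return sigmoid(W_out @ h + b_out)
```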
On sysbio.rnet.missouri.edu server
In /home/chengji/nnrank1.2
Weka (Java)
NNClass in NNRank package (C++, good
performance, very fast):
http://sysbio.rnet.missouri.edu/multicom_toolbox/
tools.html
Matlab
Pylearn2
Theano
Caffe
Torch
Cuda-convnet
Deeplearning4j
RapidMiner
Weka
R-programming
Orange (python)
KNIME
NLTK
Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi its associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
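A brief scikit-learn sketch of fitting a maximum-margin (linear) classifier; the toy arrays are placeholders, not data from the slides:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])  # toy tuples
y = np.array([-1, -1, -1, 1, 1, 1])                             # two classes

clf = SVC(kernel="linear", C=1.0)    # searches for the maximum-margin hyperplane
clf.fit(X, y)
print(clf.support_vectors_)          # the tuples that define the margin
print(clf.predict([[4, 4]]))
```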
http://www.youtube.com/watch?v=3liCbRZPrZA
SVM-light: http://svmlight.joachims.org/
LIBSVM:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Gist: http://bioinformatics.ubc.ca/gist/
Weka
SVM Website
http://www.kernel-machines.org/
Representative implementations
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Instance-based learning:
Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
Typical approaches
k-nearest neighbor approach: instances are represented as points in a Euclidean space
Locally weighted regression
[Figure: a query point xq surrounded by + and − training examples; its class is decided by the labels of its nearest neighbors]
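A plain k-NN sketch in the spirit of the figure: store the training tuples, and at query time take a majority vote among the k closest ones by Euclidean distance (names are mine):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Lazy evaluation: all the work happens when x_query arrives."""
    dist = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dist)[:k]                     # indices of the k nearest neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```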
Discussion on the k-NN Algorithm (KNN)
Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Method of least squares for Y = w0 + w1·X:
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²,   w0 = ȳ − w1·x̄
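A direct transcription of these two estimates, assuming 1-D numpy arrays x and y:

```python
import numpy as np

def least_squares(x, y):
    """Closed-form w1 and w0 for Y = w0 + w1 * X."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1
```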
X1      X2      …   Die (1) / Live (0)
0.01    0.004   …   1
0.001   0.02    …   0
…       …       …   …
0.003   0.005   …   1
Binomial distribution: for one dose level xi with observed outcome ti ∈ {0, 1}, yi = P(die | xi), and the likelihood of the observation is yi^ti · (1 − yi)^(1 − ti)
www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
2. If all the points in the node have the same value for all the
independent variables, stop. Otherwise, search over the splits of all
variables for the one which will reduce S as much as possible. If the
largest decrease in S would be less than some threshold δ, or one of
the resulting nodes would contain less than q points, stop.
Otherwise, take that split, creating two new nodes.
www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
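A sketch of the split search in step 2, assuming S is the sum of squared deviations from a node's mean; delta and q follow the names in the text, everything else is mine:

```python
import numpy as np

def sse(y):
    # S for one node: sum of squared deviations from the node mean
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y, delta=1e-3, q=5):
    """Search all variables and split points for the split that reduces S most.
    Returns (variable index, split value), or None if the stopping rules apply."""
    parent = sse(y)
    best, best_drop = None, 0.0
    for j in range(X.shape[1]):                      # every independent variable
        for v in np.unique(X[:, j])[:-1]:            # every candidate split point
            left, right = y[X[:, j] <= v], y[X[:, j] > v]
            if len(left) < q or len(right) < q:      # a node would be too small
                continue
            drop = parent - (sse(left) + sse(right))
            if drop > best_drop:
                best, best_drop = (j, v), drop
    return best if best_drop >= delta else None      # stop if the decrease < delta
```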
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets, each of approximately equal size; at the i-th iteration, use subset Di as the test set and the remaining subsets together as the training set
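A compact 10-fold cross-validation sketch with scikit-learn (the iris data and decision tree are stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # stand-in dataset
# 10-fold cross-validation: each subset serves as the test set exactly once
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean(), scores.std())
```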
Ensemble methods
Use a combination of models to increase accuracy
The bagged classifier M* counts the votes and assigns the class with the most votes to X
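A minimal bagging sketch: train each base model on a bootstrap sample, then let M* assign each tuple the majority class among the models' votes (scikit-learn trees as base learners; names are mine):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_models=25, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap sample (with replacement)
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    # M*: for each test tuple, return the class with the most votes
    return np.array([Counter(col).most_common(1)[0][0] for col in zip(*votes)])
```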