datamining-lect12
LECTURE 12
Classification
Nearest Neighbor Classification
Support Vector Machines
Logistic Regression
Naïve Bayes Classifier
Supervised Learning
Illustrating Classification Task
[Figure: a training set is fed to a learning algorithm to induce a model; the model is then applied (deduction) to a test set whose class labels are unknown.]

Training Set (excerpt):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
6    No       Medium   60K      No

Test Set (excerpt):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
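A hedged sketch of this induction/deduction workflow, assuming scikit-learn is available; the numeric encoding of the attributes (Yes/No as 1/0, Small/Medium/Large as 0/1/2) and the training labels are made up for illustration, since the excerpt above shows only "No" labels.

# Induction: learn a model from the training set; deduction: apply it to the test set.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [0, 1, 60]]             # Tid 1, 2, 3, 6
y_train = ["No", "No", "Yes", "No"]                                      # made-up labels
X_test  = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]  # Tid 11-15

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)        # induction
print(model.predict(X_test))                                             # deduction: predicted Class values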
NEAREST NEIGHBOR CLASSIFICATION
Instance-Based Classifiers
• Store the training records.
• Use the training records directly to predict the class label of unseen cases.
[Figure: a set of stored cases (Atr1, …, AtrN, Class) against which an unseen case (Atr1, …, AtrN) is matched.]
Instance-Based Classifiers
• Examples:
  • Rote-learner
    • Memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples.
  • Nearest neighbor
    • Uses the k "closest" points (nearest neighbors) to perform the classification, as illustrated next.
Nearest-Neighbor Classification
[Figure: compute the distance from the test record X to the stored records; its 1-, 2-, and 3-nearest neighbors determine the label.]
• Distance between two points, e.g. the Euclidean distance:
  d(p, q) = sqrt( Σ_i (p_i − q_i)² )
• With high-dimensional 0/1 data this can behave counter-intuitively:
  111111111110 vs 011111111111: d = 1.4142
  100000000000 vs 000000000001: d = 1.4142
  Both pairs are at the same Euclidean distance, although the vectors in the first pair share ten 1s and the vectors in the second pair share none.
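A minimal k-nearest-neighbor sketch (not part of the original slides) using the Euclidean distance above; the toy records and the choice k = 3 are made-up values.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2): distance from x to every stored record
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                      # indices of the k closest records
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote among the neighbors

# Toy usage: two numeric attributes, classes "Yes"/"No" (made-up data)
X_train = np.array([[125.0, 1.0], [100.0, 0.0], [70.0, 0.0], [60.0, 0.0], [95.0, 1.0]])
y_train = np.array(["No", "No", "No", "No", "Yes"])
print(knn_predict(X_train, y_train, np.array([90.0, 1.0]), k=3))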
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data.
[Figure: several candidate separating hyperplanes (B1, B2); B1 has the wider margin, bounded by the parallel hyperplanes b11 and b12.]
• Decision boundary: w · x + b = 0
• Margin hyperplanes: w · x + b = 1 and w · x + b = −1
• Decision function:
  f(x) =  1 if w · x + b ≥ 1
  f(x) = −1 if w · x + b ≤ −1
• Margin = 2 / ||w||
Support Vector Machines
• We want to maximize: Margin = 2 / ||w||
• Which is equivalent to minimizing: L(w) = ||w||² / 2
• But subject to the following constraints:
  w · x_i + b ≥ 1 if y_i = 1
  w · x_i + b ≤ −1 if y_i = −1
[Figure: a point on the wrong side of its margin hyperplane satisfies w · x_i + b = −1 + ξ_i and lies at distance ξ_i / ||w|| from it.]
Support Vector Machines
• What if the problem is not linearly separable?
• Introduce slack variables ξ_i ≥ 0
• Need to minimize:
  L(w) = ||w||² / 2 + C Σ_{i=1..N} ξ_i^k
• Subject to:
  w · x_i + b ≥ 1 − ξ_i if y_i = 1
  w · x_i + b ≤ −1 + ξ_i if y_i = −1
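As an illustrative sketch (not from the slides), the soft-margin objective with k = 1, L(w) = ||w||²/2 + C Σ_i max(0, 1 − y_i(w · x_i + b)), can be minimized by subgradient descent; the toy data, learning rate, and iteration count below are arbitrary choices.

import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=2000):
    # Subgradient descent on ||w||^2 / 2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)                 # y_i (w . x_i + b)
        viol = margins < 1                        # records with nonzero slack xi_i
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage with 2-D data; labels must be coded as +1 / -1
X = np.array([[2.0, 2.0], [1.5, 2.5], [0.0, 0.5], [0.5, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y, C=10.0)
print(np.sign(X @ w + b))                         # predicted labels for the training points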
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
• Transform the data into a higher-dimensional space in which a linear separator can be found.
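A hedged example of such a transformation done implicitly via a kernel, assuming scikit-learn is available; the circular toy data is made up.

import numpy as np
from sklearn.svm import SVC

# Toy data that is not linearly separable: the class depends on the distance from the origin
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF kernel: implicit high-dimensional mapping
clf.fit(X, y)
print(clf.score(X, y))                          # training accuracy of the nonlinear boundary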
Bayesian Classifiers
• Bayes' Theorem:
  P(C | A) = P(A | C) P(C) / P(A)
Bayesian Classifiers
• How to classify the new record X = (Refund = Yes, Marital Status = Single, Taxable Income = 80K)?

Training data:
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Random variables:
• Evade (class C): event space {Yes, No}, P(C) = (0.3, 0.7)
• Refund (A1): event space {Yes, No}, P(A1) = (0.3, 0.7)
• Marital Status (A2): event space {Single, Married, Divorced}, P(A2) = (0.4, 0.4, 0.2)
• Taxable Income (A3): event space ℝ, P(A3) ~ Normal(μ, σ²)
• By Bayes' theorem:
  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
• Maximizing
  P(C | A1, A2, …, An)
  is equivalent to maximizing
  P(A1, A2, …, An | C) P(C)
• The value P(A1, A2, …, An) is the same for all values of C.
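Combining this with the conditional-independence assumption summarized at the end of the lecture, the decision rule the following slides apply can be stated compactly (a restatement, not new material from the slides):

\[
\hat{C} \;=\; \arg\max_{C} P(C \mid A_1,\dots,A_n)
        \;=\; \arg\max_{C}\; P(C)\prod_{i=1}^{n} P(A_i \mid C)
\]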
How to Estimate Probabilities from Data?
• Discrete attributes:
  P(Ai = a | C = c) = N_{a,c} / N_c
  where N_{a,c} is the number of instances having attribute value Ai = a and belonging to class c,
  and N_c is the number of instances of class c.
• From the training table above: P(Refund = Yes | No) = 3/7
How to Estimate Probabilities from Data?
• Further discrete-attribute estimates from the same table:
  P(Refund = Yes | Yes) = 0/3 = 0
  P(Status = Single | No) = 2/7
  P(Status = Single | Yes) = 2/3
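A small counting sketch (not part of the slides) showing how these estimates come directly from the table; the records list below simply re-enters the training data.

records = [
    # (Refund, Marital Status, Taxable Income, Evade)
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",    90, "Yes"),
]

def cond_prob(attr_index, value, cls):
    # P(A_i = value | C = cls) = N_{a,c} / N_c, estimated by counting
    in_class = [r for r in records if r[3] == cls]               # N_c records
    matching = [r for r in in_class if r[attr_index] == value]   # N_{a,c} records
    return len(matching) / len(in_class)

print(cond_prob(0, "Yes", "No"))      # P(Refund = Yes | No)     = 3/7
print(cond_prob(1, "Single", "Yes"))  # P(Status = Single | Yes) = 2/3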
How to Estimate Probabilities from Data?
• Continuous attributes – assume a normal distribution:
  P(Ai = a | c_j) = 1 / sqrt(2π σ_ij²) · exp( −(a − μ_ij)² / (2 σ_ij²) )
• One (μ_ij, σ_ij²) pair for each (attribute, class) pair.
• For Taxable Income with Class = No:
  sample mean μ = 110, sample variance σ² = 2975 (σ ≈ 54.54)
• For Income = 80:
  P(Income = 80 | No) = 1 / (sqrt(2π) · 54.54) · exp( −(80 − 110)² / (2 · 2975) ) ≈ 0.0062
How to Estimate Probabilities from Data?
• Normal distribution, for Taxable Income with Class = Yes:
  sample mean μ = 90, sample variance σ² = 25 (σ = 5)
• For Income = 80:
  P(Income = 80 | Yes) = 1 / (sqrt(2π) · 5) · exp( −(80 − 90)² / (2 · 25) ) ≈ 0.01
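The same densities can be checked numerically; a brief sketch (not from the slides):

import math

def normal_pdf(a, mu, var):
    # P(Ai = a | c_j) under a Normal(mu, var) model
    return math.exp(-(a - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(normal_pdf(80, mu=110, var=2975))  # P(Income = 80 | No)  ≈ 0.0062
print(normal_pdf(80, mu=90,  var=25))    # P(Income = 80 | Yes) ≈ 0.01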
Example
• Record:
  X = (Refund = Yes, Status = Single, Income = 80K)
• We compute the (unnormalized) scores:
• P(C = Yes | X) ∝ P(C = Yes) · P(Refund = Yes | C = Yes)
                  · P(Status = Single | C = Yes)
                  · P(Income = 80K | C = Yes)
                  = 3/10 · 0 · 2/3 · 0.01 = 0
• P(C = No | X) ∝ P(C = No) · P(Refund = Yes | C = No)
                 · P(Status = Single | C = No)
                 · P(Income = 80K | C = No)
                 = 7/10 · 3/7 · 2/7 · 0.0062 ≈ 0.0005
• The score for C = No is larger, so X is classified as Evade = No.
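A sketch of the same computation in code (not part of the slides), reusing the illustrative cond_prob and normal_pdf helpers defined in the earlier sketches:

def nb_score(cls, refund, status, income, prior):
    # Unnormalized naive Bayes score: P(C) * prod_i P(A_i | C)
    mu, var = (90, 25) if cls == "Yes" else (110, 2975)   # per-class Income model
    return (prior
            * cond_prob(0, refund, cls)
            * cond_prob(1, status, cls)
            * normal_pdf(income, mu, var))

print(nb_score("Yes", "Yes", "Single", 80, prior=0.3))  # 0.0 (Refund=Yes never occurs with Evade=Yes)
print(nb_score("No",  "Yes", "Single", 80, prior=0.7))  # ≈ 0.0005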
Example of Naïve Bayes Classifier
• Building a Naïve Bayes classifier essentially means computing counts:
  Total number of records: N = 10
• P(X | Class = No) = P(Refund = Yes | Class = No)
                    · P(Single | Class = No)
                    · P(Income = 80K | Class = No)
                    = 3/7 · 2/7 · 0.0062 ≈ 0.00075
• Naïve Bayes for text documents:
  P(c) = fraction of documents that belong to class c
  P(w | c) = fraction of terms, over all documents of class c, that are the word w
• Laplace Smoothing:
  P(w | c) = ( number of times w appears in all documents of c + 1 ) /
             ( total number of terms in all documents of c + number of unique words |V| (vocabulary size) )
• Equivalently: there is an automaton for each class spitting out words from the above distribution.
Example
• News titles for Politics and Sports:
  Politics (P(p) = 0.5):
    "Obama meets Merkel", "Obama elected again", "Merkel visits Greece again"
    terms (10 in total): obama:2, meets:1, merkel:2, elected:1, again:2, visits:1, greece:1
  Sports (P(s) = 0.5):
    "OSFP European basketball champion", "Miami NBA basketball champion", "Greece basketball coach?"
    terms (11 in total): OSFP:1, european:1, basketball:3, champion:2, miami:1, nba:1, greece:1, coach:1
  Vocabulary size: 14
• New title X = "Obama likes basketball":
  P(Sports | X) ∝ P(s) · P(obama | s) · P(likes | s) · P(basketball | s)
               = 0.5 · 1/(11+14) · 1/(11+14) · 4/(11+14) = 0.000128
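A minimal sketch (not from the slides) that reproduces this number with Laplace smoothing; the document lists and the title X are taken from the example above.

from collections import Counter

docs = {
    "politics": ["obama meets merkel", "obama elected again", "merkel visits greece again"],
    "sports":   ["osfp european basketball champion", "miami nba basketball champion",
                 "greece basketball coach"],
}

term_counts = {c: Counter(w for d in ds for w in d.split()) for c, ds in docs.items()}
vocab = {w for counts in term_counts.values() for w in counts}                 # 14 unique words
priors = {c: len(ds) / sum(len(d) for d in docs.values()) for c, ds in docs.items()}

def score(title, c):
    # P(c) * prod_w (count(w, c) + 1) / (total terms in c + |V|)   (Laplace smoothing)
    total = sum(term_counts[c].values())
    s = priors[c]
    for w in title.lower().split():
        s *= (term_counts[c][w] + 1) / (total + len(vocab))
    return s

print(score("Obama likes basketball", "sports"))    # 0.000128
print(score("Obama likes basketball", "politics"))  # compare with the Politics score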
Naïve Bayes (Summary)
• Robust to isolated noise points
• Model assumption: the attributes A1, A2, …, An are conditionally independent given the class C.
[Figure: graphical model with C as the parent of A1, A2, …, An.]
Google lecture:
Theorizing from the Data
Apply-Test
• How do you scale to very large datasets?
• Distributed computing – MapReduce implementations of machine learning algorithms (e.g., Apache Mahout, running on top of Hadoop)