Machine Learning - Unit 2
OUTLINE
Logistic regression
Exponential family
Naïve Bayes
Classification errors
LOGISTIC REGRESSION
Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model
WHY USE LOGISTIC REGRESSION?
ln[p/(1−p)] = α + βX + ε
[Figure: predicted probability versus X for the LP (linear probability) model and the Logit model; the logit curve stays bounded between 0 and 1.]
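To make the logit model concrete, here is a minimal sketch of fitting it on synthetic data with scikit-learn (the data, coefficient values, and seed are illustrative, not from the lecture; note that sklearn applies mild regularization by default, so the estimates are approximate):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))                    # one predictor
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))     # true model: alpha = 0.5, beta = 2.0
y = rng.binomial(1, p)                           # binary outcome drawn from p

model = LogisticRegression().fit(X, y)
print("alpha estimate:", model.intercept_[0])    # intercept
print("beta estimate:", model.coef_[0, 0])       # slope on X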
MAXIMUM LIKELIHOOD
ESTIMATION (MLE)
MLE is a statistical method for estimating the
coefficients of a model.
The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (p1, p2, ..., pn) that
occur in the sample:
L = Prob(p1 · p2 · ⋯ · pn)
The higher the L, the higher the probability of
observing the ps in the sample.
MLE involves finding the coefficients (α, β) that make the log of the likelihood function (LL < 0) as large as possible
Or, finds the coefficients that make -2 times
the log of the likelihood function (-2LL) as
small as possible
The maximum likelihood estimates solve the
following condition:
Σi {Yi − p(Yi = 1)} Xi = 0
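A minimal sketch of the estimation itself, minimizing −LL with scipy on synthetic data (the data and seed are illustrative); at the optimum the score condition above holds:

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    alpha, beta = theta
    p = 1 / (1 + np.exp(-(alpha + beta * X)))    # predicted P(Y=1)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
X = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 2.0 * X))))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(X, y))
alpha_hat, beta_hat = res.x
p_hat = 1 / (1 + np.exp(-(alpha_hat + beta_hat * X)))
print(np.sum((y - p_hat) * X))                   # score condition: approximately 0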
Generative example: Linear Discriminant Analysis
Discriminative example: Logistic Regression
Naïve Bayes MAP rule: assign the class c* if
[Pˆ(a1|c*) ⋯ Pˆ(an|c*)] Pˆ(c*) > [Pˆ(a1|ci) ⋯ Pˆ(an|ci)] Pˆ(ci), for all ci ≠ c*, ci ∈ {c1, …, cL}
EXAMPLE
• Example: Play Tennis
EXAMPLE
• Learning Phase
Outlook    Play=Yes  Play=No     Temperature  Play=Yes  Play=No
Sunny      2/9       3/5         Hot          2/9       2/5
Overcast   4/9       0/5         Mild         4/9       2/5
Rain       3/9       2/5         Cool         3/9       1/5

Humidity   Play=Yes  Play=No     Wind         Play=Yes  Play=No
High       3/9       4/5         Strong       3/9       3/5
Normal     6/9       1/5         Weak         6/9       2/5
• Test Phase: for x’ = (Sunny, Cool, High, Strong):
P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ≈ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since 0.0206 > 0.0053, label x’ as Play=No.
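The two scores can be reproduced directly from the tables (a minimal sketch; the class priors P(Yes) = 9/14 and P(No) = 5/14 follow from the 14-example Play Tennis data implied by the /9 and /5 denominators above):

p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(Yes)*P(Sunny|Yes)*P(Cool|Yes)*P(High|Yes)*P(Strong|Yes)
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(No)*P(Sunny|No)*P(Cool|No)*P(High|No)*P(Strong|No)
print(round(p_yes, 4))                           # 0.0053
print(round(p_no, 4))                            # 0.0206 -> predict Play=No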
For a continuous attribute such as temperature, estimate a Gaussian per class (Yes: μ = 21.64, σ = 2.35; No: μ = 23.88, σ = 7.09):
Pˆ(x|Yes) = (1 / (2.35·√(2π))) · exp(−(x − 21.64)² / (2 · 2.35²))
Pˆ(x|No) = (1 / (7.09·√(2π))) · exp(−(x − 23.88)² / (2 · 7.09²))
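A minimal sketch of evaluating these class-conditional densities (μ and σ as fitted above; the query point x = 19 is illustrative):

import math

def gaussian(x, mu, sigma):
    # density of N(mu, sigma^2) at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(gaussian(19.0, 21.64, 2.35))   # estimate of P(x = 19 | Yes)
print(gaussian(19.0, 23.88, 7.09))   # estimate of P(x = 19 | No)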
ZERO CONDITIONAL PROBABILITY
Pˆ(ajk | ci) = (nc + m·p) / (n + m)   (m-estimate)
nc: number of training examples for which xj = ajk and c = ci
n: number of training examples for which c = ci
p: prior estimate (usually, p = 1/t for t possible values of xj)
m: weight given to the prior (number of "virtual" examples, m ≥ 1)
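A minimal sketch of the m-estimate, applied to the zero count P(Overcast|No) = 0/5 from the tables above (t = 3 outlook values):

def m_estimate(n_c, n, t, m=1):
    # n_c: examples with x_j = a_jk in class c_i; n: examples in class c_i
    # p = 1/t is the prior; m weights the prior ("virtual" examples)
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

print(m_estimate(0, 5, 3))   # ~0.056: nonzero, so it no longer wipes out the product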
Recall the dot product: a · b = |a| |b| cos θ
DECISION FUNCTION FOR BINARY CLASSIFICATION
f(x) ∈ R
f(xi) ≥ 0 ⇒ yi = +1
f(xi) < 0 ⇒ yi = −1
SUPPORT VECTOR MACHINES
Linear: K(x, z) = x · z
Polynomial: K(x, z) = (x · z + 1)^p
Gaussian: K(x, z) = exp(−||x − z||² / (2σ²))
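Minimal sketches of the three kernels (the (1 + x·z)^p form of the polynomial kernel is assumed here; σ and p are free parameters):

import numpy as np

def k_linear(x, z):
    return x @ z                     # plain dot product

def k_poly(x, z, p=2):
    return (1 + x @ z) ** p          # polynomial of power p

def k_gaussian(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_linear(x, z), k_poly(x, z), k_gaussian(x, z))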
PERCEPTRON REVISITED: LINEAR
SEPARATORS
Binary classification can be viewed as the task of
separating classes in feature space:
The separating hyperplane: wTx + b = 0
Points with wTx + b > 0 are labeled +1; points with wTx + b < 0 are labeled −1.
The classifier: f(x) = sign(wTx + b)
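A minimal sketch of this decision rule (the weights w and bias b are arbitrary illustrative values):

import numpy as np

def f(x, w, b):
    return np.sign(w @ x + b)    # +1 on one side of the hyperplane, -1 on the other

w, b = np.array([2.0, -1.0]), 0.5
print(f(np.array([1.0, 1.0]), w, b))    # +1
print(f(np.array([-1.0, 2.0]), w, b))   # -1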
WHICH OF THE LINEAR SEPARATORS
IS OPTIMAL?
BEST LINEAR SEPARATOR?
FIND CLOSEST POINTS IN CONVEX HULLS
[Figure: c and d are the closest points in the two classes' convex hulls.]
PLANE BISECTS CLOSEST POINTS
wTx + b = 0, with w = d − c
CLASSIFICATION MARGIN
The distance from an example x to the separator is r = |wTx + b| / ||w||.
Examples closest to the hyperplane are the support vectors.
MAXIMUM MARGIN CLASSIFICATION
Maximizing the margin is good according to intuition and
theory.
This implies that only the support vectors matter; the other training examples can be ignored.
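A minimal sketch with scikit-learn on separable synthetic data (a large C approximates the hard margin; the data and seed are illustrative):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.support_vectors_)                   # only these points define the separator
print(2 / np.linalg.norm(clf.coef_))          # margin width 2 / ||w||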
STATISTICAL LEARNING THEORY
Misclassification error and the complexity of the function class together bound the generalization error.
Maximizing the margin minimizes complexity and thereby "eliminates" overfitting.
A skinny margin is more flexible, and thus more complex.
MARGINS AND COMPLEXITY
A fat margin is less complex.
LINEAR SVM MATHEMATICALLY
Assuming every example lies at distance at least 1 from the hyperplane, the following two constraints hold for a training set {(xi, yi)}:
wTxi + b ≥ 1 if yi = +1
wTxi + b ≤ −1 if yi = −1
SOFT MARGIN CLASSIFICATION MATHEMATICALLY
[Figure: slack variables ξi measure how far examples violate the margin.]
The old formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
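A minimal sketch contrasting hard and soft margins via the C parameter in scikit-learn (overlapping synthetic classes; values illustrative):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)   # smaller C tolerates margin violations
    print(C, len(clf.support_))                 # support-vector count typically shrinks as C grows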
[Figure: 1-D examples along the x-axis that no single threshold separates.]
How about… mapping data to a higher-dimensional space, e.g. x → (x, x²)?
[Figure: in the (x, x²) space the same examples become linearly separable.]
NONLINEAR CLASSIFICATION
x = (a, b)
x · w = w1a + w2b
Φ(x) = (a, b, ab, a², b²)
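For the closely related map Φ(x) = (a², √2·ab, b²), the feature-space inner product equals the quadratic kernel (x·z)² computed in the original space; a minimal sketch (example points are illustrative):

import numpy as np

def phi(x):
    a, b = x
    return np.array([a**2, np.sqrt(2) * a * b, b**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(x) @ phi(z))    # inner product after the explicit map: 16.0
print((x @ z) ** 2)       # quadratic kernel, no explicit map needed: 16.0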
The kernel (Gram) matrix: K = [K(xi, xj)], i, j = 1, …, N
EXAMPLES OF KERNEL FUNCTIONS
Linear: K(xi, xj) = xiTxj
The solution is:
f(x) = Σi αi yi K(xi, x) + b
The most popular optimization algorithms for SVMs are SMO [Platt ’99] and SVMlight [Joachims ’99]; both use decomposition to hill-climb over a subset of the αi’s at a time.
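A minimal sketch of evaluating the dual-form solution f(x) = Σi αi yi K(xi, x) + b with a fitted scikit-learn model (dual_coef_ stores αi·yi for the support vectors; data illustrative):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
clf = SVC(kernel="linear").fit(X, y)

x_new = np.array([0.5, 0.5])
# sum over support vectors: (alpha_i * y_i) * K(x_i, x_new), plus b
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(f, clf.decision_function([x_new])[0])   # the two values agree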
Variable Selection
Boosting
Density Estimation
Unsupervised Learning
Novelty/Outlier Detection
Feature Detection
Clustering
LEARNING ENSEMBLES
Learn multiple alternative definitions of a concept using
different training data or different learning algorithms.
Combine decisions of multiple definitions, e.g. using
weighted voting.
[Figure: ensemble architecture, where the training data feed multiple learners whose decisions are combined.]
TestAdaBoost(ex, H)
Let each hypothesis, ht, in H vote for ex’s classification with weight log(1/ βt )
Return the class with the highest weighted vote total.
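A minimal sketch of TestAdaBoost (the hypotheses and β values are illustrative stand-ins):

import math
from collections import defaultdict

def test_adaboost(ex, H, betas):
    votes = defaultdict(float)
    for h, beta in zip(H, betas):
        votes[h(ex)] += math.log(1.0 / beta)   # each h_t votes with weight log(1/beta_t)
    return max(votes, key=votes.get)           # class with the highest weighted vote total

H = [lambda x: 1 if x > 0 else -1,
     lambda x: 1 if x > 2 else -1,
     lambda x: -1]
print(test_adaboost(1.0, H, betas=[0.2, 0.4, 0.45]))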
LEARNING WITH WEIGHTED
EXAMPLES
The generic approach is to replicate examples in the training set in proportion to their weights (e.g., 10 replicates of an example with weight 0.01 and 100 replicates of one with weight 0.1).
Most algorithms can be enhanced to efficiently
incorporate weights directly in the learning
algorithm so that the effect is the same (e.g.
implement the WeightedInstancesHandler
interface in WEKA).
For decision trees, when calculating information gain, count example i by incrementing the corresponding count by wi rather than by 1.
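A minimal sketch of weighted counting in an entropy computation (names and values are illustrative):

import math
from collections import defaultdict

def weighted_entropy(labels, weights):
    counts = defaultdict(float)
    for y, w in zip(labels, weights):
        counts[y] += w                 # increment by w_i rather than by 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(weighted_entropy([1, 1, 0, 0], [0.4, 0.1, 0.3, 0.2]))   # 1.0: weighted classes are balanced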
ADABOOST EXAMPLE
[Figure: AdaBoost example over successive boosting rounds (round 3 shown).]
HOW IS CLASSIFIER COMBINING
DONE?
At each stage we select the best classifier for the current iteration and combine it with the set of classifiers learned so far.
Train the method on the training set, and report results on the test set.
CLASSIFIER EVALUATION METRICS:
PRECISION AND RECALL, AND F-MEASURES
Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive.
Recall: completeness – what % of positive tuples the classifier labeled as positive.
F-measure: the harmonic mean of precision and recall.
CLASSIFIER EVALUATION METRICS: EXAMPLE
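In place of the original example table, a minimal sketch on illustrative confusion-matrix counts (the TP/FP/FN values are made up):

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # % of predicted positives that are truly positive
    recall = tp / (tp + fn)      # % of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=140, fn=210))   # (~0.391, 0.300, ~0.340)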
EVALUATING CLASSIFIER ACCURACY:
HOLDOUT & CROSS-VALIDATION METHODS
Holdout method
Given data is randomly partitioned into two independent sets
Training set (e.g., 2/3) for model construction
Test set (e.g., 1/3) for accuracy estimation
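A minimal sketch of the holdout split with scikit-learn (using the 2/3 : 1/3 proportions above; the data are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(15, 2)   # 15 illustrative tuples
y = np.arange(15) % 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
print(len(X_train), len(X_test))   # 10 for training, 5 for testing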
EVALUATING CLASSIFIER ACCURACY:
BOOTSTRAP
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap:
A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e⁻¹ ≈ 0.368).
Repeat the sampling procedure k times; the overall accuracy of the model is:
Acc(M) = (1/k) Σi (0.632 × Acc(Mi)test_set + 0.368 × Acc(Mi)train_set)
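A minimal sketch of one .632-bootstrap round (the two accuracy numbers are illustrative stand-ins; the weighting follows the formula above):

import numpy as np

rng = np.random.default_rng(0)
d = 100
idx = rng.integers(0, d, size=d)                   # sample d times with replacement
in_bag = np.unique(idx)                            # tuples that made it into the training set
out_of_bag = np.setdiff1d(np.arange(d), in_bag)    # the rest form the test set
print(len(in_bag) / d)                             # ~0.632 of the original data

acc_test, acc_train = 0.80, 0.95                   # illustrative accuracies for round i
acc_i = 0.632 * acc_test + 0.368 * acc_train       # one term of the k-round average
print(acc_i)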
DEBUGGING