
MACHINE LEARNING – UNIT 2

OUTLINE
 Logistic regression

 Exponential family

 Naïve Bayes

 Support vector machines

 Combining classifiers: Bagging, boosting (the AdaBoost algorithm)

 Evaluating and debugging learning algorithms

 Classification errors.
LOGISTIC REGRESSION
 Why use logistic regression?
 Estimation by maximum likelihood
 Interpreting coefficients
 Hypothesis testing
 Evaluating the performance of the model
WHY USE LOGISTIC REGRESSION?

 There are many important research topics for
which the dependent variable is "limited."
 For example, voting, morbidity or mortality, and
participation data are not continuous or
normally distributed.
 Binary logistic regression is a type of
regression analysis where the dependent
variable is a dummy variable: coded 0 (did not
vote) or 1 (did vote).
THE LINEAR PROBABILITY MODEL
In the OLS regression:
Y = α + βX + e ;  where Y = (0, 1)
 The error terms are heteroskedastic
 e is not normally distributed because Y takes on
only two values
 The predicted probabilities can be greater than 1 or
less than 0
THE LOGISTIC REGRESSION MODEL
The "logit" model:

ln[p/(1-p)] = α + βX + e

 p is the probability that the event Y occurs, p(Y=1)
 p/(1-p) is the "odds ratio"
 ln[p/(1-p)] is the log odds ratio, or "logit"
More:
 The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
 The estimated probability is:

p = 1/[1 + exp(-α - βX)]

 if you let α + βX = 0, then p = 0.50
 as α + βX gets really big, p approaches 1
 as α + βX gets really small, p approaches 0
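A small sketch of this behaviour in Python (the values of α and β below are made up for illustration, not estimates):

```python
# Illustrative only: alpha and beta are arbitrary coefficients, not fitted values.
import numpy as np

def logistic(alpha, beta, x):
    """Estimated probability p = 1 / (1 + exp(-(alpha + beta*x)))."""
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

alpha, beta = -1.0, 2.0
print(logistic(alpha, beta, 0.5))      # alpha + beta*x = 0   -> p = 0.50
print(logistic(alpha, beta, 10.0))     # large positive logit -> p approaches 1
print(logistic(alpha, beta, -10.0))    # large negative logit -> p approaches 0
```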
COMPARING LP AND LOGIT MODELS

(Figure: predicted probability (0 to 1) versus X. The LP model is a straight line that can leave the 0-1 range; the logit model is an S-shaped curve bounded between 0 and 1.)
MAXIMUM LIKELIHOOD ESTIMATION (MLE)
 MLE is a statistical method for estimating the
coefficients of a model.
 The likelihood function (L) measures the
probability of observing the particular set of
dependent variable values (p1, p2, ..., pn) that
occur in the sample:
L = Prob(p1 × p2 × ... × pn)
 The higher the L, the higher the probability of
observing the ps in the sample.
 MLE involves finding the coefficients (α, β)
that make the log of the likelihood function
(LL < 0) as large as possible
 Or, finding the coefficients that make -2 times
the log of the likelihood function (-2LL) as
small as possible
 The maximum likelihood estimates solve the
following condition:

Σ {Yi - p(Yi=1)} Xi = 0

summed over all observations, i = 1,…,n
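A minimal sketch (not from the slides) of MLE for logistic regression by gradient ascent on the log likelihood; the gradient is exactly the score condition above, Σ (Yi - pi) Xi. The toy data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """X: (n, p) features, y: (n,) array of 0/1 labels. Returns (alpha, beta)."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])           # prepend an intercept column
    coef = np.zeros(p + 1)                         # (alpha, beta_1, ..., beta_p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Xb @ coef))    # p(Y=1 | X)
        grad = Xb.T @ (y - prob)                   # score: sum_i (y_i - p_i) x_i
        coef += lr * grad / n                      # at the MLE this gradient is ~0
    return coef[0], coef[1:]

# Toy data generated from a known logistic model (alpha = 0.5, beta = 2.0):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * X[:, 0])))
y = (rng.random(200) < p_true).astype(float)
print(fit_logistic_mle(X, y))                      # should land near (0.5, 2.0)
```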


INTERPRETING COEFFICIENTS
 Since: ln[p/(1-p)] = α + βX + e

The slope coefficient (β) is interpreted as the rate of
change in the "log odds" as X changes.
 Since:

p = 1/[1 + exp(-α - βX)]

 An interpretation of the logit coefficient which is
usually more intuitive is the "odds ratio"
 Since: [p/(1-p)] = exp(α + βX)

exp(β) is the effect of the independent variable on the
"odds ratio"
LDA VS. LOGISTIC
REGRESSION
 LDA (Generative model)
 Assumes Gaussian class-conditional densities and a common covariance
 Model parameters are estimated by maximizing the full log likelihood,
parameters for each class are estimated independently of other classes,
Kp+p(p+1)/2+(K-1) parameters
 Makes use of marginal density information Pr(X)
 Easier to train, low variance, more efficient if model is correct
 Higher asymptotic error, but converges faster
 Logistic Regression (Discriminative model)
 Assumes class-conditional densities are members of the (same) exponential
family distribution
 Model parameters are estimated by maximizing the conditional log likelihood,
simultaneous consideration of all other classes, (K-1)(p+1) parameters
 Ignores marginal density information Pr(X)
 Harder to train, robust to uncertainty about the data generation process
 Lower asymptotic error, but converges more slowly
EXPONENTIAL FAMILY
EXPONENTIAL FAMILY - GAUSSIAN
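As a brief reference (standard definitions, not the slides' own notation), a distribution is in the exponential family if it can be written as

p(y; η) = b(y) exp( ηᵀT(y) - a(η) )

Bernoulli (the distribution behind logistic regression):
p(y; φ) = φ^y (1-φ)^(1-y) = exp( y·ln[φ/(1-φ)] + ln(1-φ) ),
so η = ln[φ/(1-φ)] and φ = 1/(1 + e^(-η)), the logistic function.

Gaussian with unit variance (behind LDA's class-conditional densities):
p(y; μ) = (1/√(2π)) exp( -(y-μ)²/2 ) = (1/√(2π)) e^(-y²/2) · exp( μy - μ²/2 ),
so η = μ, T(y) = y, a(η) = η²/2.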
GENERATIVE MODEL-SUPERVISED
GAUSSIAN
GENERATIVE MODEL-UNSUPERVISED
DISCRIMINATIVE MODEL SUPERVISED
GENERATIVE VS. DISCRIMINATIVE LEARNING

Generative (example: Linear Discriminant Analysis)
 Objective function: full log likelihood, Σi log p(xi, yi)
 Model assumptions: class densities p(x | y = k), e.g. Gaussian in LDA
 Parameter estimation: "easy" – one single sweep
 Advantages: more efficient if the model is correct, borrows strength from p(x)
 Disadvantages: biased if the model is incorrect

Discriminative (example: Logistic Regression)
 Objective function: conditional log likelihood, Σi log p(yi | xi)
 Model assumptions: discriminant functions δk(x)
 Parameter estimation: "hard" – iterative optimization
 Advantages: more flexible, robust because fewer assumptions
 Disadvantages: may also be biased, ignores information in p(x)
NAÏVE BAYES
• Bayes classification
P(c|x) ∝ P(x|c)P(c) = P(x1, …, xn|c)P(c)  for c = c1, ..., cL.
Difficulty: learning the joint probability P(x1, …, xn|c) is infeasible!

• Naïve Bayes classification
– Assume all input features are class conditionally independent!
Applying the independence assumption:
P(x1, x2, …, xn|c) = P(x1|x2, …, xn, c) P(x2, …, xn|c)
                   = P(x1|c) P(x2, …, xn|c)
                   = P(x1|c) P(x2|c) … P(xn|c)
– Apply the MAP classification rule: assign x' = (a1, a2, …, an) to c* if
[P(a1|c*) … P(an|c*)] P(c*) > [P(a1|c) … P(an|c)] P(c),  c ≠ c*, c = c1, ..., cL
(the left-hand side is the estimate of P(a1, …, an|c*), the right-hand side the estimate of P(a1, …, an|c))

NAÏVE BAYES
• Learning phase: for each target value ci (ci = c1, …, cL)
P̂(ci) ← estimate P(ci) with examples in S;
For every feature value xjk of each feature xj (j = 1, …, F; k = 1, …, Nj)
P̂(xj = xjk|ci) ← estimate P(xjk|ci) with examples in S;
• Test phase: given x' = (a1, …, an), assign x' to c* if
[P̂(a1|c*) … P̂(an|c*)] P̂(c*) > [P̂(a1|ci) … P̂(an|ci)] P̂(ci),  ci ≠ c*, ci = c1, …, cL
EXAMPLE
• Example: Play Tennis
EXAMPLE
• Learning Phase
Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5

P(Play=Yes) = 9/14 P(Play=No) = 5/14


EXAMPLE
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– Decision making with the MAP rule

P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ≈ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
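A minimal Python sketch of this exact calculation; the probabilities are the learning-phase estimates from the tables above.

```python
# Learning-phase estimates from the Play-Tennis tables above.
cond = {
    "Yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9,
            "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5,
            "Humidity=High": 4/5, "Wind=Strong": 3/5},
}
prior = {"Yes": 9/14, "No": 5/14}
x_new = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

scores = {}
for c in prior:
    score = prior[c]
    for feature_value in x_new:
        score *= cond[c][feature_value]   # naive (conditional independence) product
    scores[c] = score

print(scores)                             # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))        # MAP decision: 'No'
```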


NAÏVE BAYES
• Algorithm: Continuous-valued Features
– A continuous-valued feature can take infinitely many values
– The conditional probability is often modeled with the normal
distribution:
P̂(xj | ci) = 1/(√(2π)·σji) · exp( -(xj - μji)² / (2σji²) )
μji : mean (average) of the feature values xj of examples for which c = ci
σji : standard deviation of the feature values xj of examples for which c = ci
– Learning Phase: for X = (X1, …, Xn), C = c1, …, cL,
output n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: given an unknown instance X’ = (a1, …, an)
• Instead of looking up tables, calculate conditional probabilities with all
the normal distributions obtained in the learning phase
• Apply the MAP rule to assign a label (the same as done for the discrete
case)
NAÏVE BAYES
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:
μ = (1/N) Σn xn ,   σ² = (1/N) Σn (xn - μ)²
μYes = 21.64, σYes = 2.35
μNo = 23.88, σNo = 7.09
– Learning Phase: output two Gaussian models for P(temp|C)
P̂(x|Yes) = 1/(2.35·√(2π)) · exp( -(x - 21.64)² / (2 · 2.35²) )
P̂(x|No)  = 1/(7.09·√(2π)) · exp( -(x - 23.88)² / (2 · 7.09²) )
ZERO CONDITIONAL PROBABILITY

• If no example contains the feature value


– In this circumstance, we face a zero conditional probability
problem during test
P̂(x1|ci) · … · P̂(ajk|ci) · … · P̂(xn|ci) = 0  whenever xj = ajk and P̂(ajk|ci) = 0
– As a remedy, the class conditional probabilities are re-estimated with the m-estimate:

P̂(ajk|ci) = (nc + m·p) / (n + m)

nc : number of training examples for which xj = ajk and c = ci
n : number of training examples for which c = ci
p : prior estimate (usually, p = 1/t for t possible values of xj)
m : weight given to the prior (number of "virtual" examples, m ≥ 1)
ZERO CONDITIONAL PROBABILITY

• Example: P(outlook=overcast|no)=0 in the play-tennis


dataset
– Adding m “virtual” examples (m: up to 1% of the number of training
examples)
• In this dataset, the number of training examples for the “no” class is 5.
• We can only add m = 1 “virtual” example in our m-estimate remedy.
– The “outlook” feature can take only 3 values, so p = 1/3.
– Re-estimate P(outlook|no) with the m-estimate, as worked out below.
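Worked out with the numbers above (nc = 0, n = 5, m = 1, p = 1/3):

P̂(Outlook=Overcast | No) = (nc + m·p) / (n + m) = (0 + 1·(1/3)) / (5 + 1) = 1/18 ≈ 0.056

The estimate is small but no longer zero, so the product of conditional probabilities no longer collapses to 0.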
SUPPORT VECTOR MACHINES
The Scalar Product
The scalar or dot product is, in some sense, a measure of
similarity:

a · b = |a| |b| cos θ
DECISION FUNCTION FOR BINARY CLASSIFICATION
f(x) ∈ R
f(xi) ≥ 0 ⇒ yi = +1
f(xi) < 0 ⇒ yi = -1
SUPPORT VECTOR MACHINES

 SVMs pick best separating hyperplane according


to some criterion
e.g. maximum margin
 Training process is an optimisation

 Training set is effectively reduced to a relatively


small number of support vectors
FEATURE SPACES
 We may separate data by mapping to a higher-
dimensional feature space
The feature space may even have an infinite
number of dimensions!
 We need not explicitly construct the new feature
space
KERNELS
 We may use kernel functions to implicitly map
to a new feature space
 Kernel fn:
K(x1, x2) ∈ R
 Kernel must be equivalent to an inner product in
some feature space
EXAMPLE KERNELS

Linear:      x · z
Polynomial:  P(x · z), a polynomial in the dot product
Gaussian:    exp( -‖x - z‖² / (2σ²) )
PERCEPTRON REVISITED: LINEAR
SEPARATORS
 Binary classification can be viewed as the task of
separating classes in feature space:
wTx + b = 0
wTx + b > 0
wTx + b < 0

f(x) = sign(wTx + b)
WHICH OF THE LINEAR SEPARATORS
IS OPTIMAL?
BEST LINEAR SEPARATOR?
FIND CLOSEST POINTS IN CONVEX HULLS

(Figure: the closest points c and d in the convex hulls of the two classes.)
PLANE BISECT CLOSEST POINTS

wTx + b = 0
w = d - c
CLASSIFICATION MARGIN
 Distance from an example x to the separator is r = (wTx + b) / ‖w‖
 Data closest to the hyperplane are support vectors.
 Margin ρ of the separator is the width of separation between
classes.

(Figure: the margin ρ and the distance r from an example to the separator.)
MAXIMUM MARGIN CLASSIFICATION
 Maximizing the margin is good according to intuition and
theory.
 Implies that only support vectors are important; other training
examples are ignorable.
STATISTICAL LEARNING THEORY
 Misclassification error and the function
complexity bound generalization error.
 Maximizing margins minimizes complexity.

 “Eliminates” overfitting.

 Solution depends only on Support Vectors not


number of attributes.
MARGINS AND COMPLEXITY

Skinny margin
is more flexible
thus more complex.
MARGINS AND COMPLEXITY

Fat margin
is less complex.
LINEAR SVM MATHEMATICALLY
 Assuming all data is at distance larger than 1 from the
hyperplane, the following two constraints follow for a
training set {(xi ,yi)}
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ -1 if yi = -1

 For support vectors, the inequality becomes an
equality; then, since each support vector’s distance from the
hyperplane is r = yi(wTxi + b) / ‖w‖ = 1 / ‖w‖, the margin is:

ρ = 2 / ‖w‖
LINEAR SVMS MATHEMATICALLY
(CONT.)
 Then we can formulate the quadratic optimization
problem:

Find w and b such that
ρ = 2 / ‖w‖ is maximized and for all {(xi ,yi)}
wTxi + b ≥ 1 if yi = 1;  wTxi + b ≤ -1 if yi = -1

 Which is equivalent to:

Find w and b such that
Φ(w) = ½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1
SOLVING THE OPTIMIZATION PROBLEM
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

 Quadratic optimization problems are a well-known class of


mathematical programming problems, and many (rather
intricate) algorithms exist for solving them.
 The solution involves constructing a dual problem where a
Lagrange multiplier αi is associated with every constraint in the
primal problem:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
THE OPTIMIZATION PROBLEM
SOLUTION
 The solution has the form:

w = Σαiyixi        b = yk - wTxk for any xk such that αk > 0

 Each non-zero αi indicates that the corresponding xi is a support vector.
 Then the classifying function will have the form:
f(x) = ΣαiyixiTx + b

 Notice that it relies on an inner product between the test point x


and the support vectors xi – we will return to this later!
 Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points!
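A sketch of this structure using scikit-learn (the library and the toy data are assumptions, not from the slides): dual_coef_ stores αi·yi for the support vectors only, so w = Σαiyixi can be recovered directly and compared with the fitted coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, well-separated 2-D data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)           # very large C approximates the hard margin
clf.fit(X, y)

w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i
print(w, clf.coef_)                         # the two agree
print(clf.intercept_)                       # b
print(len(clf.support_vectors_))            # typically only a few support vectors
```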
SOFT MARGIN CLASSIFICATION
 What if the training set is not linearly separable?
 Slack variables ξi can be added to allow misclassification

of difficult or noisy examples.

SOFT MARGIN CLASSIFICATION
MATHEMATICALLY
 The old formulation:
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1

 The new formulation incorporating slack variables:


Find w and b such that
Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (wTxi + b) ≥ 1- ξi and ξi ≥ 0 for all i

 Parameter C can be viewed as a way to control overfitting.


SOFT MARGIN CLASSIFICATION –
SOLUTION
 The dual problem for soft margin classification:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

 Neither slack variables ξi nor their Lagrange multipliers


appear in the dual problem!
 Again, xi with non-zero αi will be support vectors.

 Solution to the dual problem is:

w = Σαiyixi
b = yk(1 - ξk) - wTxk  where k = argmaxk αk
f(x) = ΣαiyixiTx + b
THEORETICAL JUSTIFICATION FOR
MAXIMUM MARGINS
 Vapnik has proved the following:
The class of optimal linear separators has VC dimension h bounded
from above as

h ≤ min( ⌈D²/ρ²⌉, m0 ) + 1

where ρ is the margin, D is the diameter of the smallest sphere that
can enclose all of the training examples, and m0 is the
dimensionality.

 Intuitively, this implies that regardless of dimensionality m0 we can


minimize the VC dimension by maximizing the margin ρ.

 Thus, complexity of the classifier is kept small regardless of


dimensionality.
LINEAR SVMS: OVERVIEW
 The classifier is a separating hyperplane.
 Most “important” training points are support vectors; they
define the hyperplane.
 Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
 Both in the dual formulation of the problem and in the solution,
training points appear only inside inner products:
f(x) = ΣαiyixiTx + b
EXAMPLE
NON-LINEAR SVMS
 Datasets that are linearly separable with some noise
work out great:
(Figure: 1-D data on the x axis, separable by a single threshold.)
 But what are we going to do if the dataset is just too
hard?
(Figure: 1-D data whose classes interleave along the x axis.)
 How about… mapping data to a higher-dimensional
space:
(Figure: the same data mapped to (x, x²) becomes linearly separable.)
NONLINEAR CLASSIFICATION

x   a, b 
xw  w1a  w2b

 ( x)   a, b, ab, a , b 
2 2

 ( x)w  w1a  w2b  w3ab  w4 a  w5b


2 2
NON-LINEAR SVMS: FEATURE SPACES
 General idea: the original feature space can always be
mapped to some higher-dimensional feature space where
the training set is separable:
Φ: x → φ(x)
THE “KERNEL TRICK”
 The linear classifier relies on inner product between vectors
K(xi,xj)=xiTxj
 If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
 A kernel function is some function that corresponds to an inner product
into some feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)2.
Need to show that K(xi,xj) = φ(xi)Tφ(xj):
K(xi,xj) = (1 + xiTxj)2 = 1 + xi12xj12 + 2xi1xj1xi2xj2 + xi22xj22 + 2xi1xj1 + 2xi2xj2
= [1  xi12  √2xi1xi2  xi22  √2xi1  √2xi2]T [1  xj12  √2xj1xj2  xj22  √2xj1  √2xj2]
= φ(xi)Tφ(xj), where φ(x) = [1  x12  √2x1x2  x22  √2x1  √2x2]
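A quick numeric check of this identity (the two vectors are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit feature map for K(x, z) = (1 + x^T z)^2 in two dimensions."""
    v1, v2 = v
    return np.array([1.0, v1**2, np.sqrt(2)*v1*v2, v2**2, np.sqrt(2)*v1, np.sqrt(2)*v2])

x = np.array([0.7, -1.2])
z = np.array([2.0, 0.5])

lhs = (1.0 + x @ z) ** 2     # kernel evaluated in the original 2-D space
rhs = phi(x) @ phi(z)        # inner product in the explicit 6-D feature space
print(lhs, rhs)              # identical up to floating-point error
```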
Positive Definite Matrices
A square matrix A is positive definite if xTAx>0 for
all nonzero column vectors x.

It is negative definite if xTAx < 0 for all nonzero x.

It is positive semi-definite if xTAx  0.

And negative semi-definite if xTAx  0 for all x.


WHAT FUNCTIONS ARE KERNELS?
 For some functions K(xi,xj) checking that K(xi,xj)= φ(xi)
T
φ(xj) can be cumbersome.
 Mercer’s theorem:
Every positive semi-definite symmetric function is a
kernel
 Positive semi-definite symmetric functions correspond
to a positive semi-definite symmetric Gram matrix:

K = | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xN) |
    | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xN) |
    | …         …         …         …  …        |
    | K(xN,x1)  K(xN,x2)  K(xN,x3)  …  K(xN,xN) |
EXAMPLES OF KERNEL FUNCTIONS
 Linear: K(xi,xj) = xiTxj

 Polynomial of power p: K(xi,xj) = (1 + xiTxj)p

 Gaussian (radial-basis function network):
K(xi,xj) = exp( -‖xi - xj‖² / (2σ²) )

 Two-layer perceptron: K(xi,xj) = tanh(β0xiTxj + β1)


NON-LINEAR SVMS MATHEMATICALLY
 Dual problem formulation:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

 The solution is:
f(x) = ΣαiyiK(xi, x) + b

 Optimization techniques for finding αi’s remain the same!


EXAMPLE
SVM APPLICATIONS
 SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.

 SVMs are currently among the best performers for a number of


classification tasks ranging from text to genomic data.

 SVM techniques have been extended to a number of tasks such as


regression [Vapnik et al. ’97], principal component analysis
[Schölkopf et al. ’99], etc.

 Most popular optimization algorithms for SVMs are SMO [Platt ’99]
and SVMlight [Joachims’ 99], both use decomposition to hill-climb over
a subset of αi’s at a time.

 Tuning SVMs remains a black art: selecting a specific kernel and


parameters is usually done in a try-and-see manner.
SVM EXTENSIONS
 Regression

 Variable Selection
 Boosting

 Density Estimation

 Unsupervised Learning
Novelty/Outlier Detection
Feature Detection
Clustering
LEARNING ENSEMBLES
 Learn multiple alternative definitions of a concept using
different training data or different learning algorithms.
 Combine decisions of multiple definitions, e.g. using
weighted voting.
(Diagram) Training Data → Data1, Data2, …, Data m
→ Learner1, Learner2, …, Learner m → Model1, Model2, …, Model m
→ Model Combiner → Final Model


VALUE OF ENSEMBLES
 When combining multiple independent and diverse
decisions, each of which is at least more accurate than
random guessing, random errors cancel each other out and
correct decisions are reinforced.
 Human ensembles are demonstrably better
 How many jelly beans in the jar?: Individual estimates vs.
group average.
 Who Wants to be a Millionaire: Expert friend vs. audience
vote.
HOMOGENEOUS ENSEMBLES
 Use a single, arbitrary learning algorithm but
manipulate training data to make it learn
multiple models.
 Data1  Data2  …  Data m
 Learner1 = Learner2 = … = Learner m
 Different methods for changing training data:
 Bagging: Resample training data
 Boosting: Reweight training data
 DECORATE: Add additional artificial training data
 In WEKA, these are called meta-learners, they
take a learning algorithm as an argument (base
learner) and create a new learning algorithm.
BAGGING
 Create ensembles by repeatedly randomly resampling the
training data (Breiman, 1996).
 Given a training set of size n, create m samples of size n
by drawing n examples from the original data, with
replacement.
 Each bootstrap sample will on average contain 63.2% of the
unique training examples, the rest are replicates.
 Combine the m resulting models using simple majority
vote.
 Decreases error by decreasing the variance in the results
due to unstable learners, algorithms (like decision trees)
whose output can change dramatically when the training
data is slightly changed.
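A minimal bagging sketch along these lines. The scikit-learn decision tree is an assumed (unstable) base learner, and labels are assumed to be -1/+1 so that a majority vote can be taken with the sign of the summed predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=25, seed=0):
    """Train m trees, each on a bootstrap sample of size n drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)     # bootstrap sample: ~63.2% unique examples
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the m models by simple (unweighted) majority vote."""
    votes = np.stack([model.predict(X) for model in models])   # shape (m, n_samples)
    return np.sign(votes.sum(axis=0))
```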
BOOSTING
 Originally developed by computational learning theorists
to guarantee performance improvements on fitting
training data for a weak learner that only needs to
generate a hypothesis with a training accuracy greater
than 0.5 (Schapire, 1990).
 Revised to be a practical algorithm, AdaBoost, for
building ensembles that empirically improves
generalization performance (Freund & Shapire, 1996).
 Examples are given weights. At each iteration, a new
hypothesis is learned and the examples are reweighted to
focus the system on examples that the most recently
learned classifier got wrong.
BOOSTING: BASIC ALGORITHM
 General Loop:
Set all examples to have equal uniform weights.
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples
Decrease the weights of examples ht classifies correctly
 Base (weak) learner must focus on correctly classifying the
most highly weighted examples while strongly avoiding over-
fitting.
 During testing, each of the T hypotheses get a weighted vote
proportional to their accuracy on the training data.
ADABOOST PSEUDOCODE
TrainAdaBoost(D, BaseLearn)
For each example di in D let its weight wi=1/|D|
Let H be an empty set of hypotheses
For t from 1 to T do:
Learn a hypothesis, ht, from the weighted examples: ht=BaseLearn(D)
Add ht to H
Calculate the error, εt, of the hypothesis ht as the total sum weight of the
examples that it classifies incorrectly.
If εt > 0.5 then exit loop, else continue.
Let βt = εt / (1 – εt )
Multiply the weights of the examples that ht classifies correctly by βt
Rescale the weights of all of the examples so the total sum weight remains 1.
Return H

TestAdaBoost(ex, H)
Let each hypothesis, ht, in H vote for ex’s classification with weight log(1/ βt )
Return the class with the highest weighted vote total.
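A direct Python rendering of the pseudocode above; base_learn is assumed to be a function taking (X, y, weights) and returning a fitted classifier with a .predict method.

```python
# Example of an assumed base learner (a decision stump):
#   base_learn = lambda X, y, w: DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
import numpy as np

def train_adaboost(X, y, base_learn, T=10):
    """Returns a list of (hypothesis, beta_t) pairs."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # w_i = 1/|D|
    H = []
    for _ in range(T):
        h = base_learn(X, y, w)              # learn from the weighted examples
        wrong = h.predict(X) != y
        eps = w[wrong].sum()                 # total weight of misclassified examples
        if eps > 0.5:
            break                            # exit loop, as in the pseudocode
        eps = max(eps, 1e-10)                # guard against a perfect weak hypothesis
        beta = eps / (1.0 - eps)
        w[~wrong] *= beta                    # down-weight correctly classified examples
        w /= w.sum()                         # rescale so the total weight stays 1
        H.append((h, beta))
    return H

def test_adaboost(x, H):
    """Each hypothesis votes for its predicted class with weight log(1/beta_t)."""
    votes = {}
    for h, beta in H:
        label = h.predict(np.asarray(x).reshape(1, -1))[0]
        votes[label] = votes.get(label, 0.0) + np.log(1.0 / beta)
    return max(votes, key=votes.get)
```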
LEARNING WITH WEIGHTED
EXAMPLES
 Generic approach is to replicate examples in the
training set proportional to their weights (e.g. 10
replicates of an example with a weight of 0.01
and 100 for one with weight 0.1).
 Most algorithms can be enhanced to efficiently
incorporate weights directly in the learning
algorithm so that the effect is the same (e.g.
implement the WeightedInstancesHandler
interface in WEKA).
 For decision trees, for calculating information
gain, when counting example i, simply
increment the corresponding count by wi rather
than by 1.
EXAMPLE

Original training set: equal weights to all training samples

Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire


ADABOOST EXAMPLE
ε = error rate of classifier
α = weight of classifier
ROUND 1
ADABOOST EXAMPLE
ROUND 2
ADABOOST EXAMPLE

ROUND 3
ADABOOST EXAMPLE
HOW IS CLASSIFIER COMBINING
DONE?
 At each stage we select the best classifier on the current iteration
and combine it with the set of classifiers learned so far

 How are the classifiers combined?


 Take the weight × output for each weak classifier, sum these up, and
compare to a threshold (very simple)

 Boosting algorithm automatically provides the appropriate weight for


each classifier and the threshold

 This version of boosting is known as the AdaBoost algorithm

 Some nice mathematical theory shows that it is in fact a very


powerful machine learning technique
EVALUATING MACHINE LEARNING
ALGORITHMS

• You have developed a machine learning approach to a certain
task, and want to validate that it actually works well (or
determine that it doesn't work well)

• Standard approach: divide the data into training and testing sets,
train the method on the training set, and report results on the testing set

• Important: the testing set is not the same as the validation set


EVALUATING MACHINE LEARNING
ALGORITHMS

The proper way to evaluate an ML algorithm


1. Break all data into training/testing sets (e.g., 70%/30%)
2. Break training set into training/validation set (e.g., 70%/30%
again)
3. Choose hyperparameters using validation set
4. (Optional) Once we have selected hyperparameters, retrain
using all the training set
5. Evaluate performance on the testing set
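A sketch of steps 1 and 2 with NumPy (the split fractions and seed are illustrative; steps 3 to 5 proceed as described above). X and y are assumed to be NumPy arrays.

```python
import numpy as np

def train_val_test_split(X, y, test_frac=0.30, val_frac=0.30, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    test, rest = idx[:n_test], idx[n_test:]        # step 1: hold out the test set
    n_val = int(val_frac * len(rest))
    val, train = rest[:n_val], rest[n_val:]        # step 2: carve out a validation set
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```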
CLASSIFIER EVALUATION METRICS: CONFUSION
MATRIX
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes             6954                 46                  7000
buy_computer = no              412                  2588                3000
Total                          7366                 2634                10000

 Given m classes, an entry, CMi,j in a confusion matrix indicates


# of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
CLASSIFIER EVALUATION METRICS:
ACCURACY, ERROR RATE, SENSITIVITY AND
SPECIFICITY
A\P    C    ¬C
C      TP   FN    P
¬C     FP   TN    N
       P’   N’    All

 Classifier Accuracy, or recognition rate: percentage of test
set tuples that are correctly classified
Accuracy = (TP + TN)/All
 Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All
 Class Imbalance Problem:
 One class may be rare, e.g. fraud, or HIV-positive
 Significant majority of the negative class and minority of
the positive class
 Sensitivity: True Positive recognition rate
 Sensitivity = TP/P
 Specificity: True Negative recognition rate
 Specificity = TN/N
CLASSIFIER EVALUATION METRICS:
PRECISION AND RECALL, AND F-MEASURES
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
Precision = TP / (TP + FP)
 Recall: completeness – what % of positive tuples did the
classifier label as positive?
Recall = TP / (TP + FN)
 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and recall
F1 = 2 × Precision × Recall / (Precision + Recall)
 Fß: weighted measure of precision and recall
 assigns ß times as much weight to recall as to precision
Fß = (1 + ß²) × Precision × Recall / (ß² × Precision + Recall)
CLASSIFIER EVALUATION METRICS: EXAMPLE

Actual class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)

 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
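Recomputing these numbers directly from the definitions, as a quick check (a sketch, not part of the slides):

```python
# Confusion matrix entries for the cancer example above.
TP, FN, FP, TN = 90, 210, 140, 9560
P, N = TP + FN, FP + TN
All = P + N

accuracy    = (TP + TN) / All          # 0.9650
error_rate  = (FP + FN) / All          # 0.0350
sensitivity = TP / P                   # 0.3000 (= recall)
specificity = TN / N                   # 0.9856
precision   = TP / (TP + FP)           # 0.3913
f1 = 2 * precision * sensitivity / (precision + sensitivity)
print(accuracy, error_rate, sensitivity, specificity, precision, f1)
```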

EVALUATING CLASSIFIER ACCURACY:
HOLDOUT & CROSS-VALIDATION METHODS
 Holdout method
 Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation

 Random sampling: a variation of holdout


 Repeat holdout k times, accuracy = avg. of the accuracies obtained
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
 At the i-th iteration, use Di as the test set and the others as the training set
 Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
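A sketch of stratified 10-fold cross-validation (scikit-learn and the decision-tree learner are assumptions; the slides describe only the procedure):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def cross_validate(X, y, k=10):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):    # the i-th fold D_i is the test set
        model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(accuracies)                     # accuracy averaged over the k folds
```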

EVALUATING CLASSIFIER ACCURACY:
BOOTSTRAP

 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is equally likely to be selected again
and re-added to the training set
 Several bootstrap methods; a common one is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since
(1 – 1/d)^d ≈ e^-1 = 0.368)
 Repeat the sampling procedure k times; the overall accuracy of the model is
the average, over the k repetitions, of 0.632 × Acc(Mi) on the test set + 0.368 × Acc(Mi) on the training set
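A quick numeric check of the 63.2% figure (d is arbitrary): sample d indices with replacement and measure the fraction of distinct tuples that enter the bootstrap sample.

```python
import numpy as np

d = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, d, size=d)          # d draws with replacement
print(np.unique(sample).size / d)            # ~0.632; the other ~36.8% form the test set
```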

DEBUGGING