Unit 2 - SVM
[Figure: training examples plotted by Temperature vs. Humidity; one marker type = play tennis, the other = do not play tennis]
Linear Support Vector Machines
Data: {(xi, yi)}, i = 1, …, l
xi ∈ R^d
yi ∈ {-1, +1}
[Figure: linearly separable training points in the (x1, x2) plane, labelled +1 and -1]
Linear SVM 2
[Figure: separating hyperplane H with parallel hyperplanes H1 (where f(x) = +1) and H2 (where f(x) = -1)]
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A^2 + B^2).
The distance between H and H1 is:
|w•x+b| / ||w|| = 1 / ||w||
so the margin between H1 and H2 is 2 / ||w||.
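As an illustration, a minimal sketch that fits a linear SVM with scikit-learn and reports the margin width 2/||w||; the toy data and the large C (to approximate a hard margin) are assumptions made here for the example.

# Fit a linear SVM on toy data and compute the margin width 2/||w||.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled -1 and +1 (illustrative data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                   # weight vector w
b = clf.intercept_[0]              # bias b
margin = 2.0 / np.linalg.norm(w)   # distance between H1 and H2

print("w =", w, "b =", b, "margin 2/||w|| =", margin)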
Testing
Classify each new instance by a vote of the k classifiers (with equal weights)
Bootstrap distribution
• The bootstrap does not replace or add to the
original data.
• Bootstrapping:
– One original sample → B bootstrap samples
– B bootstrap samples → bootstrap distribution
• Bootstrap distributions usually approximate the
shape, spread, and bias of the actual sampling
distribution.
• …
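As a sketch of the procedure above, assuming NumPy and an illustrative original sample and statistic (the mean):

# Bootstrapping: one original sample -> B bootstrap samples -> bootstrap distribution.
import numpy as np

rng = np.random.default_rng(0)
original = np.array([2.1, 3.5, 4.0, 5.2, 6.8, 7.1, 8.3, 9.0])  # illustrative sample
B = 1000

boot_means = np.empty(B)
for i in range(B):
    # Resample the original data with replacement (same size as the original)
    resample = rng.choice(original, size=original.size, replace=True)
    boot_means[i] = resample.mean()

# The bootstrap distribution approximates the shape, spread, and bias of the
# actual sampling distribution of the mean.
print("original mean:", original.mean())
print("bootstrap mean of means:", boot_means.mean())
print("bootstrap std (spread):", boot_means.std(ddof=1))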
Bagging Example
Original:        1 2 3 4 5 6 7 8
Training set 1:  2 7 8 3 7 6 3 1
Training set 2:  7 8 5 6 4 2 7 1
Training set 3:  3 6 2 7 5 6 2 2
Training set 4:  4 5 1 4 6 4 3 8
Bagging (cont …)
• When does it help?
– When learner is unstable
• Small change to training set causes large
change in the output classifier
• True for decision trees, neural networks; not
true for k-nearest neighbor, naïve Bayesian,
class association rules
– Experimentally, bagging can help
substantially for unstable learners, may
somewhat degrade results for stable
learners
Bagging
For i = 1 .. M
Draw n* < n samples from D with replacement
Learn classifier Ci
Final classifier is a vote of C1 .. CM
Increases classifier stability/reduces variance
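A minimal sketch of this loop, assuming scikit-learn decision trees as the (unstable) base learner and a synthetic dataset in place of D:

# Bagging: draw n* < n samples with replacement, learn Ci, vote C1 .. CM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # illustrative D
rng = np.random.default_rng(0)
M = 25                       # number of classifiers
n_star = int(0.8 * len(X))   # n* < n

classifiers = []
for i in range(M):
    idx = rng.choice(len(X), size=n_star, replace=True)  # sample with replacement
    classifiers.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# Final classifier: equal-weight vote of C1 .. CM
votes = np.stack([c.predict(X) for c in classifiers])
y_vote = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (y_vote == y).mean())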
Boosting
• A family of methods:
– We only study AdaBoost (Freund &
Schapire, 1996)
• Training
– Produce a sequence of classifiers (the
same base learner)
– Each classifier is dependent on the
previous one, and focuses on the
previous one’s errors
– Examples that are incorrectly predicted
in previous classifiers are given higher
weights
• Testing
– For a test case, the results of the series
of classifiers are combined to determine
the final class of the test case.
AdaBoost
• Weighted training set: (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn)
– Non-negative weights that sum to 1
• Build a classifier ht whose accuracy on the training set is > ½ (better than random); such an ht is called a weak classifier
• Change the weights
AdaBoost algorithm
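A minimal sketch of the weight update described above, assuming scikit-learn decision stumps as the weak learner and illustrative data and T; this follows the usual AdaBoost.M1 recipe rather than any specific slide:

# AdaBoost: reweight examples so that misclassified ones get higher weight,
# then combine the weak classifiers by a weighted vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
y = np.where(y == 1, 1, -1)      # labels in {-1, +1}

n, T = len(X), 20
w = np.full(n, 1.0 / n)          # non-negative weights, sum to 1
stumps, alphas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()     # weighted training error
    if err >= 0.5:               # not better than random: stop
        break
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(-alpha * y * pred)   # raise weights of misclassified examples
    w /= w.sum()                     # renormalise so the weights sum to 1
    stumps.append(h)
    alphas.append(alpha)

# Final classifier: weighted vote of the sequence of weak classifiers
agg = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
print("training accuracy:", (np.sign(agg) == y).mean())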
Bagging, Boosting and C4.5
[Figure: scatter plots comparing C4.5's mean error rate over the 10 cross-validations: Bagged C4.5 vs. C4.5, Boosted C4.5 vs. C4.5, and Boosting vs. Bagging]
Does AdaBoost always work?
• The actual performance of boosting depends on the data and the base learner.
– It requires the base learner to be unstable, as with bagging.
• Boosting seems to be susceptible to noise.
– When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.
What is Clustering?
• Attach a label to each observation (data point) in a set
• You can think of this as "unsupervised classification"
• Clustering is alternatively called "grouping"
• Intuitively, you would want to assign the same label to data points that are "close" to each other
• Thus, clustering algorithms rely on a distance metric between
data points
• Sometimes, it is said that for clustering, the distance metric is
more important than the clustering algorithm
Distances: Quantitative Variables
Data point: xi = [xi1 … xip]^T
Some examples:
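For instance, common choices for quantitative variables include the Euclidean, Manhattan, and (more generally) Minkowski distances, written here in LaTeX:

d_{\mathrm{Euclidean}}(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}

d_{\mathrm{Manhattan}}(x_i, x_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|

d_{\mathrm{Minkowski}}(x_i, x_j) = \Big( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \Big)^{1/q}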
Types of clustering:
1. Hierarchical algorithms: these find successive clusters
using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms
begin with each element as a separate cluster and merge them
into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with the
whole set and proceed to divide it into successively smaller
clusters.
2. Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
– K-means and derivatives
– Fuzzy c-means clustering
– QT clustering algorithm
Common Distance measures:
How K-means partitions?
A partition amounts to a
Voronoi Diagram
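A minimal sketch, assuming scikit-learn and illustrative 2-D data: after K-means converges, each point sits in the cell of its nearest centroid, i.e. the partition is a Voronoi diagram of the centroids.

# K-means partition = nearest-centroid (Voronoi) assignment.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 3])])   # three illustrative blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

# Assign each point to its nearest centroid and compare with the K-means labels
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print("fraction matching the nearest-centroid rule:", (nearest == km.labels_).mean())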
• Generalized K-means
• Computationally much costlier than K-means
• Apply when dealing with categorical data
• Apply when data points are not available, but only
pair-wise distances are available
• Converges to local minimum
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
[Figure: example dendrogram in which animal splits into vertebrate and invertebrate]
• Agglomerative (bottom-up):
– Start with each document being a single cluster.
– Eventually all documents belong to the same cluster.
• Divisive (top-down):
– Start with all documents belonging to the same cluster.
– Eventually each node forms a cluster on its own.
– Could be a recursive application of k-means-like algorithms
• Clustering obtained
by cutting the
dendrogram at a
desired level: each
connected
component forms a
cluster.
Hierarchical Agglomerative Clustering
(HAC)
• Starts with each doc in a separate cluster
– then repeatedly joins the closest pair of
clusters, until there is only one cluster.
• The history of merging forms a binary tree
or hierarchy.
How to measure the distance between clusters?
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “furthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
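A minimal sketch of these variants, assuming SciPy's hierarchical clustering and a few illustrative points:

# HAC with different definitions of the "closest pair" of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [9.0, 0.0]])   # illustrative points

for method in ("single", "complete", "centroid", "average"):
    Z = linkage(X, method=method)                     # merge history (binary tree)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(method, labels)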
Single Link Agglomerative Clustering
• Use the maximum similarity of pairs:
sim(ci, cj) = max{ sim(x, y) : x ∈ ci, y ∈ cj }
[Worked example: successive single-link similarity matrices over points P1–P6 (entries such as 0.23, 0.22, 0.15, 0.37). P2 merges with P5 and P3 merges with P6; then P2P5 merges with P3P6; finally, the distance between P2P5P3P6 and P4 is small, so they are merged into one cluster, leaving {P1} and {P2, P5, P3, P6, P4}.]
Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:
sim(ci, cj) = min{ sim(x, y) : x ∈ ci, y ∈ cj }
[Worked example: the same points P1–P6 under complete link. P2 merges with P5 and P3 merges with P6 as before, but the later merges differ: P3P6 joins P4, and P1 joins P2P5 (similarity 0.34), giving clusters {P1, P2, P5} and {P3, P6, P4} with similarity 0.39 between them.]
Key notion: cluster representative
• We want a notion of a representative point in
a cluster
• The representative should be some sort of "typical" or central point in the cluster, e.g.,
– the point inducing the smallest radius over the docs in the cluster
– the point with the smallest sum of squared distances to the docs, etc.
– the point that is the "average" of all docs in the cluster
• Centroid or center of gravity
Centroid-based Similarity
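A minimal sketch of one common definition (an assumption here): represent each cluster by its centroid and compare the centroids with cosine similarity.

# Centroid-based similarity between two clusters of document vectors.
import numpy as np

def centroid(docs: np.ndarray) -> np.ndarray:
    """Center of gravity: the mean of the document vectors."""
    return docs.mean(axis=0)

def centroid_similarity(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

# Illustrative clusters of 3-dimensional "document" vectors
a = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]])
b = np.array([[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]])
print(centroid_similarity(a, b))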
Levenshtein distance - example
• distance("William Cohen", "Willliam Cohon")

s:    W I L L - I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2

("-" marks the gap in s, "_" the space character; C = copy, I = insert, S = substitute; total cost = 2)
Computing Levenshtein distance
      C  O  H  E  N
M     1  2  3  4  5
C     1  2  3  4  5
C     2  2  3  4  5
O     3  2  3  4  5
H     4  3  2  3  4
N     5  4  3  3  3

(bottom-right cell = D(s, t) = 3)
Computing Levenshtein distance
• A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).

      C  O  H  E  N
M     1  2  3  4  5
C     1  2  3  4  5
C     2  2  3  4  5
O     3  2  3  4  5
H     4  3  2  3  4
N     5  4  3  3  3
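A minimal sketch of the dynamic program behind this table; the implementation below is a standard formulation with unit costs for insert, delete, and substitute:

# Levenshtein distance: D[i][j] = cheapest edit of s[:i] into t[:j].
def levenshtein(s: str, t: str) -> int:
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                            # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                            # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,   # copy / substitute
                          D[i - 1][j] + 1,         # delete
                          D[i][j - 1] + 1)         # insert
    return D[m][n]                             # = D(s, t)

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2, matching the alignment above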
Large Clustering Problems
• Many examples
• Many clusters
• Many dimensions
Example Domains
Text
Images
Protein structure
EXPECTATION MAXIMIZATION
ALGORITHM (EM)
Model Space
• The choices of model space are plentiful, but not unlimited.
• There is a bit of “art” in selecting the
appropriate model space.
• Typically the model space is assumed to be a
linear combination of known probability
distribution functions.
Model based clustering
• Algorithm optimizes a probabilistic model criterion
• Clustering is usually done by the Expectation Maximization (EM)
algorithm
– Gives a soft variant of the K-means algorithm
– Assume k clusters: {c1, c2,… ck}
– Assume a probabilistic model of categories that allows
computing P(ci | E) for each category, ci, for a given
example, E.
– For text, typically assume a naïve Bayes category
model.
– Parameters = {P(ci), P(wj | ci) : i ∈ {1,…,k}, j ∈ {1,…,|V|}}
Expectation Maximization (EM) Algorithm
r is the number of data points. As the number of data points increases, the confidence of the estimator increases.
MLE for Mixture Distributions
• When we proceed to calculate the MLE for a mixture, the presence of the sum over the component distributions prevents a "neat" factorization using the log function (see the expression below).
• A completely new approach is required to estimate the parameters.
• The new approach also provides a solution to the clustering problem.
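Concretely, for a K-component mixture with weights pi_k and component densities p(x | theta_k), the log-likelihood keeps the sum inside the logarithm (in LaTeX):

\log L(\theta) = \sum_{i=1}^{n} \log \Big( \sum_{k=1}^{K} \pi_k \, p\big(x^{(i)} \mid \theta_k\big) \Big)

Because the inner sum cannot be moved outside the log, the likelihood does not factor into separate per-component terms.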
A Mixture Distribution
Missing Data
• We think of clustering as a problem of
estimating missing data.
• The missing data are the cluster labels.
• Clustering is only one example of a missing
data problem. Several other problems can be
formulated as missing data problems.
Missing Data Problem
• Let D = {x(1),x(2),…x(n)} be a set of n
observations.
• Let H = {z(1),z(2),..z(n)} be a set of n values of
a hidden variable Z.
– z(i) corresponds to x(i)
• Assume Z is discrete.
EM Algorithm
• The EM Algorithm alternates between
maximizing F with respect to Q (theta fixed)
and then maximizing F with respect to theta
(Q fixed).
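One standard way to write this alternation (the free-energy view of EM, stated here as background in LaTeX):

F(Q, \theta) = \mathbb{E}_{Q(Z)}\big[\log p(X, Z \mid \theta)\big] - \mathbb{E}_{Q(Z)}\big[\log Q(Z)\big]

\text{E-step: } Q^{(t+1)} = \arg\max_{Q} F(Q, \theta^{(t)}) = p(Z \mid X, \theta^{(t)})

\text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} F(Q^{(t+1)}, \theta)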
EM and K-means
• Notice the similarity between EM for Normal
mixtures and K-means.
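A minimal sketch of the comparison, assuming scikit-learn and illustrative data: EM for a Gaussian mixture produces soft posteriors P(ci | E), while K-means produces hard assignments.

# EM for Normal mixtures (soft assignments) next to K-means (hard assignments).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=3, random_state=0).fit(X)   # fitted by EM

print("K-means hard labels:", km.labels_[:5])
print("EM soft posteriors P(ci | E) for the same points:")
print(np.round(gm.predict_proba(X[:5]), 3))   # each row sums to 1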
Inference
Inputs → Density Estimator → probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
Inputs → Regressor → real no.: Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS