Unit 2 - SVM
[Figure: training examples plotted by Temperature vs. Humidity; one marker type = play tennis, the other = do not play tennis]
Linear Support Vector Machines
Data: {(xi, yi)}, i = 1, …, l
xi ∈ R^d
yi ∈ {-1, +1}
[Figure: linearly separable training points in the (x1, x2) plane, labelled +1 and -1]
Linear SVM 2
[Figure: separating hyperplane H with parallel hyperplanes H1 (where f(x) = +1) and H2 (where f(x) = -1)]
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A^2 + B^2).
The distance between H and H1 is:
|w•x+b| / ||w|| = 1 / ||w||
so the margin between H1 and H2 is 2 / ||w||.
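As an illustration, a minimal sketch that fits a linear SVM with scikit-learn and reports the margin width 2/||w||; the toy data and the large C (to approximate a hard margin) are assumptions made here for the example.

# Fit a linear SVM on toy data and compute the margin width 2/||w||.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes labelled -1 and +1 (illustrative data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class -1
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                   # weight vector w
b = clf.intercept_[0]              # bias b
margin = 2.0 / np.linalg.norm(w)   # distance between H1 and H2

print("w =", w, "b =", b, "margin 2/||w|| =", margin)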
Testing
Classify each new instance by a vote of the k classifiers (with equal weights)
Bootstrap distribution
• The bootstrap does not replace or add to the
original data.
• Bootstrapping:
– One original sample → B bootstrap samples
– B bootstrap samples → bootstrap distribution
• Bootstrap distributions usually approximate the
shape, spread, and bias of the actual sampling
distribution.
• …
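As a sketch of the procedure above, assuming NumPy and an illustrative original sample and statistic (the mean):

# Bootstrapping: one original sample -> B bootstrap samples -> bootstrap distribution.
import numpy as np

rng = np.random.default_rng(0)
original = np.array([2.1, 3.5, 4.0, 5.2, 6.8, 7.1, 8.3, 9.0])  # illustrative sample
B = 1000

boot_means = np.empty(B)
for i in range(B):
    # Resample the original data with replacement (same size as the original)
    resample = rng.choice(original, size=original.size, replace=True)
    boot_means[i] = resample.mean()

# The bootstrap distribution approximates the shape, spread, and bias of the
# actual sampling distribution of the mean.
print("original mean:", original.mean())
print("bootstrap mean of means:", boot_means.mean())
print("bootstrap std (spread):", boot_means.std(ddof=1))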
Bagging Example
Original:        1 2 3 4 5 6 7 8
Training set 1:  2 7 8 3 7 6 3 1
Training set 2:  7 8 5 6 4 2 7 1
Training set 3:  3 6 2 7 5 6 2 2
Training set 4:  4 5 1 4 6 4 3 8
Bagging (cont …)
• When does it help?
– When learner is unstable
• Small change to training set causes large
change in the output classifier
• True for decision trees, neural networks; not
true for k-nearest neighbor, naïve Bayesian,
class association rules
– Experimentally, bagging can help
substantially for unstable learners, may
somewhat degrade results for stable
learners
Bagging
For i = 1 .. M
Draw n* < n samples from D with replacement
Learn classifier Ci
Final classifier is a vote of C1 .. CM
Increases classifier stability/reduces variance
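A minimal sketch of this loop, assuming scikit-learn decision trees as the (unstable) base learner and a synthetic dataset in place of D:

# Bagging: draw n* < n samples with replacement, learn Ci, vote C1 .. CM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # illustrative D
rng = np.random.default_rng(0)
M = 25                       # number of classifiers
n_star = int(0.8 * len(X))   # n* < n

classifiers = []
for i in range(M):
    idx = rng.choice(len(X), size=n_star, replace=True)  # sample with replacement
    classifiers.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# Final classifier: equal-weight vote of C1 .. CM
votes = np.stack([c.predict(X) for c in classifiers])
y_vote = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (y_vote == y).mean())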
Boosting
• A family of methods:
– We only study AdaBoost (Freund &
Schapire, 1996)
• Training
– Produce a sequence of classifiers (the
same base learner)
– Each classifier is dependent on the
previous one, and focuses on the
previous one’s errors
– Examples that are incorrectly predicted
in previous classifiers are given higher
weights
• Testing
– For a test case, the results of the series
of classifiers are combined to determine
the final class of the test case.
AdaBoost
• Weighted training set: (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn)
– Non-negative weights that sum to 1
• Build a classifier ht whose accuracy on the training set is > ½ (better than random); such an ht is called a weak classifier
• Change the weights
AdaBoost algorithm
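A minimal sketch of the weight update described above, assuming scikit-learn decision stumps as the weak learner and illustrative data and T; this follows the usual AdaBoost.M1 recipe rather than any specific slide:

# AdaBoost: reweight examples so that misclassified ones get higher weight,
# then combine the weak classifiers by a weighted vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
y = np.where(y == 1, 1, -1)      # labels in {-1, +1}

n, T = len(X), 20
w = np.full(n, 1.0 / n)          # non-negative weights, sum to 1
stumps, alphas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()     # weighted training error
    if err >= 0.5:               # not better than random: stop
        break
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(-alpha * y * pred)   # raise weights of misclassified examples
    w /= w.sum()                     # renormalise so the weights sum to 1
    stumps.append(h)
    alphas.append(alpha)

# Final classifier: weighted vote of the sequence of weak classifiers
agg = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
print("training accuracy:", (np.sign(agg) == y).mean())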
Bagging, Boosting and C4.5
[Figure: scatter plots comparing C4.5's mean error rate over the 10 cross-validations: Bagged C4.5 vs. C4.5, Boosted C4.5 vs. C4.5, and Boosting vs. Bagging]
Does AdaBoost always work?
• The actual performance of boosting depends on the data and the base learner.
– It requires the base learner to be unstable, as with bagging.
• Boosting seems to be susceptible to noise.
– When the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.
What is Clustering?
• Attach a label to each observation (data point) in a set
• You can think of this as "unsupervised classification"
• Clustering is alternatively called "grouping"
• Intuitively, you would want to assign the same label to data points that are "close" to each other
• Thus, clustering algorithms rely on a distance metric between
data points
• Sometimes, it is said that for clustering, the distance metric is
more important than the clustering algorithm
Distances: Quantitative Variables
Data point: xi = [xi1 … xip]^T
Some examples:
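For instance, common choices for quantitative variables include the Euclidean, Manhattan, and (more generally) Minkowski distances, written here in LaTeX:

d_{\mathrm{Euclidean}}(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}

d_{\mathrm{Manhattan}}(x_i, x_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|

d_{\mathrm{Minkowski}}(x_i, x_j) = \Big( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{q} \Big)^{1/q}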
Types of clustering:
1. Hierarchical algorithms: these find successive clusters
using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms
begin with each element as a separate cluster and merge them
into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with the
whole set and proceed to divide it into successively smaller
clusters.
2. Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
– K-means and derivatives
– Fuzzy c-means clustering
– QT clustering algorithm
Common Distance measures:
How K-means partitions?
A partition amounts to a
Voronoi Diagram
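A minimal sketch, assuming scikit-learn and illustrative 2-D data: after K-means converges, each point sits in the cell of its nearest centroid, i.e. the partition is a Voronoi diagram of the centroids.

# K-means partition = nearest-centroid (Voronoi) assignment.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 3])])   # three illustrative blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

# Assign each point to its nearest centroid and compare with the K-means labels
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print("fraction matching the nearest-centroid rule:", (nearest == km.labels_).mean())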
• Generalized K-means
• Computationally much costlier than K-means
• Apply when dealing with categorical data
• Apply when data points are not available, but only
pair-wise distances are available
• Converges to local minimum
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
[Figure: example dendrogram in which animal splits into vertebrate and invertebrate]
• Agglomerative (bottom-up):
– Start with each document being a single cluster.
– Eventually all documents belong to the same cluster.
• Divisive (top-down):
– Start with all documents belonging to the same cluster.
– Eventually each node forms a cluster on its own.
– Could be a recursive application of k-means-like algorithms
• Clustering obtained
by cutting the
dendrogram at a
desired level: each
connected
component forms a
cluster.
Hierarchical Agglomerative Clustering
(HAC)
• Starts with each doc in a separate cluster
– then repeatedly joins the closest pair of
clusters, until there is only one cluster.
• The history of merging forms a binary tree
or hierarchy.
How to measure the distance between clusters?
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “furthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
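A minimal sketch of these variants, assuming SciPy's hierarchical clustering and a few illustrative points:

# HAC with different definitions of the "closest pair" of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [9.0, 0.0]])   # illustrative points

for method in ("single", "complete", "centroid", "average"):
    Z = linkage(X, method=method)                     # merge history (binary tree)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(method, labels)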
Single Link Agglomerative Clustering
• Use the maximum similarity of pairs:
sim(ci, cj) = max{ sim(x, y) : x ∈ ci, y ∈ cj }
[Worked example: successive single-link similarity matrices over points P1–P6 (entries such as 0.23, 0.22, 0.15, 0.37). P2 merges with P5 and P3 merges with P6; then P2P5 merges with P3P6; finally, the distance between P2P5P3P6 and P4 is small, so they are merged into one cluster, leaving {P1} and {P2, P5, P3, P6, P4}.]
Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:
sim(ci, cj) = min{ sim(x, y) : x ∈ ci, y ∈ cj }
[Worked example: the same points P1–P6 under complete link. P2 merges with P5 and P3 merges with P6 as before, but the later merges differ: P3P6 joins P4, and P1 joins P2P5 (similarity 0.34), giving clusters {P1, P2, P5} and {P3, P6, P4} with similarity 0.39 between them.]
Key notion: cluster representative
• We want a notion of a representative point in
a cluster
• The representative should be some sort of "typical" or central point in the cluster, e.g.,
– the point inducing the smallest radius over the docs in the cluster
– the point with the smallest sum of squared distances to the docs, etc.
– the point that is the "average" of all docs in the cluster
• Centroid or center of gravity
Centroid-based Similarity
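A minimal sketch of one common definition (an assumption here): represent each cluster by its centroid and compare the centroids with cosine similarity.

# Centroid-based similarity between two clusters of document vectors.
import numpy as np

def centroid(docs: np.ndarray) -> np.ndarray:
    """Center of gravity: the mean of the document vectors."""
    return docs.mean(axis=0)

def centroid_similarity(cluster_a: np.ndarray, cluster_b: np.ndarray) -> float:
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

# Illustrative clusters of 3-dimensional "document" vectors
a = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]])
b = np.array([[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]])
print(centroid_similarity(a, b))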
Levenshtein distance - example
• distance("William Cohen", "Willliam Cohon")

s:    W I L L - I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2

("-" marks the gap in s, "_" the space character; C = copy, I = insert, S = substitute; total cost = 2)
Computing Levenshtein distance
      C  O  H  E  N
M     1  2  3  4  5
C     1  2  3  4  5
C     2  2  3  4  5
O     3  2  3  4  5
H     4  3  2  3  4
N     5  4  3  3  3

(bottom-right cell = D(s, t) = 3)
Computing Levenshtein distance
• A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).

      C  O  H  E  N
M     1  2  3  4  5
C     1  2  3  4  5
C     2  2  3  4  5
O     3  2  3  4  5
H     4  3  2  3  4
N     5  4  3  3  3
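A minimal sketch of the dynamic program behind this table; the implementation below is a standard formulation with unit costs for insert, delete, and substitute:

# Levenshtein distance: D[i][j] = cheapest edit of s[:i] into t[:j].
def levenshtein(s: str, t: str) -> int:
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                            # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                            # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,   # copy / substitute
                          D[i - 1][j] + 1,         # delete
                          D[i][j - 1] + 1)         # insert
    return D[m][n]                             # = D(s, t)

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2, matching the alignment above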
Large Clustering Problems
• Many examples
• Many clusters
• Many dimensions
Example Domains
Text
Images
Protein structure
EXPECTATION MAXIMIZATION
ALGORITHM (EM)
Model Space
• The choices of model space are plentiful, but not unlimited.
• There is a bit of “art” in selecting the
appropriate model space.
• Typically the model space is assumed to be a
linear combination of known probability
distribution functions.
Model based clustering
• Algorithm optimizes a probabilistic model criterion
• Clustering is usually done by the Expectation Maximization (EM)
algorithm
– Gives a soft variant of the K-means algorithm
– Assume k clusters: {c1, c2,… ck}
– Assume a probabilistic model of categories that allows
computing P(ci | E) for each category, ci, for a given
example, E.
– For text, typically assume a naïve Bayes category
model.
– Parameters = {P(ci), P(wj | ci) : i ∈ {1,…,k}, j ∈ {1,…,|V|}}
Expectation Maximization (EM) Algorithm
r is the number of data points. As the number of data points increases, the confidence of the estimator increases.
MLE for Mixture Distributions
• When we proceed to calculate the MLE for a mixture, the presence of the sum over the component distributions prevents a "neat" factorization using the log function (see the expression below).
• A completely new approach is required to estimate the parameters.
• The new approach also provides a solution to the clustering problem.
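Concretely, for a K-component mixture with weights pi_k and component densities p(x | theta_k), the log-likelihood keeps the sum inside the logarithm (in LaTeX):

\log L(\theta) = \sum_{i=1}^{n} \log \Big( \sum_{k=1}^{K} \pi_k \, p\big(x^{(i)} \mid \theta_k\big) \Big)

Because the inner sum cannot be moved outside the log, the likelihood does not factor into separate per-component terms.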
A Mixture Distribution
Missing Data
• We think of clustering as a problem of
estimating missing data.
• The missing data are the cluster labels.
• Clustering is only one example of a missing
data problem. Several other problems can be
formulated as missing data problems.
Missing Data Problem
• Let D = {x(1),x(2),…x(n)} be a set of n
observations.
• Let H = {z(1),z(2),..z(n)} be a set of n values of
a hidden variable Z.
– z(i) corresponds to x(i)
• Assume Z is discrete.
EM Algorithm
• The EM Algorithm alternates between
maximizing F with respect to Q (theta fixed)
and then maximizing F with respect to theta
(Q fixed).
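One standard way to write this alternation (the free-energy view of EM, stated here as background in LaTeX):

F(Q, \theta) = \mathbb{E}_{Q(Z)}\big[\log p(X, Z \mid \theta)\big] - \mathbb{E}_{Q(Z)}\big[\log Q(Z)\big]

\text{E-step: } Q^{(t+1)} = \arg\max_{Q} F(Q, \theta^{(t)}) = p(Z \mid X, \theta^{(t)})

\text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} F(Q^{(t+1)}, \theta)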
EM and K-means
• Notice the similarity between EM for Normal
mixtures and K-means.
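A minimal sketch of the comparison, assuming scikit-learn and illustrative data: EM for a Gaussian mixture produces soft posteriors P(ci | E), while K-means produces hard assignments.

# EM for Normal mixtures (soft assignments) next to K-means (hard assignments).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=3, random_state=0).fit(X)   # fitted by EM

print("K-means hard labels:", km.labels_[:5])
print("EM soft posteriors P(ci | E) for the same points:")
print(np.round(gm.predict_proba(X[:5]), 3))   # each row sums to 1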
Inference
Inputs → Density Estimator → probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
Inputs → Regressor → real no.: Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS