
Machine Learning

AIML CZG565
Unsupervised Learning

Course Faculty of M.Tech Cluster, BITS Pilani
BITS – CSIS – WILP, Pilani Campus
Machine Learning
Disclaimer and Acknowledgement

• The content of these modules and the context of the topics were planned by the course owner,
Dr. Sugata, with grateful acknowledgement to many others who made their course
materials freely available online.
• We hereby acknowledge all the contributors for their material and inputs.
• We have provided source information wherever necessary.
• Students are requested to refer to the textbook for the detailed content of the
presentation deck shared on Canvas.
• We have reduced the slides from Canvas and modified the content flow to suit the
requirements of the course and for ease of class presentation.

Slide Source / Preparation / Review:


From BITS Pilani WILP: Profs. Sugata, Chetana, Rajavadhana, Monali, Seetha, Vimal, Swarna

External: CS109 and CS229 Stanford lecture notes, Dr. Andrew Ng, and many others who
made their course materials freely available online
Course Plan

M1 Introduction & Mathematical Preliminaries

M2 Machine Learning Workflow

M3 Linear Models for Regression

M4 Linear Models for Classification

M5 Decision Tree

M6 Instance Based Learning

M7 Support Vector Machine

M8 Bayesian Learning

M9 Ensemble Learning

M10 Unsupervised Learning

M11 Machine Learning Model Evaluation/Comparison


Module –Unsupervised Learning

• Mixture Models
• Expectation Maximization (EM) Algorithm

• K-means Clustering



Unsupervised discovery task: Group the objects based on their
characteristic features

Input:
Structured data of features such as colour, shape,
size, etc.

Process:
Find the similarity/dissimilarity between objects,
with the features as the defining dimensions

Output:
Groups/clusters of objects



K-Means

K-Means Algorithm
• Works iteratively to find {𝜇k} and {rnk} such that the objective J is minimized
  (J is the K-means distortion measure; see the formula below)

• Each iteration involves two key steps

  • Find {rnk}, fixing {𝜇k}, to minimize J
  • Find {𝜇k}, fixing {rnk}, to minimize J

• Let us look at each of these steps

rnk is a binary indicator variable that indicates which cluster the point xn belongs to
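For reference, the distortion measure in LaTeX notation (the standard K-means objective, following Bishop; the slide's rendered formula is not reproduced here):

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2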





K-Means Algorithm
Overview

Initialize random cluster centers

Two phases:
1. Assign / re-assign data points to clusters based on minimum
distance to the cluster centers
2. Re-compute the cluster means

Repeat steps 1-2 till convergence





K-Means Algorithm
E-Step:

For all xn ∊ X, assign xn to the cluster whose center 𝜇k is nearest.

In this example, each data point is assigned to one of the two clusters.
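In LaTeX notation, the standard assignment rule (following Bishop; the slide's rendered formula is not reproduced here):

r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \lVert x_n - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}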



K-Means Algorithm

M-Step:

For all 𝜇k [where k = 1, 2, ..., K]:

re-compute the cluster center as the mean of the
points assigned to the corresponding cluster
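In LaTeX notation, the standard update (following Bishop; the slide's rendered formula is not reproduced here):

\mu_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}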



K-Means Algorithm

Algorithm:

Initialize 𝜇k [where k = 1, 2, ..., K]

Repeat
    E-Step [as defined earlier]
    M-Step [as defined earlier]
Until convergence of 𝜇k.
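A minimal NumPy sketch of this loop (an assumed illustration to read alongside the slides, not the course's reference implementation; the function and variable names are ours):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X is an (n, d) array of data points; returns the centers and hard assignments."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # initialize the K centers
    for _ in range(n_iter):
        # E-step: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # M-step: recompute each center as the mean of the points assigned to it
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                          # convergence of the centers
            break
        mu = new_mu
    return mu, labels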



K-Means Algorithm

[Figure slides repeating the algorithm box above on the running example: E-Step in the second iteration; M-Step in the second iteration]


Example
Consider the analysis of weights of individuals and their respective blood glucose levels as given
below:
a) Identify the clusters using K-means clustering (k=2) for the given data, assuming candidates
1 and 2 as the initial centroids.
b) How many iterations does it take before termination?

Candidate   Weight   Glucose level
1           72       185
2           56       170
3           60       168
4           68       179
5           72       182
6           77       188
7           70       180
8           84       183



Example (solution)

1st iteration: squared distances from the initial centroids C1(72, 185) and C2(56, 170),
with coordinates given as (weight, glucose level) and the squared distance written component-wise:

Candidate     Distance² to C1(72, 185)   Distance² to C2(56, 170)
3 (60, 168)   12² + 17²                  4² + 2²
4 (68, 179)   4² + 6²                    12² + 9²
5 (72, 182)   0² + 3²                    16² + 12²
6 (77, 188)   5² + 3²                    21² + 18²
7 (70, 180)   2² + 5²                    14² + 10²
8 (84, 183)   12² + 2²                   28² + 13²

After the 1st iteration, the cluster groups are C1{1, 4, 5, 6, 7, 8} and C2{2, 3}.

Re-computing centroids:
C1 = ([72+68+72+77+70+84]/6, [185+179+182+188+180+183]/6) = (73.83, 182.83)
C2 = ([56+60]/2, [170+168]/2) = (58, 169)

2nd iteration: squared distances from the resulting centroids C1(73.83, 182.83) and C2(58, 169):

Candidate     Distance² to C1(73.83, 182.83)   Distance² to C2(58, 169)
1 (72, 185)   1.83² + 2.17²                    14² + 16²
2 (56, 170)   17.83² + 12.83²                  2² + 1²
3 (60, 168)   13.83² + 14.83²                  2² + 1²
4 (68, 179)   5.83² + 3.83²                    10² + 10²
5 (72, 182)   1.83² + 0.83²                    14² + 13²
6 (77, 188)   3.17² + 5.17²                    19² + 19²
7 (70, 180)   3.83² + 2.83²                    12² + 11²
8 (84, 183)   10.17² + 0.17²                   26² + 14²

After the 2nd iteration, the cluster groups are again C1{1, 4, 5, 6, 7, 8} and C2{2, 3}; there is
no change in the cluster groups, hence the algorithm terminates after 2 iterations.
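A quick cross-check of this result (a sketch we added, using scikit-learn; not part of the original slides):

import numpy as np
from sklearn.cluster import KMeans

# (weight, glucose level) for candidates 1-8, in the order of the table above
X = np.array([[72, 185], [56, 170], [60, 168], [68, 179],
              [72, 182], [77, 188], [70, 180], [84, 183]], dtype=float)

init_centroids = X[[0, 1]]   # candidates 1 and 2 as the initial centroids

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.labels_)            # expected split: {1, 4, 5, 6, 7, 8} vs. {2, 3}
print(km.cluster_centers_)   # expected approx. (73.83, 182.83) and (58, 169)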



Evaluating K-means Clusters
• The most common measure is the Sum of Squared Error (SSE)
  – For each point, the error is the distance to the nearest cluster center
  – To get the SSE, we square these errors and sum them (see the formula below)

  – x is a data point in cluster Ci and mi is the representative point for cluster Ci

    • One can show that mi corresponds to the center (mean) of the cluster
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with a smaller K can have a lower SSE than a poor
      clustering with a higher K
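For reference, in LaTeX notation (the standard SSE definition used with K-means; the slide's rendered formula is not reproduced here):

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}^2(m_i, x)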



Importance of Choosing Initial Centroids

[Figure slides illustrating the effect of different initial centroids; images not reproduced]


Limitations of K-means
• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains outliers.



Limitations of K-means: Differing Sizes

[Figure: Original Points vs. K-means (3 Clusters)]


Limitations of K-means: Differing Density

[Figure: Original Points vs. K-means (3 Clusters)]


Limitations of K-means: Non-globular Shapes

[Figure: Original Points vs. K-means (2 Clusters)]


From K-Means to K-Medoids
• K-means is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the
data
• K-Medoids: instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used, which is the most
centrally located object in a cluster



How to choose a “good” k
for k-means clustering?
• Choose the number of clusters by visually inspecting the data points
• Elbow method:
  • Compute the sum of squared error (SSE) for several values of k (for example,
    2, 3, 4, 5, 6, ...). The SSE is defined as the sum of the squared distances between each
    member of a cluster and its centroid.
  • If you plot k against SSE, you will see that the error decreases as k gets larger. This
    is because when the number of clusters increases, the clusters become smaller, so the
    distortion is also smaller. The idea of the elbow method is to choose the k at which
    the SSE decreases abruptly (see the sketch below).
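A minimal sketch of the elbow method (an assumed illustration, not from the slides; the make_blobs data is only for demonstration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # demo data

ks = range(2, 9)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")   # look for the "elbow" in this curve
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()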
(Naive) K-Means for detecting
outliers
• Let
  • dist(x, 𝜇k) be the distance of a
    point x, assigned to cluster k, to its
    center 𝜇k
  • L𝜇k be the average distance of
    all the points assigned to
    cluster k to its center
• The ratio dist(x, 𝜇k) / L𝜇k is the
  outlier score of the point x
• The higher the ratio for a point x, the
  more likely x is an outlier
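A small sketch of this score (an assumed illustration; the function name and arguments are ours), computed on top of a fitted K-means result such as the scikit-learn example earlier:

import numpy as np

def kmeans_outlier_scores(X, centers, labels):
    # dist(x, mu_k): distance of each point to the center of its own cluster
    d = np.linalg.norm(X - centers[labels], axis=1)
    # L_mu_k: average distance of the points assigned to cluster k to that center
    L = np.array([d[labels == k].mean() for k in range(len(centers))])
    return d / L[labels]   # higher ratio => more likely an outlier

For example: scores = kmeans_outlier_scores(X, km.cluster_centers_, km.labels_)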



Notion of Mixture Models

K-means – Hard clustering

• Assigns each example to exactly one cluster

• What if clusters are overlapping?
  – Hard to tell which cluster is right
  – Maybe we should try to remain uncertain
• What if a cluster has a non-circular shape?
• With hard assignments of data points to clusters, a small shift of a data point can
  flip it to a different cluster
• For example, if a point is near the 'border' between two clusters, it is often
  better to know that it has nearly equal membership probabilities for these
  clusters, rather than blindly assigning it to the nearest one.

1-of-K coding mechanism

• For each data point xn, we introduce a set of binary indicator variables
  rnk ∈ {0, 1}, where k = 1, ..., K, describing which of the K clusters the data point xn is
  assigned to: if data point xn is assigned to cluster k, then rnk = 1 and rnj = 0 for all j
  not equal to k.
• Example: 5 data points and 3 clusters (see the illustration below)
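One possible assignment matrix for that example (an assumed illustration, since the slide's figure is not reproduced here); each row has exactly one 1, marking the cluster the point is assigned to:

      cluster 1   cluster 2   cluster 3
x1        1           0           0
x2        0           0           1
x3        0           1           0
x4        1           0           0
x5        0           0           1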



GMM – Soft Clustering

• Solution: replace the ‘hard’ clustering of K-means with ‘soft’ probabilistic
  assignments
• Probabilistic clustering
  • Represents the probability distribution of the data as a Gaussian
    mixture model (GMM)
  • GMMs give a probabilistic assignment of points to clusters, which
    quantifies uncertainty
• Mixture models combine a set of distributions to create a convex space
  where we can search for the optimal parameters of those distributions
  using MLE
• In a GMM, clusters are modeled as Gaussians
• EM algorithm: assign data to clusters with some probability

Source Credit: Christopher M. Bishop



Gaussian Distribution
• The normal curve is bell-shaped and has a single peak at the
  exact center of the distribution.
• The arithmetic mean, median, and mode of the distribution are equal and located at
  the peak.
• Half the area under the curve lies to the right of the mean, and the other half to the left.
• The normal distribution is symmetrical about its mean.



Mathematical Function (pdf)
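The standard univariate Gaussian density, in LaTeX notation (the slide's rendered formula is not reproduced here):

N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)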



Univariate and Multivariate
Gaussian density
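For reference, the standard D-dimensional density in LaTeX notation (reconstructed from the usual definition; the slide's rendered formulas are not reproduced here):

N(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} \, \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)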



2-dimensional Gaussian distribution



GMM – Mixture of Gaussians

• A probabilistic cluster is a distribution over the data
  space, which can be mathematically represented
  using a probability density function (or distribution
  function).
• Our task: infer a set of k probabilistic clusters that is
  most likely to have generated D



Example: Mixture of 3 Gaussians

[Figure: three panels (a), (b), (c) in the (x1, x2) plane]

• The observed data come from a Gaussian mixture distribution which
  consists of K Gaussians with their own means and variances.
• The K classes are latent.



GMM – Why a single distribution is
insufficient

• The Gaussian distribution is intrinsically unimodal
• A single distribution is not sufficient to fit many real
  data sets, which are multimodal
• A mixture model assumes that a set of observed
  objects is a mixture of instances from multiple
  probabilistic clusters, and conceptually each observed
  object is generated independently

[Figure: time to next geyser eruption (in mins) vs. length of geyser eruption (in mins)]
Learning a Mixture of Gaussians

[Figure: our actual observations; a mixture of 3 Gaussians]

Source Credit :CS229: Machine Learning ©2021 Carlos Guestrin



Gaussian Mixture Example

Slide Credit: Andrew W. Moore

[Figure slides: the Gaussian mixture fit at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
GMM equations

Mixture models
  K = number of mixture components (clusters)
  πⱼ = mixture weights
  pⱼ(x) = family of distributions (Gaussian, Bernoulli, etc.)

Gaussian Mixture models
  θ = collection of all the parameters of the model
  (mixture weights, means, and covariance matrices)
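Written out in LaTeX notation (the standard mixture-model equations, consistent with the symbols above; the slide's rendered formulas are not reproduced here):

p(x) = \sum_{j=1}^{K} \pi_j \, p_j(x), \qquad \pi_j \ge 0, \quad \sum_{j=1}^{K} \pi_j = 1

p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k), \qquad \theta = \{ \pi_1,\dots,\pi_K,\ \mu_1,\dots,\mu_K,\ \Sigma_1,\dots,\Sigma_K \}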



General EM algorithm

Observed data: D = {x1, ..., xn}

Unknown (latent) variables: y
  In clustering: y ∈ {1, ..., K}, the cluster assignment
Parameters: θ
  In GMM: θ = π : {π1, . . . , πK}
              μ : {μ1, . . . , μK}
              Σ : {Σ1, . . . , ΣK}
Goal: find the θ that maximizes the likelihood of the observed data
(for the GMM, see the expression below)
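In LaTeX notation, the standard maximum-likelihood objective for the GMM (following Bishop; the slide's rendered formula is not reproduced here):

\theta^{*} = \arg\max_{\theta} \ \ln p(D \mid \theta), \qquad \ln p(D \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, N(x_n \mid \mu_k, \Sigma_k) \right)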



EM algorithm for GMM

E-step:
Evaluate the responsibilities (posterior probabilities) using the current parameter values

• For each example xn,
  • compute γ(znk), the probability that xn was generated by component k, i.e. that it
    belongs to cluster k (see the formula below)
  • If xn is very likely under the kth Gaussian, it gets a high weight
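In LaTeX notation, the standard responsibility (following Bishop; the slide's rendered formula is not reproduced here):

\gamma(z_{nk}) = \frac{\pi_k \, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_n \mid \mu_j, \Sigma_j)}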



EM algorithm for GMM

M-Step:
• Maximize the expectation of the complete-data log-likelihood, computed with
  respect to the conditional probabilities found in the Expectation step. The result of
  the maximization is a new parameter vector μnew, Σnew and πnew.

• Keep 𝛾(znk) fixed, and apply MLE, maximizing ln p(X|π,µ,Σ) with respect to μk, Σk and
  πk, to get μnew, Σnew and πnew (see the update equations below)
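In LaTeX notation, the standard M-step updates (following Bishop; the slide's rendered formulas are not reproduced here), with N_k = \sum_{n} \gamma(z_{nk}):

\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad
\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\mathsf{T}}, \qquad
\pi_k^{\text{new}} = \frac{N_k}{N}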

Repeat E & M until convergence

– In practice, the algorithm is deemed to have converged when the change in
  the log likelihood function, or alternatively in the parameters, falls below some
  threshold



EM algorithm for GMM (summary)

To estimate: 𝝿, 𝜇, Σ and 𝛾(znk)

Initialize 𝝿, 𝜇, Σ and evaluate the log likelihood

Perform the E-Step given 𝝿, 𝜇, Σ: evaluate the responsibilities 𝛾(znk)

Perform the M-Step given 𝛾(znk): re-estimate 𝝿, 𝜇, Σ

Repeat until convergence

Nk = the effective number of points assigned to cluster k



EM Algorithm

I. Standardize the data if required

II. Fix the number of clusters expected

III. Initialize the prototypes:
     mean, covariance, weights

IV. Expectation-Step: fix the prototypes and find
    the membership of each point, weighted
    by the probability value

V. Calculate the log likelihood of the points

VI. Maximization-Step: fix the
    membership (responsibility matrix) and
    re-estimate the prototypes

VII. Calculate the new log likelihood of
     the points; repeat the E & M steps till
     convergence is achieved (see the sketch below)
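A compact NumPy/SciPy sketch of steps III-VII (an assumed illustration, not the course's reference code; function and variable names are ours):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # III. Initialize prototypes: means (random points), covariances (identity), weights (uniform)
    mu = X[rng.choice(n, size=K, replace=False)]
    sigma = np.array([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # IV. E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, sigma_k)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # V./VII. Log likelihood for the convergence check
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # VI. M-step: re-estimate weights, means, covariances
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma, gamma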



GMM

IV. Expectation-Step: fix the prototypes and find the membership of each
point, weighted by the probability value.

A sample calculation for this step is shown below.





GMM

For each doc, sum the two values found previously (Weightk * Nk for each cluster);
this gives the denominator of the responsibility formula.

For Doc2:
  Weight1*N1 = 0.0005
  Weight2*N2 = 0.0828
  Sum = 0.08329 ≈ 0.0833

Now find the membership of the doc:
  P(Zi1) = Weight1*N1 divided by the Sum:  P(Z_Doc2, Cluster1) = 0.0005 / 0.0833 = 0.006
  P(Zi2) = Weight2*N2 divided by the Sum:  P(Z_Doc2, Cluster2) = 0.0828 / 0.0833 = 0.994

Similarly do it for all the docs.



GMM

V. Calculate the log likelihood of the points.

Log likelihood = ln(sum of Doc1) + ln(sum of Doc2) + ln(sum of Doc3) + ...
               = ln(0.010) + ln(0.083) + ln(0.006)
               = -12.16

This value needs to be computed in every iteration; it increases over the iterations and
eventually levels off. When its change falls below a threshold, the algorithm is considered
to have converged, and that is the stopping point.



GMM

VI. Maximization-Step: fix the membership (responsibility matrix)
and re-estimate the prototypes.

Sample calculations are shown below.

N1 for cluster k=1 is obtained by summing the responsibilities of all docs for cluster 1;
similarly compute N2.

For cluster k=1, the mean is obtained by weighting the data by the membership:
[0.9999(2) + 0.006(4) + 1(7)]/N1 , [0.9999(2) + 0.006(5) + 1(2)]/N1

Similarly do it for cluster 2.

Weight of cluster 1 (new) = N1 / N = 2.0059/3 = 0.67
GMM - EM Algorithm

The 2nd iteration continues with the same steps I-VII until convergence is achieved.
Gaussian Mixture Model

Note:
The shape of the clusters in the GMM can be constrained by setting the covariance_type
hyperparameter when calling the function:

• "spherical": all clusters must be spherical, but they can have different diameters
(i.e., different variances).
• "diag": clusters can take on any ellipsoidal shape of any size, but the ellipsoid’s axes
must be parallel to the coordinate axes (i.e., the covariance matrices must be
diagonal).
• "tied": all clusters must have the same ellipsoidal shape, size and orientation (i.e., all
clusters share the same covariance matrix).



Gaussian Mixture Model

from sklearn.mixture import GaussianMixture

# Fit a 3-component GMM; n_init runs EM from 10 random initializations
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)

gm.weights_       # mixture weights (pi_k)
gm.means_         # component means (mu_k)
gm.covariances_   # component covariance matrices (Sigma_k)
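A short follow-on sketch (ours, not from the slides) tying this to the covariance_type options above and to the soft assignments discussed earlier:

gm = GaussianMixture(n_components=3, n_init=10, covariance_type="tied")
gm.fit(X)
labels = gm.predict(X)        # hard cluster labels
probs = gm.predict_proba(X)   # soft assignments: the responsibilities gamma(z_nk)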



References

• Christopher Bishop: Pattern Recognition and Machine Learning,
  Springer International Edition

• https://www.youtube.com/watch?v=TG6Bh-NFhA0

• https://www.youtube.com/watch?v=qMTuMa86NzU

