Unit III
• The introduction of additional latent variables allows relatively complex marginal distributions over the observed variables to be expressed in terms of more tractable joint distributions over the expanded space of observed and latent variables.
• A general technique for finding maximum-likelihood estimators in such latent-variable models is the Expectation-Maximization (EM) algorithm.
K-MEANS CLUSTERING
Robbins-Monro procedure:
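In this setting, a Robbins-Monro style sequential (online) version of K-means updates only the nearest centre each time a new data point arrives, via µ_k ← µ_k + η_n (x_n − µ_k) with a decreasing learning rate η_n. Below is a minimal sketch; the toy data and the learning-rate schedule η_n = 1/n_k are illustrative assumptions, not part of the original notes.

# Minimal sketch: sequential (online) K-means update in a Robbins-Monro style.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # toy data

K = 2
mu = X[rng.choice(len(X), K, replace=False)].copy()   # initialise centres from the data
counts = np.zeros(K)                                  # number of points seen per centre

for x in X:                                           # process points one at a time
    k = np.argmin(np.sum((mu - x) ** 2, axis=1))      # index of the nearest centre
    counts[k] += 1
    eta = 1.0 / counts[k]                             # decreasing learning rate eta_n
    mu[k] += eta * (x - mu[k])                        # mu_k <- mu_k + eta * (x - mu_k)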
Image Segmentation and Compression
• The goal of segmentation is to partition an image into regions each of which has a reasonably homogeneous visual
appearance or which corresponds to objects or parts of objects.
• Each pixel in an image is a point in a 3-dimensional space comprising the intensities of the red, blue, and green channels,
and our segmentation algorithm simply treats each pixel in the image as a separate data point.
• We illustrate the result of running K-means to convergence, for any particular value of K, by re-drawing the image, replacing each pixel vector with the {R, G, B} intensity triplet given by the centre µ_k to which that pixel has been assigned. Results for various values of K are shown in the accompanying figure.
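A minimal sketch of this procedure is shown below, assuming an RGB image file is available; the file name, the value of K, and the use of scikit-learn's K-means implementation are illustrative assumptions.

# Minimal sketch: K-means segmentation/compression of an image.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# "example.png" is a hypothetical file name; any RGB image works.
img = np.asarray(Image.open("example.png").convert("RGB"), dtype=float) / 255.0
H, W, _ = img.shape
pixels = img.reshape(-1, 3)                 # each pixel is a point in (R, G, B) space

K = 8                                       # illustrative number of colour clusters
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with the centre mu_k of the cluster it was assigned to.
compressed = km.cluster_centers_[km.labels_].reshape(H, W, 3)
Image.fromarray((compressed * 255).astype(np.uint8)).save("segmented_K8.png")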
Factor analysis
Unidentifiability
Principal components analysis (PCA)
Classical PCA: statement of the theorem
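• In outline, the classical PCA theorem states: suppose we wish to find an orthonormal set of L linear basis vectors w_j ∈ R^D, collected into a matrix W, and corresponding scores z_i ∈ R^L, so as to minimize the average reconstruction error J(W, Z) = (1/N) Σ_i ||x_i − W z_i||². Then the optimal solution is Ŵ = V_L, where V_L contains the L eigenvectors with the largest eigenvalues of the empirical covariance matrix, and the optimal scores are ẑ_i = Ŵᵀ x_i (the orthogonal projection of the centred data onto the principal subspace).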
Singular value decomposition (SVD)
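As a sketch of how the SVD relates to PCA, the principal directions can be read off the SVD of the centred data matrix; the toy data below is an illustrative assumption.

# Minimal sketch: PCA via the SVD of the centred data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy N x D data matrix
Xc = X - X.mean(axis=0)                  # centre the data

# Thin SVD: Xc = U S Vt; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

L = 2                                    # number of latent dimensions to keep
W = Vt[:L].T                             # D x L matrix of principal components
Z = Xc @ W                               # N x L scores (latent representation)
X_hat = Z @ W.T + X.mean(axis=0)         # rank-L reconstruction of the data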
Choosing the number of latent dimensions
Model selection for FA/PPCA
Model selection for PCA
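One simple heuristic, sketched below, is to keep the smallest number of components whose cumulative fraction of variance explained exceeds a chosen threshold; the 95% threshold is an illustrative assumption.

# Minimal sketch: choose the number of PCA components by variance explained.
import numpy as np

def choose_num_components(X, threshold=0.95):
    Xc = X - X.mean(axis=0)
    _, S, _ = np.linalg.svd(Xc, full_matrices=False)
    frac = np.cumsum(S ** 2) / np.sum(S ** 2)      # cumulative fraction of variance
    return int(np.searchsorted(frac, threshold) + 1)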
Clustering
A grouping of data objects such that the objects within a group are similar (or near) to one another and dissimilar
(or far) from the objects in other groups. Clustering falls under the branch of unsupervised learning, which aims at gaining insights from unlabeled data points; that is, unlike supervised learning, we do not have a target variable.
It is not necessary that the clusters formed be circular in shape; clusters can have arbitrary shapes, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric such as Euclidean distance, cosine similarity, or Manhattan distance, and then groups the points with the highest similarity together.
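A minimal sketch of the measures mentioned above, evaluated on two illustrative feature vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))                          # Euclidean distance
manhattan = np.sum(np.abs(x - y))                                  # Manhattan distance
cosine_similarity = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity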
Types of clustering
• Partitional - each object belongs in exactly one cluster
• Hierarchical - a set of nested clusters organized in a tree
A distinction among different types of clusterings is whether the set of clusters is nested or unnested.
A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
A hierarchical clustering is a set of nested clusters that are organized as a tree.
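The contrast can be sketched on toy data: a partitional method such as K-means assigns each object to exactly one cluster, while a hierarchical method builds a tree of nested merges that can be cut at any level. The data, libraries, and parameter values below are illustrative assumptions.

# Minimal sketch: partitional vs hierarchical clustering on the same toy data.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Partitional: each point belongs to exactly one of the K clusters.
partitional = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: a tree of nested merges, cut here into 2 clusters.
Z = linkage(X, method="average")
hierarchical = fcluster(Z, t=2, criterion="maxclust")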
Similarity vs Dissimilarity measures
The similarity or dissimilarity between two objects is generally based on the difference in their corresponding attribute values. In a clustering algorithm, the similarity or dissimilarity between objects is usually measured by a distance function.
Data Types: Similarity and Dissimilarity Measures
For nominal variables, these measures are binary, indicating whether two values are equal or not.
For ordinal variables, the measure is the difference between the two values, normalized by the maximum possible distance. For the other (numeric) variables, it is simply a distance function.
Measuring (dis)similarity
Some common attribute dissimilarity functions are as follows:
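For example, following the descriptions above, per-attribute dissimilarities can be sketched as follows; the specific forms below are common choices rather than the only possibilities.

# Minimal sketch: per-attribute dissimilarity functions.
def quantitative_dissim(x, y):
    return abs(x - y)                                  # could also use (x - y) ** 2

def ordinal_dissim(rank_x, rank_y, num_levels):
    return abs(rank_x - rank_y) / (num_levels - 1)     # normalized by the max distance

def nominal_dissim(x, y):
    return 0 if x == y else 1                          # binary: equal or not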
Evaluating the output of clustering methods
Purity
Purity is defined as
   purity = Σ_i (N_i / N) · max_j (N_ij / N_i),
where N_ij is the number of objects in cluster i that belong to class j, N_i = Σ_j N_ij is the total number of objects in cluster i, and N is the total number of objects.
• The purity ranges between 0 (bad) and 1 (good). However, we can trivially achieve a purity of 1 by putting each object into its own cluster, so this measure does not penalize for the number of clusters.
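A minimal sketch of computing purity from cluster assignments and known class labels; the toy labels are illustrative.

import numpy as np
from collections import Counter

def purity(cluster_labels, class_labels):
    N = len(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = [class_labels[i] for i in range(N) if cluster_labels[i] == c]
        total += Counter(members).most_common(1)[0][1]   # size of the majority class
    return total / N

# Example: clusters {a, a, b} and {b, b, c} give purity (2 + 2) / 6 ≈ 0.67
print(purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "c"]))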
Complete link
Average link
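In complete link, the dissimilarity between two groups is that of their most distant pair of members, while in average link it is the average over all pairs. The sketch below illustrates both criteria using SciPy on toy data; the data and parameter values are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

Z_complete = linkage(X, method="complete")   # inter-cluster distance = farthest pair
Z_average = linkage(X, method="average")     # inter-cluster distance = mean over all pairs

labels_complete = fcluster(Z_complete, t=3, criterion="maxclust")
labels_average = fcluster(Z_average, t=3, criterion="maxclust")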
Divisive clustering
• Another method involves building a minimum spanning tree from the dissimilarity graph and
creating new clusters by breaking the link with the largest dissimilarity.
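A minimal sketch of this MST-based approach; the toy data and the use of SciPy's graph routines are illustrative assumptions.

# Minimal sketch: split the data by breaking the largest edge of the MST
# built over the dissimilarity graph.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

D = squareform(pdist(X))                      # pairwise dissimilarity matrix
mst = minimum_spanning_tree(D).toarray()      # dense copy of the MST edge weights

# Break the MST edge with the largest dissimilarity to create two clusters.
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0.0
n_clusters, labels = connected_components(mst, directed=False)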
Dissimilarity analysis steps:
4. Continue moving objects from G to H until a stopping criterion is met. At each step, pick the point i* that maximizes the difference between its average dissimilarity to the remaining objects in G and its average dissimilarity to the objects in H:
   i* = argmax_{i ∈ G} [ d(i, G − {i}) − d(i, H) ],
where d(i, S) denotes the average dissimilarity between object i and the objects in the set S.
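A sketch of a single split of dissimilarity analysis, assuming a precomputed dissimilarity matrix D and assuming that H is seeded with the single object having the largest average dissimilarity to the others, which is the standard starting step for this method.

import numpy as np

def dissimilarity_analysis_split(D):
    n = D.shape[0]
    G = set(range(n))                       # start with all objects in one group G
    avg = D.sum(axis=1) / (n - 1)           # average dissimilarity of each object
    H = {int(np.argmax(avg))}               # seed H with the most dissimilar object
    G -= H
    while len(G) > 1:
        # For each i in G: average dissimilarity to the rest of G minus that to H.
        diffs = {i: D[i, list(G - {i})].mean() - D[i, list(H)].mean() for i in G}
        i_star = max(diffs, key=diffs.get)
        if diffs[i_star] <= 0:              # stop once no object is closer to H than to G
            break
        G.remove(i_star)
        H.add(i_star)
    return G, H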