Unsupervised Learning

Agenda: Unsupervised Learning

Clustering – K-means, Hierarchical

Clustering - DBSCAN

Principal Component Analysis


What is Cluster Analysis?
● Finding groups of objects such that the objects in a group will be similar (or related)
to one another and different from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.
K Means Clustering

• K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.

• It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The way the k-means algorithm works is as follows:

1. Initialize: choose the number of clusters \( k \) and randomly select \( k \) initial centroids from the dataset.
2. Repeat until convergence:
   a) Assignment step: for each data point \( x_i \), calculate the distance between \( x_i \) and each centroid \( c_j \), and assign \( x_i \) to the cluster \( j \) with the nearest centroid.
   b) Update step: for each cluster \( j \), update the centroid \( c_j \) as the mean of all data points assigned to cluster \( j \): \( c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \).
3. Check for convergence: if the centroids do not change, or change by less than a predefined threshold, stop.
4. Output: the final cluster assignments and centroids.

The approach k-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster; the M-step computes the centroid of each cluster.
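A minimal NumPy sketch of this loop (illustrative only; names such as `X`, `k`, and `max_iters` are assumptions, and a library implementation such as scikit-learn's `KMeans` would normally be preferred in practice):

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Plain k-means: random initialization, then alternate E- and M-steps."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct points from X as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2a. Assignment (E-step): each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Update (M-step): each centroid becomes the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 3. Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # 4. Output: final assignments and centroids
    return labels, centroids

# Example usage on toy 2-D data with two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```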
K Means Clustering – Few Things to note

1. Since clustering algorithms, including k-means, use distance-based measurements to determine the similarity between data points, it is recommended to standardize the data to have a mean of zero and a standard deviation of one, since the features in a dataset will almost always have different units of measurement (such as age vs. income).

2. Given k-means' iterative nature and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may get stuck in a local optimum and fail to converge to the global optimum. Therefore, it is recommended to run the algorithm with several different centroid initializations and keep the run that yields the lowest sum of squared distances (see the sketch below).

The distance measure used is Euclidean distance, which works for numerical variables only. K-means can also be adapted for:
1) Manhattan distance: useful for high-dimensional data or when dealing with grid-like structures. 2) Cosine similarity: often used for text data, where the angle between vectors is more meaningful than their magnitude.
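A short scikit-learn sketch of both recommendations: standardize the features, then let `KMeans` try several random initializations (`n_init=10` here is an assumed setting) and keep the run with the lowest SSE, which it exposes as `inertia_`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data with very different scales (e.g., age vs. income)
X = np.column_stack([np.random.uniform(20, 60, 200),         # age
                     np.random.uniform(20000, 120000, 200)])  # income

# 1. Standardize to zero mean, unit variance so no feature dominates the distance
X_scaled = StandardScaler().fit_transform(X)

# 2. Run k-means with several random initializations; the fit with the
#    lowest sum of squared distances (inertia_) is kept automatically
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
print(km.inertia_)      # SSE of the best run
print(km.labels_[:10])  # cluster assignment of the first ten points
```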
K Means Clustering – Evaluation Methods

Since this is an unsupervised problem, there is no right or wrong answer in terms of selecting the number of clusters. Domain knowledge and intuition might help.
Two metrics that may give us some intuition about k: 1) the elbow method, and 2) silhouette analysis.

Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick k at the spot where the SSE starts to flatten out, forming an elbow.
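A minimal sketch of the elbow plot with scikit-learn (the data `X` and the range of k values are assumptions; replace them with your own dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.randn(300, 2)  # placeholder data; replace with your dataset

sse = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to the nearest centroid

plt.plot(list(ks), sse, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()  # pick k where the curve starts to flatten out
```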
Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:
• Compute the average distance to all data points in the same cluster (a_i).
• Compute the average distance to all data points in the closest neighboring cluster (b_i).
• Compute the coefficient: \( s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \).

The silhouette coefficient can take values in the interval [-1, 1]:
• If it is 0, the sample is very close to the neighboring clusters.
• If it is 1, the sample is far away from the neighboring clusters.
• If it is -1, the sample is assigned to the wrong cluster.
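A short sketch of silhouette analysis with scikit-learn (the data `X` and the range of k values are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.randn(300, 2)  # placeholder data

for k in range(2, 7):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)          # mean coefficient over all samples
    per_sample = silhouette_samples(X, labels)   # s_i = (b_i - a_i) / max(a_i, b_i)
    print(f"k={k}: mean silhouette = {score:.3f}, worst sample = {per_sample.min():.3f}")
```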
Hierarchical Clustering

Hierarchical cluster analysis (HCA) is an unsupervised clustering algorithm which involves creating clusters that have a predominant ordering from top to bottom. This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is the most common type of hierarchical clustering, used to group objects into clusters based on their similarity. It is also known as AGNES (Agglomerative Nesting). It is a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

How does it work? (a code sketch follows these steps)
1. Make each data point a single-point cluster → forms N clusters.
2. Take the two closest data points and make them one cluster → forms N-1 clusters.
3. Take the two closest clusters and make them one cluster → forms N-2 clusters.
4. Repeat step 3 until you are left with only one cluster.
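A minimal scipy sketch of this bottom-up procedure (the toy data `X` and the choice of complete linkage are assumptions; the linkage options are covered on the next slide):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(20, 2)  # placeholder data

# Build the full merge history bottom-up (AGNES). Each row of Z records one
# merge: (cluster i, cluster j, merge distance, size of the new cluster),
# going from N singleton clusters down to a single cluster.
Z = linkage(X, method="complete")

# Cut the hierarchy into a chosen number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```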
Hierarchical Clustering

Agglomerative Hierarchical Clustering

There are several ways to measure the distance between clusters in order to decide the rules for clustering; they are often called linkage methods (a sketch of the corresponding options follows this list). Some of the common linkage methods are:
• Complete-linkage: the distance between two clusters is defined as the longest distance between two points, one in each cluster. Tends to create compact clusters and is less sensitive to noise and outliers.
• Single-linkage: the distance between two clusters is defined as the shortest distance between two points, one in each cluster. This linkage may be used to detect outliers in your dataset, since they will only be merged at the very end.
• Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster.
• Centroid-linkage: finds the centroid of cluster 1 and the centroid of cluster 2, and then calculates the distance between the two before merging.

• The choice of linkage method depends on your data and goals; there is no hard and fast rule that will always give good results.
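For reference, these linkage choices map directly onto scipy's `method` argument (a sketch under assumed toy data `X`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.randn(20, 2)  # placeholder data

# The same hierarchy-building call, with different cluster-distance definitions
Z_complete = linkage(X, method="complete")  # longest pairwise distance
Z_single   = linkage(X, method="single")    # shortest pairwise distance
Z_average  = linkage(X, method="average")   # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between centroids
```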
Hierarchical Clustering-Complete Linkage example
Steps to Calculate Complete Linkage

1. Data Points: Start with your dataset, which can be a set of points in a space (e.g., 2D coordinates).
2. Calculate Pairwise Distances: Compute the distances between every pair of points across the two clusters.
3. Identify the Maximum Distance: For the two clusters, find the maximum distance from all the computed distances.

Example Calculation
Let's say we have two clusters:
- Cluster A: points A1(1, 2), A2(2, 3)
- Cluster B: points B1(5, 6), B2(7, 8)

Focus on extremes: because complete linkage looks at the farthest points, it forces clusters to stay more compact. When merging clusters, if any pair of points in different clusters is farther apart than the maximum-distance threshold, the clusters won't merge, which helps maintain tighter groups.

Step 1: Calculate pairwise distances
Using Euclidean distance, \( d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2} \):
- \( d(A1, B1) = \sqrt{(1 - 5)^2 + (2 - 6)^2} = \sqrt{16 + 16} = \sqrt{32} \approx 5.66 \)
- \( d(A1, B2) = \sqrt{(1 - 7)^2 + (2 - 8)^2} = \sqrt{36 + 36} = \sqrt{72} \approx 8.49 \)
- \( d(A2, B1) = \sqrt{(2 - 5)^2 + (3 - 6)^2} = \sqrt{9 + 9} = \sqrt{18} \approx 4.24 \)
- \( d(A2, B2) = \sqrt{(2 - 7)^2 + (3 - 8)^2} = \sqrt{25 + 25} = \sqrt{50} \approx 7.07 \)

Step 2: Identify the maximum distance

The maximum distance is:
Max Distance = 8.49
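The same calculation can be checked in a few lines with scipy (a sketch reproducing the worked example above):

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1, 2], [2, 3]])   # Cluster A: A1, A2
B = np.array([[5, 6], [7, 8]])   # Cluster B: B1, B2

pairwise = cdist(A, B)           # Euclidean distances between every A-B pair
print(np.round(pairwise, 2))     # [[5.66 8.49]
                                 #  [4.24 7.07]]
print(pairwise.max())            # complete-linkage distance, approx. 8.49
```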
Hierarchical Clustering - Dendrogram
The dendrogram contains the memory of the hierarchical clustering algorithm, so just by looking at the dendrogram you can tell how the clusters were formed.
• The distance between data points represents dissimilarity.
• The height of the blocks represents the distance between clusters.
• Clades are the branches, arranged according to how similar (or dissimilar) they are. Clades that are close to the same height are similar to each other; clades with different heights are dissimilar, and the greater the difference in height, the greater the dissimilarity.
 Each clade has one or more leaves.
 Leaves A, B, and C are more similar to each other than they are to leaves D,
E, or F.
 Leaves D and E are more similar to each other than they are to leaves A, B,
C, or F.
 Leaf F is substantially different from all of the other leaves.
Hierarchical Clustering - Dendrogram

One question that might have intrigued you by now is: how do you decide when to stop merging the clusters?
You cut the dendrogram tree with a horizontal line at a height where the line can traverse the maximum distance up and down without intersecting a merging point.
For example, in the figure below, line L3 can traverse the maximum distance up and down without intersecting the merging points. So we draw a horizontal line there, and the number of vertical lines it intersects is the optimal number of clusters.

Number of clusters = 3
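A short scipy sketch of drawing the dendrogram and cutting it (the data `X`, linkage method, and the cut height of 5.0 are assumptions; in practice the height is read off the plot):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.randn(30, 2)  # placeholder data
Z = linkage(X, method="ward")

# Draw the dendrogram; the height of each merge is the cluster distance
dendrogram(Z)
plt.axhline(y=5.0, linestyle="--")  # assumed cut height, chosen from the plot
plt.show()

# Cutting at that height gives the cluster labels; alternatively, ask for a
# fixed number of clusters directly
labels_by_height = fcluster(Z, t=5.0, criterion="distance")
labels_by_count  = fcluster(Z, t=3, criterion="maxclust")
```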
Density Based Clustering Technique - DBSCAN
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. Moreover, they are also severely
affected by the presence of noise and outliers in the data. They are not able to form
clusters based on varying densities. That’s why we need DBSCAN – Density Based
Spatial Clustering of Applications with Noise

Two parameters for DBSCAN:

eps: defines the neighborhood around a data point, i.e. if the distance between two points is less than or equal to 'eps', they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.

MinPts: the minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts should be. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts should be chosen to be at least 3.

Uses: primarily geospatial data, or datasets that contain extreme observations.
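A minimal scikit-learn sketch (the half-moon data and the values eps=0.3, min_samples=5 are assumptions chosen for this toy example):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters where k-means struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)  # eps and min_samples are assumed values
labels = db.labels_                          # cluster ids; -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```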
DBSCAN- continued
DBSCAN is a resource-intensive algorithm. Why? Let's see:

DBSCAN creates a circle of epsilon radius around every data point and classifies the points into Core points, Border points, and Noise. A data point is a Core point if the circle around it contains at least 'minPoints' points. If the number of points is less than minPoints, it is classified as a Border point, and if there are no other data points around a data point within the epsilon radius, it is treated as Noise.

In the illustration (with minPoints = 3): all data points with at least 3 points in their circle, including themselves, are considered Core points, shown in red. All data points with fewer than 3 but more than 1 point in their circle, including themselves, are considered Border points, shown in yellow. Finally, data points with no point other than themselves inside their circle are considered Noise, shown in purple.

Reachability and Connectivity

Reachability states whether a data point can be accessed from another data point, directly or indirectly, whereas Connectivity states whether two data points belong to the same cluster or not.
Two points in DBSCAN can be referred to as:
• Directly Density-Reachable
• Density-Reachable
• Density-Connected
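A short sketch of recovering the same three categories from a fitted scikit-learn `DBSCAN` (the blob data and parameter values are assumptions; minPoints = 3 matches the example above):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # minPoints = 3, as in the example

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # points with >= min_samples neighbours within eps

noise_mask = db.labels_ == -1                 # no core point within eps: noise
border_mask = ~core_mask & ~noise_mask        # in a cluster but not core: border

print(core_mask.sum(), border_mask.sum(), noise_mask.sum())
```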
DBSCAN- continued
Reachability and Connectivity

A point X is directly density-reachable from point Y w.r.t. epsilon and minPoints if:

1. X belongs to the neighborhood of Y, i.e., dist(X, Y) <= epsilon
2. Y is a core point

Here, X is directly density-reachable from Y, but the reverse is not necessarily valid.

A point X is density-reachable from point Y w.r.t. epsilon and minPoints if there is a chain of points p1, p2, p3, …, pn with p1 = Y and pn = X such that pi+1 is directly density-reachable from pi.

Here, X is density-reachable from Y, with X being directly density-reachable from P2, P2 from P3, and P3 from Y. But the inverse of this is not valid.

A point X is density-connected to point Y w.r.t. epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t. epsilon and minPoints.

Here, both X and Y are density-reachable from O; therefore, we can say that X is density-connected to Y.
DBSCAN - pseudo algorithm (a minimal sketch follows the steps)
1. Find all the neighbor points within eps of every point and identify the core points, i.e. points with at least MinPts neighbors.

2. For each core point, if it is not already assigned to a cluster, create a new cluster.

3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within eps distance of c. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b and a end up in the same cluster.

4. Iterate through the remaining unvisited points in the dataset. Those points
that do not belong to any cluster are noise.
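A minimal NumPy sketch of these steps (an illustrative implementation, not an optimized one; the full pairwise-distance matrix also shows why DBSCAN can be resource intensive on large datasets):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: find core points, then grow each unassigned core
    point's cluster by chaining through its neighbours."""
    n = len(X)
    # Step 1: neighbourhoods within eps (each includes the point itself).
    # The n x n distance matrix is what makes this memory hungry.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)          # -1 = noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        # Step 2: start a new cluster from each core point not yet assigned
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster_id
        # Step 3: expand through density-connected points (the chaining process)
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                if is_core[j]:               # only core points extend the chain
                    queue.extend(neighbors[j])
        cluster_id += 1
    # Step 4: anything still labelled -1 never joined a cluster, i.e. noise
    return labels

# Example usage with assumed parameter values
X = np.random.randn(100, 2)
labels = dbscan(X, eps=0.5, min_pts=3)
```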
DBSCAN- Kdistance graph
To choose the value of ε, a k-distance graph is plotted using the distance of each point to its k = MinPts - 1 nearest neighbor.

The method consists of computing the k-nearest-neighbor distances for every point in the dataset: the idea is to calculate the distance of every point to its k-th nearest neighbor, where k is specified by the user and corresponds to MinPts. Next, these k-distances are plotted in ascending order. The aim is to determine the "knee", which corresponds to the optimal epsilon parameter. A knee corresponds to a threshold where a sharp change occurs along the k-distance curve. In the example figure, the optimal eps value is around a distance of 0.15.

1. Compute k-distances: for each point in your dataset, calculate the distance to its k-th nearest neighbor. The value of k is typically set to MinPts - 1.

2. Plot the k-distances: create a graph where the x-axis represents the points in your dataset (sorted by their k-distances) and the y-axis represents the distance to the k-th nearest neighbor of each point.

3. Identify epsilon: look for a point where the graph shows a significant "knee" or "elbow". This point indicates a transition from low-density regions (where distances are relatively small) to high-density regions (where distances start to increase significantly). The y-value at this knee point suggests an appropriate choice for ε.
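A short scikit-learn sketch of the k-distance graph (the data `X` and MinPts = 4 are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.randn(300, 2)   # placeholder data
min_pts = 4                   # assumed MinPts, so k = MinPts - 1 neighbours besides the point

# Distance to the k-th nearest neighbour for every point
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)          # column 0 is the point itself (distance 0)
k_dist = np.sort(dists[:, -1])       # k-th neighbour distance, sorted ascending

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts - 1}-th nearest neighbour")
plt.show()                           # eps is roughly the y-value at the knee of this curve
```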
Principal Component Analysis – Dimensionality
Reduction Technique
Principal component analysis (PCA) is a statistical method that reduces
the number of dimensions in a dataset by transforming the variables into
a smaller set of principal components.

How Do You Do a Principal Component Analysis?


1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix
to identify the principal components
4. Create a feature vector to decide which principal components to
keep
5. Recast the data along the principal components axes
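A compact NumPy sketch of these five steps end to end (the data matrix `X` with samples as rows and the choice of two retained components are assumptions; scikit-learn's `PCA` wraps the same procedure):

```python
import numpy as np

X = np.random.randn(100, 5)                  # placeholder data: 100 samples, 5 variables

# 1. Standardize each variable to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized variables (p x p)
C = np.cov(Z, rowvar=False)

# 3. Eigenvectors (directions) and eigenvalues (variance along them)
eigvals, eigvecs = np.linalg.eigh(C)         # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: keep the first n_components eigenvectors as columns
n_components = 2                             # assumed choice
W = eigvecs[:, :n_components]

# 5. Recast the data onto the principal component axes
X_pca = Z @ W                                # shape (100, n_components)
print(eigvals / eigvals.sum())               # proportion of variance per component
```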

What are principal components?

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until having something like what is shown in the scree plot below.
Principal Component Analysis – Dimensionality
Reduction Technique
How PCA Constructs the Principal Components

Principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set. For example, let's assume that the scatter plot of our data set is as shown below. Can we guess the first principal component? Yes, it's approximately the line that matches the purple marks, because it goes through the origin and it's the line along which the projection of the points (red dots) is the most spread out. Or, mathematically speaking, it's the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin).

The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first
principal component and that it accounts for the next highest variance.
This continues until a total of p principal components have been
calculated, equal to the original number of variables.
Principal Component Analysis – Dimensionality
Reduction Technique – Step-wise Calculation
Step1 - Standardization

The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA is that PCA is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. Transforming the data to comparable scales prevents this problem.
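For each variable, standardization subtracts the mean and divides by the standard deviation, \( z = \frac{x - \mu}{\sigma} \). A two-line sketch (the toy data is an assumption; scikit-learn's `StandardScaler` does the same thing):

```python
import numpy as np

X = np.random.randn(100, 3) * [1, 100, 0.01]   # variables on very different scales
Z = (X - X.mean(axis=0)) / X.std(axis=0)       # each column: zero mean, unit variance
```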
Principal Component Analysis – Dimensionality
Reduction Technique – Step-wise Calculation
Step 2 - Covariance Matrix Computation

The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.

What do the covariances that we have as entries of the matrix tell us about the correlations between the variables? It's actually the sign of the covariance that matters:
• If positive: the two variables increase or decrease together (correlated).
• If negative: one increases when the other decreases (inversely correlated).

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables.

For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

\( \begin{pmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{pmatrix} \)
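A small sketch of computing this matrix and reading the signs (the three-variable data is an assumption):

```python
import numpy as np

# Three standardized variables x, y, z as the columns of a data matrix
data = np.random.randn(100, 3)

C = np.cov(data, rowvar=False)   # 3 x 3 symmetric covariance matrix
print(np.round(C, 2))
# C[0, 1] is Cov(x, y): positive means x and y move together, negative means inversely
```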
Principal Component Analysis – Dimensionality
Reduction Technique – Step-wise Calculation
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.

What you first need to know about eigenvectors and eigenvalues is that they always come in pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number of dimensions of the data. For example, for a 3-dimensional data set there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding eigenvalues.

It is eigenvectors and eigenvalues that are behind all the magic of principal components, because the eigenvectors of the covariance matrix are actually the directions of the axes along which there is the most variance (most information), and these are what we call principal components. Eigenvalues are simply the coefficients attached to the eigenvectors, giving the amount of variance carried by each principal component.

By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.

Example: suppose that our data set is 2-dimensional with two variables x and y, and that the eigenvectors and eigenvalues of the covariance matrix are v1 (with eigenvalue λ1) and v2 (with eigenvalue λ2). If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector corresponding to the first principal component (PC1) is v1 and the one corresponding to the second principal component (PC2) is v2. After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of the eigenvalues.
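A short sketch of ranking the eigenpairs and computing the percentage of variance per component (the 2-variable data is an assumption):

```python
import numpy as np

C = np.cov(np.random.randn(100, 2), rowvar=False)  # covariance matrix of a 2-variable example
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                  # so that lambda_1 >= lambda_2
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()                # share of variance per component
print(explained)   # values like [0.96, 0.04] would mean PC1 carries 96% of the information
```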
Principal Component Analysis – Dimensionality
Reduction Technique – Step-wise Calculation
Step 4: Create a feature vector

In this step, we choose whether to keep all of these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.

Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only. Discarding the eigenvector v2 will reduce the dimensionality by 1 and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4 percent of the information, the loss is not important and we will still have the 96 percent of the information that is carried by v1.

Step 5: Recast the Data Along the Principal Components Axes

The aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the original (standardized) data set by the transpose of the feature vector.
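A short sketch of these last two steps, keeping v1 only and projecting the standardized data (the placeholder data `Z` is an assumption; the multiplication follows the transpose description above):

```python
import numpy as np

# Continuing the 2-D example: standardized data Z and eigenvectors sorted by eigenvalue
Z = np.random.randn(100, 2)                       # placeholder standardized data
C = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

feature_vector = eigvecs[:, :1]                   # keep v1 only, discard v2
final_data = (feature_vector.T @ Z.T).T           # FeatureVector^T x StandardizedData^T
print(final_data.shape)                           # (100, 1): one dimension kept
```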
