Unsupervised Learning
Clustering - DBSCAN
(Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.)
K Means Clustering
• The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
• It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The way the K-means algorithm works is as follows:
1. Initialize: Choose the number of clusters \( k \) and randomly select \( k \) initial centroids from the dataset.
2. Repeat until convergence:
   a) Assignment step: For each data point \( x_i \), calculate the distance between \( x_i \) and each centroid \( c_j \), and assign \( x_i \) to the cluster \( j \) with the nearest centroid.
   b) Update step: For each cluster \( j \), update the centroid \( c_j \) as the mean of all data points assigned to cluster \( j \): \( c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \)
3. Check for convergence: If the centroids do not change, or the change is below a predefined threshold, stop.
4. Output: The final cluster assignments and centroids.
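The steps above can be sketched directly in NumPy. This is a minimal illustration under simplifying assumptions (Euclidean distance, no handling of empty clusters); the function name `kmeans` and its parameters are chosen here just for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: randomly pick k points from the dataset as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2a. Assignment step: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 2b. Update step: recompute each centroid as the mean of its points
        #     (assumes no cluster goes empty, which is fine for a sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    # 4. Output: final cluster assignments and centroids
    return labels, centroids
```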
1. Since clustering algorithms, including K-means, use distance-based measurements to determine the similarity between data points, it is recommended to standardize the data to have a mean of zero and a standard deviation of one, because the features in a dataset almost always have different units of measurement, such as age vs. income.
2. Given the iterative nature of K-means and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may get stuck in a local optimum and fail to converge to the global optimum. Therefore, it is recommended to run the algorithm with several different centroid initializations and pick the run that yielded the lowest sum of squared distances.
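Both recommendations map directly onto scikit-learn, assuming it is available: `StandardScaler` handles the standardization, and the `n_init` argument of `KMeans` reruns the algorithm with several random initializations and keeps the run with the lowest SSE (exposed as `inertia_`). The toy data below is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy data with very different units (e.g., age vs. income)
X = np.array([[25, 40_000], [32, 52_000], [47, 150_000], [51, 160_000]], dtype=float)

# 1. Standardize to zero mean and unit standard deviation
X_scaled = StandardScaler().fit_transform(X)

# 2. Run K-means with 10 different centroid initializations;
#    the fit with the lowest sum of squared distances (inertia_) is kept
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
print(km.labels_, km.inertia_)
```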
The distance measure used is Euclidean distance, which works for numerical variables only. K-means can also be adapted to use:
1) Manhattan distance: useful for high-dimensional data or when dealing with grid-like structures. 2) Cosine similarity: often used for text data, where the angle between vectors is more meaningful than their magnitude.
K Means Clustering – Evaluation Methods
Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick k at the spot where the SSE starts to flatten out, forming an elbow.
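A sketch of the elbow method using scikit-learn, assuming that library; the SSE described above is exposed as the `inertia_` attribute, and the range of k values below is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    """Plot SSE (inertia) against k and look for the 'elbow'."""
    sse = []
    ks = range(1, k_max + 1)
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(km.inertia_)  # sum of squared distances to the nearest centroid
    plt.plot(ks, sse, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("SSE (inertia)")
    plt.show()
```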
Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:
- Compute the average distance to all data points in the same cluster (a_i).
- Compute the average distance to all data points in the closest other cluster (b_i).
- Compute the coefficient: s_i = (b_i - a_i) / max(a_i, b_i)
The silhouette coefficient can take values in the interval [-1, 1]:
- If it is 0, the sample is very close to the neighboring clusters.
- If it is 1, the sample is far away from the neighboring clusters.
- If it is -1, the sample is assigned to the wrong cluster.
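Assuming scikit-learn, the mean silhouette coefficient over all samples can be computed directly with `silhouette_score`; looping over candidate k values, as sketched below, is just one common way to use it.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values=(2, 3, 4, 5, 6)):
    """Return the k with the highest mean silhouette coefficient, plus all scores."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # mean of (b_i - a_i) / max(a_i, b_i)
    return max(scores, key=scores.get), scores
```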
Hierarchical Clustering
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage
Methods. Some of the common linkage methods are:
Complete-linkage: the distance between two clusters is defined as the longest distance between two points, one in each cluster. Tends to create compact clusters and is less sensitive to noise and outliers.
Single-linkage: the distance between two clusters is defined as the shortest distance between two points, one in each cluster. This linkage may be used to detect extreme values in your dataset, which may be outliers, as they will only be merged at the end.
Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster to every point in
the other cluster.
Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.
• The choice of linkage method depends entirely on you, and there is no hard-and-fast rule that will always give you good results.
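In scikit-learn the linkage rule is simply a parameter of `AgglomerativeClustering`, so the methods above can be compared on the same data; this is an illustrative sketch, and the function name is chosen just for the example. Centroid linkage is not offered by `AgglomerativeClustering`, but it is available via `scipy.cluster.hierarchy.linkage(X, method="centroid")`.

```python
from sklearn.cluster import AgglomerativeClustering

def compare_linkages(X, n_clusters=3):
    """Fit agglomerative clustering with each common linkage rule and collect the labels."""
    results = {}
    for method in ("complete", "single", "average"):
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=method)
        results[method] = model.fit_predict(X)
    return results
```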
Hierarchical Clustering – Complete Linkage Example
Steps to Calculate Complete Linkage
1. Data Points: Start with your dataset, which can be a set of points in a space (e.g., 2D coordinates).
2. Calculate Pairwise Distances: Compute the distances between every pair of points across the two clusters.
3. Identify the Maximum Distance: For the two clusters, find the maximum distance among all the computed distances.
Focus on extremes: because complete linkage looks at the farthest points, it forces clusters to stay more compact. When merging clusters, if any pair of points in different clusters is farther apart than the maximum distance threshold, the clusters won't merge, which helps maintain tighter groups.
Example Calculation
Let's say we have two clusters:
- Cluster A: points A1(1, 2), A2(2, 3)
- Cluster B: points B1(5, 6), B2(7, 8)
Step 1: Calculate Pairwise Distances
Using Euclidean distance: d(p, q) = sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}
- Distance between A1 and B1: d(A1, B1) = sqrt{(1 - 5)^2 + (2 - 6)^2} = sqrt{16 + 16} = sqrt{32} approx 5.66
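To complete the example, the remaining pairwise distances follow the same formula:
- d(A1, B2) = sqrt{(1 - 7)^2 + (2 - 8)^2} = sqrt{36 + 36} = sqrt{72} approx 8.49
- d(A2, B1) = sqrt{(2 - 5)^2 + (3 - 6)^2} = sqrt{9 + 9} = sqrt{18} approx 4.24
- d(A2, B2) = sqrt{(2 - 7)^2 + (3 - 8)^2} = sqrt{25 + 25} = sqrt{50} approx 7.07
Steps 2-3: the complete-linkage distance between Cluster A and Cluster B is the maximum of these values, d(A1, B2) approx 8.49.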
One question that might have intrigued you by now is: how do you decide when to stop merging the clusters?
You cut the dendrogram tree with a horizontal line at a height where the line can traverse the maximum distance up and down without intersecting a merging point.
For example, in the figure below, L3 can traverse the maximum distance up and down without intersecting the merging points. So we draw a horizontal line there, and the number of vertical lines it intersects is the optimal number of clusters.
Number of clusters = 3
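Assuming SciPy is available, the dendrogram and the horizontal cut described above can be produced as sketched below; the cut height is whatever distance corresponds to the longest un-intersected vertical stretch in your own dendrogram, so the parameter here is a placeholder.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

def plot_and_cut(X, cut_height):
    """Build a complete-linkage dendrogram and cut it at a chosen height."""
    Z = linkage(X, method="complete")          # merge history (the dendrogram data)
    dendrogram(Z)
    plt.axhline(y=cut_height, linestyle="--")  # the horizontal cutting line
    plt.show()
    # Cluster labels obtained by cutting the tree at that distance
    return fcluster(Z, t=cut_height, criterion="distance")
```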
Density Based Clustering Technique - DBSCAN
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. Moreover, they are also severely
affected by the presence of noise and outliers in the data. They are not able to form
clusters based on varying densities. That’s why we need DBSCAN – Density-Based
Spatial Clustering of Applications with Noise.
eps: Defines the neighborhood around a data point, i.e., if the distance between two points is less than or equal to eps, they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will end up in the same cluster. One way to find the eps value is based on the k-distance graph.
MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. MinPts must be at least 3.
Uses – Primarily used on geospatial data, or on datasets that contain extreme observations.
DBSCAN- continued
DBSCAN is a resource-intensive algorithm. Why? Let's see:
DBSCAN creates a circle of epsilon radius around every data point and classifies the points into core points, border points, and noise. A data point is a core point if the circle around it contains at least 'minPoints' points. If the number of points is less than minPoints but the point lies within the epsilon radius of a core point, it is classified as a border point; and if there are no other data points within the epsilon radius of a data point, it is treated as noise.
Reachability states whether a data point can be accessed from another data point directly or indirectly, whereas connectivity states whether two data points belong to the same cluster or not.
Two points in DBSCAN can be referred to as:
• Directly density-reachable
• Density-reachable
• Density-connected
In the figure, all the data points with at least 3 points inside their circle (including the point itself) are considered core points, shown in red. All the data points with fewer than 3 but more than 1 point inside their circle (including the point itself) are considered border points, shown in yellow. Finally, data points with no point other than themselves inside their circle are considered noise, shown in purple.
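With scikit-learn's `DBSCAN`, the same three roles can be recovered after fitting: core points are listed in `core_sample_indices_`, noise points receive the label -1, and the remaining points are border points. The eps and min_samples values below are placeholders to be tuned for your data (min_samples counts the point itself, matching the description above).

```python
import numpy as np
from sklearn.cluster import DBSCAN

def classify_points(X, eps=0.5, min_samples=3):
    """Label each point as 'core', 'border', or 'noise' after running DBSCAN."""
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    roles = np.full(len(X), "border", dtype=object)
    roles[db.core_sample_indices_] = "core"   # enough neighbors within eps
    roles[db.labels_ == -1] = "noise"         # not reachable from any core point
    return db.labels_, roles
```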
DBSCAN- continued
Reachability and Connectivity
A point X is density-connected to a point Y w.r.t. epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t. epsilon and minPoints.
Here, both X and Y are density-reachable from O; therefore, we can say that X is density-connected to Y.
DBSCAN- pseudo algo
1. Find all the neighbor points within eps of every point and identify the core points, i.e., the points with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
Points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both points a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not belong to any cluster are noise.
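A compact sketch of the pseudo-algorithm above in plain Python/NumPy; it favors readability over speed (it builds the full pairwise distance matrix), and the function and variable names are chosen just for this illustration.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: label -1 marks noise, labels 0..k-1 mark clusters."""
    n = len(X)
    labels = np.full(n, -1)           # everything starts as noise/unassigned
    visited = np.zeros(n, dtype=bool)
    # Pairwise Euclidean distances; neighbors[i] = indices within eps of point i
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                               # not an unassigned core point
        # Step 2: a new cluster is seeded by an unassigned core point
        visited[i] = True
        labels[i] = cluster_id
        queue = list(neighbors[i])
        # Step 3: iteratively absorb all density-connected points (the chaining process)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id             # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:   # j is itself a core point: keep chaining
                    queue.extend(neighbors[j])
        cluster_id += 1
    # Step 4: anything still labeled -1 is noise
    return labels
```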
DBSCAN- Kdistance graph
To choose the value of ε, a k-distance graph is plotted by ordering the distances of every point to its k = MinPts − 1 nearest neighbor from the largest to the smallest value.
The method proposed here consists of computing the k-nearest-neighbor distances in a matrix of points. The idea is to calculate the average of the distances of every point to its k nearest neighbors, where the value of k is specified by the user and corresponds to MinPts. Next, these k-distances are plotted in ascending order. The aim is to determine the "knee", which corresponds to the optimal epsilon parameter. A knee corresponds to a threshold where a sharp change occurs along the k-distance curve. In the accompanying plot, the optimal eps value is around a distance of 0.15.
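A sketch of the k-distance plot using scikit-learn's `NearestNeighbors`, assuming that library; k here plays the role of MinPts, and the "knee" still has to be read off the plot by eye.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k):
    """Plot each point's distance to its k-th nearest neighbor, sorted ascending."""
    # k + 1 because each point counts as its own 0-th neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    k_dist = np.sort(distances[:, -1])   # distance to the k-th neighbor, ascending
    plt.plot(k_dist)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to {k}-th nearest neighbor")
    plt.show()                           # eps ~ distance at the 'knee' of this curve
```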
The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first
principal component and that it accounts for the next highest variance.
This continues until a total of p principal components have been
calculated, equal to the original number of variables.
Principal Component Analysis – Dimensionality Reduction Technique – Step-wise Calculation
Step 1 – Standardization
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they always come in pairs, so that every eigenvector has an eigenvalue. Also, their number is equal to the number of dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, and therefore 3 eigenvectors with 3 corresponding eigenvalues.
If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second principal component (PC2) is v2. After obtaining the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of the eigenvalues.
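The eigenvalue ranking and variance percentages described above can be reproduced with NumPy on standardized data; this is an illustrative sketch rather than a full PCA implementation, and the function name is chosen for the example.

```python
import numpy as np

def pca_eigen(X_std):
    """Eigen-decompose the covariance matrix of standardized data (rows = samples)."""
    cov = np.cov(X_std, rowvar=False)                # covariance matrix (features x features)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: suited to symmetric matrices
    order = np.argsort(eigenvalues)[::-1]            # rank eigenvalues in descending order
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()      # % of variance carried by each PC
    return eigenvalues, eigenvectors, explained
```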
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.
Discarding the eigenvector v2 will reduce dimensionality by 1 and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4 percent of the information, the loss will not be important, and we will still have the 96 percent of the information that is carried by v1.
Step 5: Recast the Data Along the Principal Components Axes
The aim is to use the feature vector formed using the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the original data set by the transpose of the feature vector.
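Continuing the sketch above: keeping only the first p eigenvectors forms the feature vector, and the final data set follows from the transpose multiplication described in Step 5 (equivalently, `X_std @ W` in the usual row-per-sample convention). The names here are illustrative.

```python
import numpy as np

def project(X_std, eigenvectors, p):
    """Recast standardized data onto the first p principal components."""
    W = eigenvectors[:, :p]          # feature vector: kept eigenvectors as columns
    # FinalData^T = W^T @ X_std^T  <=>  FinalData = X_std @ W
    return X_std @ W
```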