
ALY 6040: DATA MINING

CLUSTERING

FATEMEH (PARISSA) AHMADI, PH.D.


INTRODUCTION
 In general, clustering is the use of unsupervised techniques for grouping similar
objects. In machine learning, unsupervised refers to the problem of finding hidden
structure within unlabeled data (You can use a labeled dataset for clustering practice if
you remove the label column). Clustering techniques are unsupervised in the sense that
the data scientist does not determine, in advance, the labels to apply to the clusters.
 For example, based on customers’ personal income, it is straightforward to divide the
customers into three groups using arbitrarily selected threshold values, as follows (a small R
sketch of this rule-based grouping follows the list):
• Earn less than $10,000
• Earn between $10,000 and $99,999
• Earn $100,000 or more
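
A minimal R sketch (not from the original slides) of this rule-based grouping using cut(); the income values are made up for illustration:

# Hypothetical incomes in dollars; the breakpoints mirror the three groups above
income <- c(8500, 42000, 125000, 73000, 9900, 250000)
groups <- cut(income,
              breaks = c(-Inf, 10000, 100000, Inf),
              labels = c("< $10,000", "$10,000 - $99,999", "$100,000 or more"),
              right  = FALSE)   # left-closed intervals, so $10,000 falls in the middle group
table(groups)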
DIFFERENT CLUSTERING CATEGORIES
K-MEANS

 Given a collection of objects each with n measurable attributes, k-means is an analytical
technique that, for a chosen value of k, identifies k clusters of objects based on the objects’
proximity to the center of the k groups.
 The center is determined as the arithmetic average (mean) of each cluster’s n-
dimensional vector of attributes.
 Next figure illustrates three clusters of objects with two attributes. Each object in the
dataset is represented by a small dot color-coded to the closest large dot, the mean of
the cluster.
K-MEANS

K-Means for K = 3
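
As a hedged illustration (not taken from the slides), a result like the figure can be produced in R with kmeans() on synthetic two-attribute data:

set.seed(123)
# Three loose groups of points with two attributes (synthetic, for illustration only)
x <- rbind(matrix(rnorm(100, mean = 0,  sd = 0.4), ncol = 2),
           matrix(rnorm(100, mean = 2,  sd = 0.4), ncol = 2),
           matrix(rnorm(100, mean = -2, sd = 0.4), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 25)        # k = 3; nstart reruns from several random starts
plot(x, col = km$cluster, pch = 19)              # each point colored by its assigned cluster
points(km$centers, col = 1:3, pch = 8, cex = 2)  # the three cluster means (centroids)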
USE CASES
OVERVIEW OF THE METHOD

Step 1: Choose the value of k and the k initial guesses for the centroids (for example, k
points selected at random).

https://www.bing.com/videos/search?ptag=ICO-a37ed4490a84afc3&pc=1MSC&q=kmeans+algorithm+video&ru=%2fsearch%3fptag%3dICO-a37ed4490a84afc3%26form%3dINCOH1%26pc%3d1MSC%26q%3dkmeans%2520algorithm%2520video&view=detail&mmscn=vwrc&mid=FFB271DF90375FC9E55CFFB271DF90375FC9E55C&FORM=WRVORC
OVERVIEW OF THE METHOD

Step 2: Compute the distance from each data point to each centroid, and assign each point
to the cluster of its closest centroid.
OVERVIEW OF THE METHOD

Step 3: Recompute the centroid of each newly defined cluster as the mean of its points;
repeat Steps 2 and 3 until the centroids no longer change (convergence).
OVERVIEW OF THE METHOD
DETERMINING THE NUMBER OF CLUSTERS

 With the preceding algorithm, k clusters can be identified in a given dataset, but what
value of k should be selected?
 The value of k can be chosen based on a reasonable guess or some predefined
requirement. However, even then, it would be good to know how much better or
worse having k clusters versus k – 1 or k + 1 clusters would be in explaining the
structure of the data. Next, a heuristic using the Within Sum of Squares (WSS) metric
is examined to determine a reasonably optimal value of k. Using a distance function d
(typically the Euclidean distance), WSS is defined as shown in the equation below:
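
The equation itself appeared as an image on the original slide; a standard form consistent with the description on the following slide (M data points p_i, with q^(i) the centroid closest to the ith point) is:

WSS = \sum_{i=1}^{M} d\left(p_i, q^{(i)}\right)^2

where d(., .) is the chosen distance function.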
DETERMINING THE NUMBER OF CLUSTERS

 In other words, WSS is the sum of the squares of the distances between each data
point and the closest centroid. The term q(i) indicates the closest centroid that is
associated with the ith point. If the points are relatively close to their respective
centroids, the WSS is relatively small. Thus, if k + 1 clusters do not greatly reduce the
value of WSS from the case with only k clusters, there may be little benefit to adding
another cluster.
CLUSTERING-EXAMPLE

Notes on the R code shown on the slide:
• Symbol size relative to the default size of 1 (the cex argument in plot()).
• Plot character: star (pch = 8).
• apply(dataset, MARGIN, FUN): MARGIN = 2 applies FUN column-wise (MARGIN = 1 would be
row-wise); var computes the variance of each column.
CLUSTERING-EXAMPLE

An elbow at k = 2 is clear. In cases where the elbow location is
ambiguous, other metrics, such as the Silhouette coefficient, should be used.
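
A minimal R sketch of the elbow heuristic behind this plot (the name of the data object is assumed; the slide screenshots are not reproduced verbatim):

# Assume `data` is a numeric data frame or matrix
wss <- numeric(15)
wss[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))   # total sum of squares = WSS for k = 1
for (k in 2:15) {
  wss[k] <- sum(kmeans(data, centers = k, nstart = 25)$withinss)
}
plot(1:15, wss, type = "b", pch = 8,                    # pch = 8 draws the star symbol
     xlab = "Number of Clusters k",
     ylab = "Within Sum of Squares (WSS)")
# Look for the "elbow": the k beyond which WSS stops dropping sharply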
CLUSTERING-EXAMPLE

[Figure: "Best Number of Clusters" plotted against "Number of Measures"]
CLUSTERING-EXAMPLE
PAMK (PARTITIONING AROUND MEDOIDS)
ADDITIONAL CONSIDERATIONS

 The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important
to rerun the k-means analysis several times for a particular value of k to ensure the cluster results
provide the overall minimum WSS. As seen earlier, this task is accomplished in R by using the nstart
option in the kmeans() function call.
 Here, the use of the Euclidean distance function is presented to assign the points to the closest
centroids. Other possible function choices include the Cosine similarity and the Manhattan distance
functions. The Cosine similarity function is often chosen to compare two documents based on the
frequency of each word that appears in each of the documents.
 For two points, p and q, at (p1, p2, …, pn) and (q1, q2, …, qn), respectively, the Manhattan distance, d1,
between p and q is expressed as shown in the equation below:
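
The equation was an image on the slide; the standard Manhattan (L1) distance, consistent with the worked example on the next slide, is:

d_1(p, q) = \sum_{i=1}^{n} \lvert p_i - q_i \rvert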
ADDITIONAL CONSIDERATIONS

 The Manhattan distance function is analogous to the distance traveled by a car in a city, where the
streets are laid out in a rectangular grid (such as city blocks). In Euclidean distance, the measurement is
made in a straight line. Using previous equation, the distance from (1, 1) to (4, 5) would be |1 – 4| + |1
– 5| = 7. From an optimization perspective, if there is a need to use the Manhattan distance for a
clustering analysis, the median is a better choice for the centroid than use of the mean.
 K-means clustering is applicable to objects that can be described by attributes that are numerical with
a meaningful distance measure. However, k-means does not handle categorical variables well.
 For example, suppose a clustering analysis is to be conducted on new car sales. Among other attributes,
such as the sale price, the color of the car is considered important. Although one could assign numerical
values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful to consider that yellow is
as close to red as yellow is to green from a clustering perspective. In such cases, it may be necessary to
use an alternative clustering methodology.
ADDITIONAL ALGORITHMS

 The k-means clustering method is easily applied to numeric data where the concept of
distance can naturally be applied. However, it may be necessary or desirable to use an
alternative clustering algorithm.
 As discussed at the end of the previous section, k-means does not handle categorical data.
In such cases, k-modes is a commonly used method for clustering categorical data based
on the number of differences in the respective components of the attributes.
 For example, if each object has four attributes, the distance from (a, b, e, d) to (d, d, d, d) is 3.
In R, the kmodes() function in the klaR package implements this method.
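
A brief hedged sketch of k-modes with the klaR package (the data frame below is invented for illustration):

library(klaR)
# Hypothetical categorical attributes for a handful of cars
cars <- data.frame(color = c("red", "red", "yellow", "green", "green", "yellow"),
                   body  = c("sedan", "sedan", "suv", "suv", "sedan", "suv"))
set.seed(42)
km <- kmodes(cars, modes = 2)   # distance = number of attributes that differ
km$cluster                      # cluster assignment for each row
km$modes                        # the representative mode (most frequent values) of each cluster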
ADDITIONAL ALGORITHMS

 Because k-means and k-modes divide the entire dataset into distinct groups, both approaches
are considered partitioning methods. A third partitioning method is known as Partitioning
around Medoids (PAM).
 In general, a medoid is a representative object in a set of objects. In clustering, the medoids
are the objects in each cluster that minimize the sum of the distances from the medoid to the
other objects in the cluster.
 The advantage of using PAM is that the “center” of each cluster is an actual object in
the dataset. PAM is implemented in R by the pam() function included in the cluster R
package. The fpc R package includes a function pamk(), which uses the pam() function to find
the optimal value for k.
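
A minimal sketch of pam() and pamk() (the iris measurements are used only as a convenient numeric example):

library(cluster)
library(fpc)
x <- iris[, 1:4]                 # four numeric attributes

p <- pam(x, k = 3)               # medoids are actual observations from x
p$medoids

pk <- pamk(x, krange = 2:10)     # searches the range of k for a good value
pk$nc                            # suggested number of clusters
pk$pamobject$medoids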
ADDITIONAL ALGORITHMS

 Other clustering methods include hierarchical agglomerative clustering and density-based
clustering methods. In hierarchical agglomerative clustering, each object is initially placed in
its own cluster. The clusters are then combined with the most similar cluster. This process is
repeated until one cluster, which includes all the objects, exists. The R stats package includes
the hclust() function for performing hierarchical agglomerative clustering.
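
A short sketch of hierarchical agglomerative clustering with hclust() (the data choice and cut point are illustrative):

x <- iris[, 1:4]
d <- dist(x)                          # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # repeatedly merge the most similar clusters
plot(hc)                              # dendrogram of the merge sequence
groups <- cutree(hc, k = 3)           # cut the tree to obtain 3 clusters
table(groups)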

 In density-based clustering methods, the clusters are identified by the concentration of
points. The fpc R package includes a function, dbscan(), to perform density-based clustering
analysis. Density-based clustering can be useful to identify irregularly shaped clusters.
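
A minimal density-based sketch with fpc::dbscan() (the eps and MinPts values are illustrative, not prescribed by the slides):

library(fpc)
x <- iris[, 1:4]
db <- dbscan(x, eps = 0.5, MinPts = 5)   # cluster 0 holds the noise/outlier points
db$cluster
plot(db, x)                              # points colored by the discovered clusters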
DBSCAN

▪ Core Points
▪ Border/Boundary Points
▪ Noise/Outlier

https://www.bing.com/videos/search?q=dbscan+algorithm+video&qs=n&sp=-1&lq=0&pq=dbscan+algorithm+video&sc=6-
22&sk=&cvid=F20866D7EA994B10A858FF746FB4DD29&ghsh=0&ghacc=0&ghpl=&ru=%2fsearch%3fq%3ddbscan%2balgorithm%2bvideo
%26qs%3dn%26form%3dQBRE%26sp%3d-1%26lq%3d0%26pq%3ddbscan%2balgorithm%2bvideo%26sc%3d6-
22%26sk%3d%26cvid%3dF20866D7EA994B10A858FF746FB4DD29%26ghsh%3d0%26ghacc%3d0%26ghpl%3d&view=detail&mmscn=vwr
c&mid=4F8A4D63F261C954ABF64F8A4D63F261C954ABF6&FORM=WRVORC

https://www.bing.com/videos/riverview/relatedvideo?&q=dbscan&&mid=35884E6B4BFFDAB2F00035884E6B4BFFDAB2F000&&FORM=VRDGAR
DBSCAN

https://www.bing.com/images/search?view=detailV2&ccid=z9asJXY5&id=D21C5B3AF4E34CE3433710B7AE71DBC8394C7621&thid=OIP.z9asJXY5UkR
JZ3NdZ5TpqAHaFg&mediaurl=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fwww.researchgate.net%2Fprofile%2FYijin_Liu%2Fpublication%2F308750501%2Ffigure%2Fdownload%
2Ffig4%2FAS%3A412083041652736%401475259661770%2FSchematic-drawings-of-the-DBSCAN-clustering-algorithm-Panel-a-shows-the-
clustering.png&cdnurl=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fth.bing.com%2Fth%2Fid%2FR.cfd6ac25763952444967735d6794e9a8%3Frik%3DIXZMOcjbca63EA%26pid%3DI
mgRaw%26r%3D0&exph=510&expw=685&q=DBSCAN+Clustering&simid=607997061279082409&form=IRPRST&ck=BFA17CD056C1FA9D9F0093F
C4AF169F1&selectedindex=3&ajaxhist=0&ajaxserp=0&vt=0
DBSCAN

[Code screenshot; the callouts reference diff(dist) and 1/length(dist)]
DBSCAN

KNNdistplot:

▪ Tries to find a good value for eps based on the KNN distance.
▪ The KNN distance of a point is the distance from that point to its k-th nearest neighbor.
▪ We seek a value for eps at the knee of the line.
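
A hedged sketch using the dbscan package: kNNdistplot() to pick eps at the knee, then clustering and a hull plot (the chosen values are illustrative):

library(dbscan)
x <- as.matrix(iris[, 1:4])
kNNdistplot(x, k = 4)          # sorted distance of every point to its 4th nearest neighbor
abline(h = 0.5, lty = 2)       # eyeball the knee of the curve; here roughly eps = 0.5
db <- dbscan(x, eps = 0.5, minPts = 5)
db$cluster                     # 0 marks noise/outlier points
hullplot(x, db)                # convex hulls drawn around each discovered cluster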
DBSCAN
HULLPLOT FOR DBSCAN

HullPlot
DBSCAN
plot(iris3, col=db1$cluster)
DBSCAN

https://www.bing.com/images/search?view=detailV2&ccid=3LdJGuOj&id=8CC2488C0F8B17E22DD76E9D59F2ACDFDA158EEE&thid=OIP.3LdJGuOjolHFG7m8W2NmXQHaCw&mediaurl=h
ttps%3a%2f%2fmiro.medium.com%2fmax%2f1200%2f1*KqWII7sFp1JL0EXwJGpqFw.png&cdnurl=https%3a%2f%2fround-lake.dustinice.workers.dev%3a443%2fhttps%2fth.bing.com%2fth%2fid%2fR.dcb7491ae3a3a251c51bb9bc5b63665d%3frik%3d7
o4V2t%252bs8lmdbg%26pid%3dImgRaw%26r%3d0&exph=448&expw=1200&q=dbscan&simid=607992534376195191&FORM=IRPRST&ck=D05FFCD01B189C12EA922B280124E7E8&selected
Index=3&ajaxhist=0&ajaxserp=0
KMEANS IN PYTHON

KMEANS IN PYTHON

KMEANS IN PYTHON
DBSCAN IN PYTHON

The default values for the parameters: eps=0.5, min_samples=5

DBSCAN IN PYTHON

DBSCAN IN PYTHON

▪ algorithm{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’
▪ The algorithm to be used by the NearestNeighbors module to compute pointwise
distances and find nearest neighbors.
DBSCAN IN PYTHON

REFERENCE

 Data Science and Big Data Analytics, Wiley, 2015. Chapter 4: Advanced Analytical
Theory and Methods: Clustering.
