4 Clustering
CLUSTERING
[Figure: k-means clustering result for K = 3]
USE CASES
OVERVIEW OF THE METHOD
Step 1: Choose the value of k and the k initial guesses for the centroids.
Video (k-means algorithm): https://www.bing.com/videos/search?ptag=ICO-a37ed4490a84afc3&pc=1MSC&q=kmeans+algorithm+video&ru=%2fsearch%3fptag%3dICO-a37ed4490a84afc3%26form%3dINCOH1%26pc%3d1MSC%26q%3dkmeans%2520algorithm%2520video&view=detail&mmscn=vwrc&mid=FFB271DF90375FC9E55CFFB271DF90375FC9E55C&FORM=WRVORC
Step 2: Compute the distance from each data point to each centroid, and assign each point to its closest centroid.
Step 3: Compute the centroid (the mean) of each newly defined cluster, and repeat Steps 2 and 3 until the cluster assignments no longer change.
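To make Steps 1 to 3 concrete, below is a minimal sketch of the k-means loop in Python with NumPy. The synthetic data, the random initialization scheme, and the convergence test are illustrative assumptions and are not part of the original slides, which use R.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids (here, k distinct data points picked at random).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it.
        # (Empty clusters are not handled; this sketch assumes every cluster keeps some points.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat Steps 2 and 3 until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Small synthetic example with three well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ((0, 0), (3, 3), (0, 4))])
labels, centroids = kmeans(X, k=3)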
DETERMINING THE NUMBER OF CLUSTERS
With the preceding algorithm, k clusters can be identified in a given dataset, but what
value of k should be selected?
The value of k can be chosen based on a reasonable guess or some predefined
requirement. However, even then, it would be good to know how much better or
worse having k clusters versus k – 1 or k + 1 clusters would be in explaining the
structure of the data. Next, a heuristic using the Within Sum of Squares (WSS) metric is examined to determine a reasonably optimal value of k. Using the Euclidean distance function

dist(p, q) = sqrt((p1 – q1)² + (p2 – q2)² + … + (pn – qn)²),

WSS is defined as

WSS = Σ_{i=1..M} dist(pi, q(i))²,

where M is the number of points in the dataset.
In other words, WSS is the sum of the squares of the distances between each data
point and the closest centroid. The term q(i) indicates the closest centroid that is
associated with the ith point. If the points are relatively close to their respective
centroids, the WSS is relatively small. Thus, if k + 1 clusters do not greatly reduce the
value of WSS from the case with only k clusters, there may be little benefit to adding
another cluster.
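A sketch of this heuristic in Python: scikit-learn's inertia_ attribute is exactly the WSS of a fitted k-means model, so computing it for several values of k reveals the elbow. The synthetic data and the range of k values below are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(60, 2)) for c in ((0, 0), (4, 0), (2, 3))])

# Compute WSS for k = 1..8 and look for the elbow, i.e. the k beyond which
# adding another cluster no longer reduces WSS substantially.
wss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # inertia_ = sum of squared distances to the closest centroid

for k, w in enumerate(wss, start=1):
    print(f"k = {k}: WSS = {w:.1f}")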
CLUSTERING-EXAMPLE
[Example scatterplot: point size specified relative to the default size of 1, with the plot character set to a star.]
The k-means algorithm is sensitive to the starting positions of the initial centroids. Thus, it is important
to rerun the k-means analysis several times for a particular value of k to ensure the cluster results
provide the overall minimum WSS. As seen earlier, this task is accomplished in R by using the nstart
option in the kmeans() function call.
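In Python, scikit-learn's KMeans offers an analogous safeguard through its n_init parameter; the snippet below is a sketch under that assumption, with placeholder data.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(150, 2))  # placeholder data matrix

# n_init plays the same role as R's nstart: run 25 random centroid
# initializations and keep the solution with the lowest WSS (inertia_).
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)
print(km.inertia_)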
Here, the use of the Euclidean distance function is presented to assign the points to the closest
centroids. Other possible function choices include the Cosine similarity and the Manhattan distance
functions. The Cosine similarity function is often chosen to compare two documents based on the
frequency of each word that appears in each of the documents.
For two points, p and q, at (p1, p2, …, pn) and (q1, q2, …, qn), respectively, the Manhattan distance, d1, between p and q is expressed as shown in the equation below:

d1(p, q) = |p1 – q1| + |p2 – q2| + … + |pn – qn|
ADDITIONAL CONSIDERATIONS
The Manhattan distance function is analogous to the distance traveled by a car in a city, where the
streets are laid out in a rectangular grid (such as city blocks). In Euclidean distance, the measurement is
made in a straight line. Using the previous equation, the distance from (1, 1) to (4, 5) would be |1 – 4| + |1
– 5| = 7. From an optimization perspective, if there is a need to use the Manhattan distance for a
clustering analysis, the median is a better choice for the centroid than use of the mean.
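To illustrate the three distance functions side by side, the snippet below uses SciPy's distance helpers and reproduces the |1 – 4| + |1 – 5| = 7 example; the use of SciPy is an assumption, since the slides work in R.

from scipy.spatial.distance import cityblock, cosine, euclidean

p, q = (1, 1), (4, 5)

print(euclidean(p, q))  # straight-line distance: sqrt(3**2 + 4**2) = 5.0
print(cityblock(p, q))  # Manhattan distance: |1 - 4| + |1 - 5| = 7
print(cosine(p, q))     # cosine *distance*, i.e. 1 - cosine similarity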
K-means clustering is applicable to objects that can be described by attributes that are numerical with
a meaningful distance measure. However, k-means does not handle categorical variables well.
For example, suppose a clustering analysis is to be conducted on new car sales. Among other attributes,
such as the sale price, the color of the car is considered important. Although one could assign numerical
values to the color, such as red = 1, yellow = 2, and green = 3, it is not useful to consider that yellow is
as close to red as yellow is to green from a clustering perspective. In such cases, it may be necessary to
use an alternative clustering methodology.
ADDITIONAL ALGORITHMS
The k-means clustering method is easily applied to numeric data where the concept of
distance can naturally be applied. However, it may be necessary or desirable to use an
alternative clustering algorithm.
As discussed at the end of the previous section, k-means does not handle categorical data.
In such cases, k-modes is a commonly used method for clustering categorical data based
on the number of differences in the respective components of the attributes.
For example, if each object has four attributes, the distance from (a, b, e, d) to (d, d, d, d) is 3.
In R, the kmodes() function is implemented in the klaR package.
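The dissimilarity measure itself is easy to sketch in plain Python; the helper name below is hypothetical, and it simply counts mismatched attributes, which is the distance k-modes works with.

def num_mismatches(x, y):
    # Number of attributes on which two categorical records disagree.
    return sum(a != b for a, b in zip(x, y))

# Distance from (a, b, e, d) to (d, d, d, d) is 3, as in the example above.
print(num_mismatches(("a", "b", "e", "d"), ("d", "d", "d", "d")))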
Because k-means and k-modes divide the entire dataset into distinct groups, both approaches
are considered partitioning methods. A third partitioning method is known as Partitioning
around Medoids (PAM).
In general, a medoid is a representative object in a set of objects. In clustering, the medoids
are the objects in each cluster that minimize the sum of the distances from the medoid to the
other objects in the cluster.
The advantage of using PAM is that the “center” of each cluster is an actual object in
the dataset. PAM is implemented in R by the pam() function included in the cluster R
package. The fpc R package includes a function pamk(), which uses the pam() function to find
the optimal value for k.
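The idea of a medoid can be sketched in a few lines of Python: within a cluster, it is the member with the smallest total distance to all other members. This is only the selection step; PAM itself additionally evaluates swaps of medoids and non-medoids. The data below is an illustrative assumption.

import numpy as np

def medoid(points):
    # Pairwise Euclidean distances between all members of the cluster.
    dist_matrix = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # The medoid is the member whose summed distance to the others is smallest.
    return points[dist_matrix.sum(axis=1).argmin()]

cluster = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0]])
print(medoid(cluster))  # an actual member of the cluster, unlike a k-means centroid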
Another alternative is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based algorithm that groups points lying in dense regions and labels every point as one of the following:
▪ Core Points – points with at least minPts neighbors within a radius eps
▪ Border/Boundary Points – points within eps of a core point but with fewer than minPts neighbors of their own
▪ Noise/Outliers – points that are neither core nor border points
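As a sketch (not the slides' code) of how these point types are determined, the hypothetical helper below applies the definitions directly, given the two DBSCAN parameters eps (neighborhood radius) and minPts (minimum neighborhood size); the data and parameter values are illustrative.

import numpy as np

def classify_points(X, eps=0.5, min_pts=4):
    # Label each point as 'core', 'border', or 'noise' per the DBSCAN definitions.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbor_counts = (dist <= eps).sum(axis=1)  # neighbors within eps, including the point itself
    core = neighbor_counts >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif core[dist[i] <= eps].any():  # within eps of at least one core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, size=(30, 2)), [[5.0, 5.0]]])  # one far-away outlier
print(classify_points(X)[-1])  # the outlier is classified as 'noise'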
Video (DBSCAN algorithm): https://www.bing.com/videos/search?q=dbscan+algorithm+video&qs=n&sp=-1&lq=0&pq=dbscan+algorithm+video&sc=6-22&sk=&cvid=F20866D7EA994B10A858FF746FB4DD29&ghsh=0&ghacc=0&ghpl=&ru=%2fsearch%3fq%3ddbscan%2balgorithm%2bvideo%26qs%3dn%26form%3dQBRE%26sp%3d-1%26lq%3d0%26pq%3ddbscan%2balgorithm%2bvideo%26sc%3d6-22%26sk%3d%26cvid%3dF20866D7EA994B10A858FF746FB4DD29%26ghsh%3d0%26ghacc%3d0%26ghpl%3d&view=detail&mmscn=vwrc&mid=4F8A4D63F261C954ABF64F8A4D63F261C954ABF6&FORM=WRVORC
https://www.bing.com/videos/riverview/relatedvideo?&q=dbscan&&mid=35884E6B4BFFDAB2F00035884E6B4BFFDAB2F000&&FORM=VRDGAR
DBSCAN
DBSCAN schematic (image): https://www.bing.com/images/search?view=detailV2&ccid=z9asJXY5&id=D21C5B3AF4E34CE3433710B7AE71DBC8394C7621&thid=OIP.z9asJXY5UkRJZ3NdZ5TpqAHaFg&mediaurl=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fwww.researchgate.net%2Fprofile%2FYijin_Liu%2Fpublication%2F308750501%2Ffigure%2Fdownload%2Ffig4%2FAS%3A412083041652736%401475259661770%2FSchematic-drawings-of-the-DBSCAN-clustering-algorithm-Panel-a-shows-the-clustering.png&cdnurl=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fth.bing.com%2Fth%2Fid%2FR.cfd6ac25763952444967735d6794e9a8%3Frik%3DIXZMOcjbca63EA%26pid%3DImgRaw%26r%3D0&exph=510&expw=685&q=DBSCAN+Clustering&simid=607997061279082409&form=IRPRST&ck=BFA17CD056C1FA9D9F0093FC4AF169F1&selectedindex=3&ajaxhist=0&ajaxserp=0&vt=0
[R fragments from the slide: diff(dist) and 1/length(dist), applied to the distance vector.]
In R's dbscan package, kNNdistplot() plots each point's distance to its k-th nearest neighbor in increasing order; the knee of this curve suggests a value for eps. hullplot() then draws the resulting clusters with their convex hulls.
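A Python analogue of kNNdistplot() can be sketched with scikit-learn's NearestNeighbors; the choice k = 4 and the synthetic data are assumptions. Sorting each point's distance to its k-th nearest neighbor produces the curve whose knee suggests a value for eps.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

k = 4
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
kth_dist = np.sort(distances[:, -1])  # sorted distance to each point's k-th nearest neighbor
# Plotting kth_dist (e.g., with matplotlib) reproduces the kNNdistplot curve; eps is read off at the knee.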
# db1 is the result of dbscan(); color each point of the clustered data (iris3) by its cluster label (0 = noise).
plot(iris3, col = db1$cluster)
DBSCAN illustration (image): https://www.bing.com/images/search?view=detailV2&ccid=3LdJGuOj&id=8CC2488C0F8B17E22DD76E9D59F2ACDFDA158EEE&thid=OIP.3LdJGuOjolHFG7m8W2NmXQHaCw&mediaurl=https%3a%2f%2fmiro.medium.com%2fmax%2f1200%2f1*KqWII7sFp1JL0EXwJGpqFw.png&cdnurl=https%3a%2f%2fround-lake.dustinice.workers.dev%3a443%2fhttps%2fth.bing.com%2fth%2fid%2fR.dcb7491ae3a3a251c51bb9bc5b63665d%3frik%3d7o4V2t%252bs8lmdbg%26pid%3dImgRaw%26r%3d0&exph=448&expw=1200&q=dbscan&simid=607992534376195191&FORM=IRPRST&ck=D05FFCD01B189C12EA922B280124E7E8&selectedIndex=3&ajaxhist=0&ajaxserp=0
KMEANS IN PYTHON
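The Python code on these slides survives only as screenshots; the block below is a plausible reconstruction using scikit-learn's KMeans on the iris data. The dataset choice and the parameter values (k = 3, n_init = 10) are assumptions rather than the slides' exact code.

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 flowers x 4 numeric measurements

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])        # cluster assignment of the first ten flowers
print(km.cluster_centers_)    # the three centroids
print(km.inertia_)            # WSS of this clustering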
DBSCAN IN PYTHON
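Likewise, a hedged reconstruction of the DBSCAN slides with scikit-learn; the eps and min_samples values are assumptions and would normally be tuned with a k-NN distance plot as described earlier.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Scale the features before density-based clustering so that eps is meaningful on all axes.
X = StandardScaler().fit_transform(load_iris().data)

db = DBSCAN(eps=0.8, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))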
REFERENCE
Data Science and Big Data Analytics, Wiley, 2015. Chapter 4: Advanced Analytical Theory and Methods: Clustering.