
Unit 4

Unsupervised Learning
Clustering

The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering. It is an unsupervised learning technique. A cluster is a collection of
data objects that are similar to one another within the same cluster and are dissimilar to
the objects in other clusters. Clustering can also be used for outlier detection, where
outliers may be more interesting than common cases. Applications of outlier detection
include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce. For example, exceptional cases in credit card transactions, such as
very expensive and frequent purchases, may be of interest as possible fraudulent activity.

Categories of Clustering Algorithms


Many clustering algorithms exist in the literature. In general, the major clustering
methods can be classified into the following categories.

1. Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k < n. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

2. Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive. The agglomerative approach follows the bottom-up approach. It starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until a termination condition holds. The divisive approach follows the top-down approach. It starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until a termination condition holds.

3. Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold.

4. Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. EM (Expectation-Maximization) is a well-known example that performs clustering based on statistical modeling.
Measures of Similarity
Distance measures are used in order to find similarity or dissimilarity between data
objects. The most popular distance measure is Euclidean distance, which is defined as
below.

d(x, y) = √((x2 − x1)² + (y2 − y1)²)

Where, x = (x1, y1) and y = (x2, y2)

Another well-known metric is Manhattan (or city block) distance, defined as below.

d(x, y) = |x2 − x1| + |y2 − y1|

Minkowski distance is a generalization of both Euclidean distance and Manhattan distance. It is defined as below.

d(x, y) = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)

Where, p is a positive integer. Such a distance is also called the Lp norm in some literature. It represents the Manhattan distance when p = 1 (i.e., the L1 norm) and the Euclidean distance when p = 2 (i.e., the L2 norm).
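As a quick check, these three measures can be computed directly in Python. The snippet below is a minimal sketch with hand-written helper functions (NumPy and SciPy provide equivalent routines); the sample points are chosen only for illustration.

import math

def euclidean(x, y):
    # L2 norm: straight-line distance between two 2D points
    return math.sqrt((y[0] - x[0])**2 + (y[1] - x[1])**2)

def manhattan(x, y):
    # L1 norm: sum of absolute coordinate differences
    return abs(y[0] - x[0]) + abs(y[1] - x[1])

def minkowski(x, y, p):
    # Lp norm: reduces to Manhattan for p=1 and Euclidean for p=2
    return (abs(y[0] - x[0])**p + abs(y[1] - x[1])**p) ** (1 / p)

print(euclidean((2, 5), (8, 4)))      # 6.08
print(manhattan((2, 5), (8, 4)))      # 7
print(minkowski((2, 5), (8, 4), 2))   # 6.08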

K-Means Algorithm
K-Means is one of the simplest partitioning-based clustering algorithms. The procedure follows a simple and easy way to group a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be chosen carefully, because different locations lead to different results. So, the better choice is to place them as far away from each other as possible.
Algorithm
Let X = {x1, x2, x3, ..., xn} be the set of data points and C = {c1, c2, ..., ck} be the cluster centers.
1. Select k cluster centers randomly

2. Calculate the distance between each data point and cluster centers.

3. Assign the data point to the cluster which is closest to the data point.

4. If No data is reassigned

• Display Clusters

• Terminate

5. Else

• Calculate centroid of each cluster and set cluster centers to centroids.

• Go to step 2
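The steps above can be sketched directly in Python. The following is a minimal NumPy implementation written for illustration only (the function and variable names are our own, not from any library); it does not handle edge cases such as empty clusters.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: compute distances to all centers and assign each point to the closest one
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: terminate when no point is reassigned
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 5: move each center to the centroid of its cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Example: the six points from the numerical example below
centers, labels = kmeans([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4)], k=2)
print(centers)
print(labels)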

Numerical Example
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5) p6=(6,4)
Initial step
Choose Cluster centers randomly

Let c1=(2,5) and c2=(6,4) be the two initial cluster centers.


Iteration 1
Calculate the distance between the cluster centers and each data point
d(c1,p1)=5 d(c2,p1)=7.21

d(c1,p2)=0 d(c2,p2)=4.12

d(c1,p3)=6.08 d(c2,p3)=2

d(c1,p4)=4.24 d(c2,p4)=4.12

d(c1,p5)=5 d(c2,p5)=1.41

d(c1,p6)=4.12 d(c2,p6)=0

Thus, Cluster1={p1,p2} cluster2={p3,p4,p5,p6}


Iteration 2
New Cluster centers: c1=(2,7.5) c2=(6.5,5.25)

Again, calculate the distance between the cluster centers and each data point

d(c1,p1)=2.5 d(c2,p1)=6.54

d(c1,p2)=2.5 d(c2,p2)=4.51

d(c1,p3)=6.95 d(c2,p3)=1.95

d(c1,p4)=3.04 d(c2,p4)=3.13

d(c1,p5)=5.59 d(c2,p5)=0.56

d(c1,p6)=5.32 d(c2,p6)=1.35

Thus, Cluster1={p1,p2,p4} cluster2={p3,p5,p6}


Iteration 3
New Cluster centers: c1=(3,7.67) c2=(7,4.33)

Again, calculate the distance between the cluster centers and each data point
d(c1,p1)=2.54 d(c2,p1)=7.56
d(c1,p2)=2.85 d(c2,p2)=5.04

d(c1,p3)=6.2 d(c2,p3)=1.05

d(c1,p4)=2.03 d(c2,p4)=4.18

d(c1,p5)=4.81 d(c2,p5)=0.67

d(c1,p6)=4.74 d(c2,p6)=1.05

Thus, Cluster1={p1,p2,p4} cluster2={p3,p5,p6}


No data points are re-assigned
Thus, final clusters are: Cluster1={p1,p2,p4} cluster2={p3,p5,p6}
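The same result can be checked with scikit-learn by passing the same initial centers to KMeans (init also accepts an explicit array of starting centers). The snippet below is only a sanity check of the hand computation above.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4)])
init_centers = np.array([(2, 5), (6, 4)])   # c1 and c2 from the initial step

km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(points)
print(km.cluster_centers_)   # expected: approximately (3, 7.67) and (7, 4.33)
print(km.labels_)            # p1, p2, p4 in one cluster; p3, p5, p6 in the other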

In Python, the KMeans class from the sklearn.cluster module is used to create an instance of the K-Means algorithm. Two major parameters of this class are n_clusters and init. The parameter n_clusters is used to specify the value of k (the number of clusters), and the init parameter is used to specify the initialization method; init='random' selects random initial centers. Once the KMeans instance is created, its fit() method is used to compute the clusters. This method accepts the dataset as an input argument and stores the final cluster centers in the cluster_centers_ attribute and the cluster labels of the dataset in the labels_ attribute.
Example
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km=KMeans(n_clusters=2,init='random')
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:",*centers)
print("Cluster Labels:",*labels)
#Displaying clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if labels[i]==0:
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)

K-Medoids Clustering Algorithm

K-Medoids is also a partitioning-based clustering algorithm. It is also called the Partitioning Around Medoids (PAM) algorithm. A medoid can be defined as the point in the cluster whose total dissimilarity to all the other points in the cluster is minimum. K-Medoids differs from the K-Means algorithm mainly in the way it selects the cluster centers: the K-Means algorithm takes the average of a cluster's points as its center, whereas the K-Medoids algorithm always picks actual data points from the clusters as the centers.

The K-Medoids algorithm selects k medoids (cluster centers) randomly and swaps each medoid with each non-medoid data point. A swap is accepted only when it decreases the total cost. The total cost is the sum of the distances from all the data points to their medoids and is calculated as below.

c = Σmi Σpi∈Ci |pi − mi|

Where, mi is the medoid of cluster Ci and pi is a non-medoid data point assigned to it.
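For illustration, the total cost for a given choice of medoids can be computed as below. This is a minimal sketch with hand-written helper functions; Manhattan distance is used here because it is also the distance used in the numerical example that follows.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    # each point contributes its distance to the closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4)]
print(total_cost(points, [(2, 5), (6, 4)]))   # 14, as in iteration 1 below
print(total_cost(points, [(8, 4), (6, 4)]))   # 22, as in iteration 3 below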

Algorithm
1. Select k medoids randomly.
2. Assign each data point to the closest medoid.

3. Compute Total cost

4. For each medoid m
    For each non-medoid point p
        • Swap m and p
        • Assign each data point to the closest medoid
        • Compute the total cost
        • If the total cost is more than that in the previous step
            • Undo the swap
5. Display clusters
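Putting these steps together, the swap loop can be sketched in Python as below. This is an illustrative implementation with our own helper names, not the PAM routine of any library; for simplicity it starts from the first k points instead of a random selection and reuses the Manhattan-distance cost from above.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    # sum of distances from every point to its closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    medoids = list(points[:k])            # step 1 (random selection replaced by first k points)
    cost = total_cost(points, medoids)    # steps 2-3
    improved = True
    while improved:
        improved = False
        # step 4: try swapping every medoid with every non-medoid point
        for i in range(k):
            for p in points:
                if p in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = p
                new_cost = total_cost(points, candidate)
                if new_cost < cost:       # keep the swap only if the total cost decreases
                    medoids, cost = candidate, new_cost
                    improved = True
    # step 5: form the final clusters around the chosen medoids
    clusters = [[p for p in points
                 if min(medoids, key=lambda m: manhattan(p, m)) == m0]
                for m0 in medoids]
    return medoids, clusters

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4)]
print(k_medoids(points, 2))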

Numerical Example
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5)
p6=(6,4)
Initial step
Let m1=(2,5) and m2=(6,4) be the two initial cluster centers (medoids).
Iteration 1
Calculate the distance between the medoids and each data point (Manhattan distance is used in this example)
d(m1,p1)=5 d(m2,p1)=10
d(m1,p2)=0 d(m2,p2)=5

d(m1,p3)=7 d(m2,p3)=2

d(m1,p4)=6 d(m2,p4)=5

d(m1,p5)=5 d(m2,p5)=2

d(m1,p6)=5 d(m2,p6)=0
Thus, Cluster1={p1,p2} cluster2={p3,p4,p5,p6}
Total Cost=5+0+2+5+2+0=14

Iteration 2:
Swap m1 with p1, m1 =(2,10) m2=(6,4)

Calculate the distance between the medoids and each data point


d(m1,p1)=0 d(m2,p1)=10
d(m1,p2)=5 d(m2,p2)=5

d(m1,p3)=12 d(m2,p3)=2

d(m1,p4)=5 d(m2,p4)=5

d(m1,p5)=10 d(m2,p5)=2

d(m1,p6)=10 d(m2,p6)=0

Thus, Cluster1={p1,p2,p4} cluster2={p3,p5,p6}


Total Cost=0+5+2+5+2+0=14

Iteration 3:
Swap m1 with p3, m1 =(8,4) m2=(6,4)

Calculate the distance between the medoids and each data point

d(m1,p1)=12 d(m2,p1)=10
d(m1,p2)=7 d(m2,p2)=5

d(m1,p3)=0 d(m2,p3)=2

d(m1,p4)=7 d(m2,p4)=5

d(m1,p5)=2 d(m2,p5)=2

d(m1,p6)=2 d(m2,p6)=0

Thus, Cluster1={p3,p5} cluster2={p1,p2,p4,p6}

Total Cost=10+5+0+5+2+0=22 => Undo Swapping

Continue this process…

In Python, the KMedoids class from the sklearn_extra.cluster module is used to create an instance of the K-Medoids algorithm. This module does not come with the default Python installation, so we may need to install it (the scikit-learn-extra package). One of the major parameters of this class is n_clusters, which is used to specify the value of k (the number of clusters).

Once the KMedoids instance is created, its fit() method, like that of KMeans, is used to compute the clusters. This method accepts the dataset as an input argument and stores the final cluster centers in the cluster_centers_ attribute and the cluster labels of the dataset in the labels_ attribute.
Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km=KMedoids(n_clusters=2)
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:",*centers)
print("Cluster Labels:",*labels)
#Displaying clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if labels[i]==0:
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)

Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: agglomerative and divisive. Agglomerative clustering is a bottom-up approach. Initially, each observation is treated as a separate cluster, and pairs of clusters are merged as one moves up the hierarchy. This process continues until a single cluster, or the required number of clusters, is formed. A distance matrix is used for deciding which clusters to merge.

A cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down or divisive clustering. We start at the top with all data in one cluster. The cluster is split into two clusters such that the objects in one subgroup are far from the objects in the other. This procedure is applied recursively until the required number of clusters is formed. This method is not considered attractive because there exist O(2ⁿ) ways of splitting each cluster.

Agglomerative Clustering Algorithm

1. Compute the distance matrix between the input data points


2. Let each data point be a cluster
3. Repeat steps 4 and 5 until only K clusters remain

4. Merge the two closest clusters
5. Update the distance matrix
Example
Cluster the data points (1,1), (1.5,1.5), (5,5), (3,4), (4,4), (3, 3.5) into two clusters.
Solution
A=(1,1), B= (1.5,1.5), C=(5,5), D=(3,4), E=(4,4), F=(3,3.5)
Distance Matrix

The closest cluster are cluster {F} and {D} with shortest distance of 0.5. Thus, we
group cluster D and F into single cluster {D, F}.
Update the Distance Matrix (the distance between two clusters is taken as the minimum distance between their members, i.e., single linkage)

         A      B      C      {D,F}   E
A      0
B      0.71   0
C      5.66   4.95   0
{D,F}  3.20   2.50   2.24   0
E      4.24   3.54   1.41   1.00    0

We can see that the distance between cluster {A} and cluster {B} is minimum, with distance 0.71. Thus, we group cluster {A} and cluster {B} into a single cluster named {A, B}.

Updated Distance Matrix

         {A,B}  C      {D,F}   E
{A,B}  0
C      4.95   0
{D,F}  2.50   2.24   0
E      3.54   1.41   1.00    0

We can see that the distance between cluster {E} and cluster {D, F} is minimum, with distance 1.00. Thus, we group them together into cluster {D, E, F}.

Updated Distance Matrix

           {A,B}   C      {D,E,F}
{A,B}    0
C        4.95    0
{D,E,F}  2.50    1.41   0

After that, we merge cluster {D, E, F} and cluster {C} into a new cluster {C, D, E, F}, because cluster {D, E, F} and cluster {C} are the closest clusters, with distance 1.41.

Updated Distance Matrix

             {A,B}   {C,D,E,F}
{A,B}      0
{C,D,E,F}  2.50    0

Now, we have only two clusters.

Thus, Final clusters are: {A, B} and {C, D, E, F}
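The same sequence of merges can be reproduced in Python with SciPy's hierarchical clustering routines. The sketch below uses single linkage, which matches the minimum-distance merges applied above.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([(1, 1), (1.5, 1.5), (5, 5), (3, 4), (4, 4), (3, 3.5)])  # A..F
Z = linkage(points, method='single')            # pairwise merges and their distances
print(Z)
print(fcluster(Z, t=2, criterion='maxclust'))   # cut the hierarchy into 2 clusters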

In Python, the AgglomerativeClustering class from the sklearn.cluster module is used to create an instance of the agglomerative clustering algorithm. One of the major parameters of this class is n_clusters, which is used to specify the value of k (the number of clusters). Once the AgglomerativeClustering instance is created, its fit() method is used to compute the clusters. This method accepts the dataset as an input argument and stores the cluster labels of the dataset in the labels_ attribute.
Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
ac=AgglomerativeClustering(n_clusters=2)
ac.fit(data)
labels = ac.labels_
print("Cluster Labels:",*labels)

#Displaying clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if labels[i]==0:
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)
