Unit 4 Machine Learning
Unsupervised Learning
Clustering
The process of grouping a set of physical or abstract objects into classes of similar objects
is called clustering. It is an unsupervised learning technique. A cluster is a collection of
data objects that are similar to one another within the same cluster and are dissimilar to
the objects in other clusters. Clustering can also be used for outlier detection, where
outliers may be more interesting than common cases. Applications of outlier detection
include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce. For example, exceptional cases in credit card transactions, such as
very expensive and frequent purchases, may be of interest as possible fraudulent activity.
Another well-known metric is Manhattan (or city block) distance, defined as below.
d(x, y) = |x2 − x1| + |y2 − y1|
The Minkowski distance generalizes both the Euclidean and the Manhattan distance:
d(x, y) = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)
which reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.
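A quick illustrative snippet (our own, not part of the notes; the helper names manhattan and minkowski are assumptions) computes these distances for two sample points:
def manhattan(x, y):
    # d(x, y) = |x2 - x1| + |y2 - y1|
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def minkowski(x, y, p):
    # d(x, y) = (|x2 - x1|^p + |y2 - y1|^p)^(1/p)
    return (abs(x[0] - y[0])**p + abs(x[1] - y[1])**p) ** (1/p)

print(manhattan((2,5), (8,4)))      # 7
print(minkowski((2,5), (8,4), 1))   # 7.0  (p = 1 gives the Manhattan distance)
print(minkowski((2,5), (8,4), 2))   # 6.08... (p = 2 gives the Euclidean distance)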
K-Means Algorithm
K-Means is one of the simplest partitioning-based clustering algorithms. The procedure
follows a simple and easy way to group a given data set into a certain number of
clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each
cluster. These centers should be chosen carefully, because different locations lead to
different results. So, the better choice is to place them as far away
from each other as possible.
Algorithm
Let X = {x1, x2, x3, ..., xn} be the set of data points and C = {c1, c2, ..., ck} be the cluster
centers.
1. Select k cluster centers randomly.
2. Calculate the distance between each data point and every cluster center.
3. Assign each data point to the cluster whose center is closest to it.
4. Recalculate each cluster center as the mean of the data points assigned to it.
5. If no data point is reassigned
• Display Clusters
• Terminate
6. Else
• Go to step 2
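A minimal NumPy sketch of these steps (our own illustration, not code from the notes; the function name kmeans and the seeded random initialisation are assumptions):
import numpy as np

def kmeans(points, k, seed=0):
    X = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: select k data points at random as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    while True:
        # Steps 2-3: distance from every point to every center, assign to the closest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Steps 5-6: terminate when no point is reassigned, otherwise repeat
        if labels is not None and np.array_equal(new_labels, labels):
            return centers, labels
        labels = new_labels
        # Step 4: recompute each center as the mean of its assigned points
        # (this simple sketch assumes no cluster ever becomes empty)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

centers, labels = kmeans([(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)], k=2)
print("Cluster Centers:", *centers)
print("Cluster Labels:", *labels)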
Numerical Example
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5) p6=(6,4)
Initial step
Choose the cluster centers randomly; let c1=(2,5) and c2=(6,4).
Iteration 1
Calculate the (Euclidean) distance between the cluster centers and each data point
d(c1,p1)=5 d(c2,p1)=7.21
d(c1,p2)=0 d(c2,p2)=4.12
d(c1,p3)=6.08 d(c2,p3)=2
d(c1,p4)=4.24 d(c2,p4)=4.12
d(c1,p5)=5 d(c2,p5)=1.41
d(c1,p6)=4.12 d(c2,p6)=0
Assign each data point to its closest center: Cluster1={p1,p2}, Cluster2={p3,p4,p5,p6}.
Recompute the cluster centers as the means of the assigned points: c1=(2, 7.5) and c2=(6.5, 5.25).
Iteration 2
Again, calculate the distance between the cluster centers and each data point
d(c1,p1)=2.5 d(c2,p1)=6.54
d(c1,p2)=2.5 d(c2,p2)=4.51
d(c1,p3)=6.95 d(c2,p3)=1.95
d(c1,p4)=3.04 d(c2,p4)=3.13
d(c1,p5)=5.59 d(c2,p5)=0.56
d(c1,p6)=5.32 d(c2,p6)=1.35
Assign each data point to its closest center: Cluster1={p1,p2,p4}, Cluster2={p3,p5,p6}.
Recompute the cluster centers: c1=(3, 7.67) and c2=(7, 4.33).
Iteration 3
Again, calculate the distance between the cluster centers and each data point
d(c1,p1)=2.54 d(c2,p1)=7.56
d(c1,p2)=2.85 d(c2,p2)=5.04
d(c1,p3)=6.2 d(c2,p3)=1.05
d(c1,p4)=2.03 d(c2,p4)=4.18
d(c1,p5)=4.81 d(c2,p5)=0.67
d(c1,p6)=4.74 d(c2,p6)=1.05
Assigning each data point to its closest center again gives Cluster1={p1,p2,p4} and Cluster2={p3,p5,p6}. Since no data point is reassigned, the algorithm terminates with these two clusters.
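As a quick check of the hand computation (our own snippet, not part of the notes), the iteration-3 distances can be reproduced with NumPy:
import numpy as np

points = np.array([(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)], dtype=float)
c1 = np.array([3, 7.67])    # final center of Cluster1 = mean of p1, p2, p4
c2 = np.array([7, 4.33])    # final center of Cluster2 = mean of p3, p5, p6

d1 = np.linalg.norm(points - c1, axis=1)   # matches d(c1,p1) ... d(c1,p6) above
d2 = np.linalg.norm(points - c2, axis=1)   # matches d(c2,p1) ... d(c2,p6) above
print(np.round(d1, 2))
print(np.round(d2, 2))
print(["Cluster1" if a < b else "Cluster2" for a, b in zip(d1, d2)])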
The K-means algorithm is also available as the KMeans class in scikit-learn. The following program clusters a sample data set into two clusters:
from sklearn.cluster import KMeans

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km=KMeans(n_clusters=2,init='random')
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:",*centers)
print("Cluster Labels:",*labels)
# Displaying Clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if (labels[i]==0):
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)
K-Medoids Algorithm
The K-medoid algorithm selects k medoids (cluster centers) randomly and swaps each
medoid with each non-medoid data point. A swap is accepted only when it decreases the total
cost. The total cost is the sum of the distances from all data points to their closest medoids
and is calculated as below.
c = Σ_mi Σ_(pi ∈ cluster of mi) |pi − mi|
where |pi − mi| denotes the distance between data point pi and its medoid mi.
Algorithm
1. Select k medoids randomly.
2. Assign each data point to the closest medoid.
3. Calculate the total cost.
4. Swap a medoid with a non-medoid data point, reassign the data points, and recalculate the total cost.
5. If the total cost decreases, accept the swap; otherwise, undo it.
6. Repeat steps 4 and 5 for every medoid / non-medoid pair until no swap decreases the total cost.
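The total-cost computation at the heart of the algorithm can be sketched as follows (our own illustration using Manhattan distance, which the worked example below also uses; the function names are assumptions):
def manhattan(p, q):
    # Manhattan distance between two 2-D points
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # Sum of the distance from every data point to its closest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

points = [(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)]
print(total_cost(points, [(2,5), (6,4)]))   # 14 -> cost of the initial medoids
print(total_cost(points, [(8,4), (6,4)]))   # 22 -> swapping m1 with p3 increases the cost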
Numerical Example
Divide the data points {(2,10), (2,5), (8,4), (5,8), (7,5), (6,4)} into two clusters.
Solution
Let p1=(2,10) p2=(2,5) p3=(8,4) p4=(5,8) p5=(7,5) p6=(6,4)
Initial step
Let m1=(2,5) and m2=(6,4) be the two initial cluster centers (medoids).
Iteration 1
Calculate the (Manhattan) distance between the medoids and each data point
d(m1,p1)=5 d(m2,p1)=10
d(m1,p2)=0 d(m2,p2)=5
d(m1,p3)=7 d(m2,p3)=2
d(m1,p4)=6 d(m2,p4)=5
d(m1,p5)=5 d(m2,p5)=2
d(m1,p6)=5 d(m2,p6)=0
Thus, Cluster1={p1,p2} cluster2={p3,p4,p5,p6}
Total Cost=5+0+2+5+2+0=14
Iteration 2
Swap m1 with p1, so m1=(2,10) and m2=(6,4). Calculate the distance between the medoids and each data point
d(m1,p1)=0 d(m2,p1)=10
d(m1,p2)=5 d(m2,p2)=5
d(m1,p3)=12 d(m2,p3)=2
d(m1,p4)=5 d(m2,p4)=5
d(m1,p5)=10 d(m2,p5)=2
d(m1,p6)=10 d(m2,p6)=0
Total Cost=0+5+2+5+2+0=14. Since the total cost does not decrease, the swap is rejected and m1 is restored to (2,5).
Iteration 3:
Swap m1 with p3, m1 =(8,4) m2=(6,4)
d(m1,p1)=12 d(m2,p1)=10
d(m1,p2)=7 d(m2,p2)=5
d(m1,p3)=0 d(m2,p3)=2
d(m1,p4)=7 d(m2,p4)=5
d(m1,p5)=2 d(m2,p5)=2
d(m1,p6)=2 d(m2,p6)=0
Total Cost=10+5+0+5+2+0=22, which is greater than 14, so this swap is also rejected. The remaining swaps can be evaluated in the same way; none of them reduces the total cost below 14, so the algorithm terminates with Cluster1={p1,p2} and Cluster2={p3,p4,p5,p6}.
The K-medoids algorithm is implemented by the KMedoids class in the scikit-learn-extra package:
from sklearn_extra.cluster import KMedoids

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
km=KMedoids(n_clusters=2)
km.fit(data)
centers = km.cluster_centers_
labels = km.labels_
print("Cluster Centers:",*centers)
print("Cluster Labels:",*labels)
# Displaying Clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if (labels[i]==0):
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)
Hierarchical Clustering
In data mining and statistics, hierarchical clustering (also called hierarchical cluster
analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of
clusters. Strategies for hierarchical clustering generally fall into two types: agglomerative
and divisive. Agglomerative clustering is a bottom-up approach: initially, each
observation is placed in its own cluster, and pairs of clusters are merged as one
moves up the hierarchy. This process continues until a single cluster, or the required
number of clusters, is formed. Divisive clustering is the opposite, top-down approach: all
observations start in a single cluster, which is split recursively as one moves down the
hierarchy. A distance matrix is used for deciding which clusters to merge.
The closest clusters are {D} and {F}, with the shortest distance of 0.5. Thus, we
merge clusters {D} and {F} into a single cluster {D, F}.
Update the Distance Matrix
We can see that the distance between cluster {A} and cluster {B} is the minimum, 0.71.
Thus, we group cluster {A} and cluster {B} into a single cluster named {A, B}.
After that, we merge cluster {D, E, F} and cluster {C} into a new cluster {C, D, E, F},
because cluster {D, E, F} and cluster {C} are the closest clusters, with distance 1.41.
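The clusters can also be computed in Python. A minimal sketch, assuming scikit-learn's AgglomerativeClustering and the same sample data set used in the earlier programs:
from sklearn.cluster import AgglomerativeClustering

data=[(2,8),(3,2),(1,4),(4,6),(3,5),(2,3),(5,7),(4,8),(4,2),(1,3),(4,5),(3,4),(7,4),(2,1)]
model = AgglomerativeClustering(n_clusters=2)   # agglomerative (bottom-up) clustering
model.fit(data)
labels = model.labels_
print("Cluster Labels:", *labels)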
# Displaying Clusters
cluster1=[]
cluster2=[]
for i in range(len(labels)):
    if (labels[i]==0):
        cluster1.append(data[i])
    else:
        cluster2.append(data[i])
print("Cluster 1:",cluster1)
print("Cluster 2:",cluster2)