ML Exp5 C36

Prathmesh Gaikwad

TUS3F202128 C36

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No. 5
A.1 Aim:
To implement K-means clustering

A.2 Prerequisite:
Python Basic Concepts

A.3 Outcome:
Students will be able to implement K-means clustering.

A.4 Theory:

K-means clustering is one of the most widely used unsupervised machine learning
algorithms; it forms clusters of data based on the similarity between data instances.
For this algorithm to work, the number of clusters has to be defined
beforehand: the K in K-means refers to the number of clusters.

The K-means algorithm starts by randomly choosing a centroid value for each
cluster. After that, the algorithm iteratively performs three steps: (i) find the
Euclidean distance between each data instance and the centroids of all the clusters;
(ii) assign each data instance to the cluster of the nearest centroid; (iii)
calculate new centroid values as the mean of the coordinates of all the data
instances in the corresponding cluster.
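These three steps translate directly into a short NumPy sketch (the function below, its parameter names, and the random initialization are illustrative, not a library API):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n_samples, n_features) array; k: number of clusters (chosen beforehand)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (i) Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # (ii) Assign each point to the cluster of its nearest centroid
        labels = dists.argmin(axis=1)
        # (iii) Recompute each centroid as the mean of its cluster's points
        # (assumes no cluster becomes empty; real code would reseed empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids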

Hierarchical-Based Methods: The clusters formed by this method form a tree-like
structure based on the hierarchy; new clusters are formed using previously
formed ones. It is divided into two categories:

Agglomerative (bottom-up approach)

Divisive (top-down approach)

Agglomerative Clustering:

Agglomerative algorithms start with each individual item in its own cluster and
iteratively merge clusters until all items belong to one cluster. Different
agglomerative algorithms differ in how the clusters are merged at each level.
Outputting the dendrogram produces a set of clusterings rather than just one;
the user can then decide, based on a distance threshold, which clustering to use.

Agglomerative Algorithm

Compute the distance matrix between the input data points, and let each data
point start as its own cluster. Then repeat until only a single cluster remains:

Merge the two closest clusters.

Update the distance matrix.

A brief SciPy sketch of this procedure follows.
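As a sketch of this loop in practice, SciPy's hierarchy module computes the full sequence of merges in one call, and fcluster then cuts the resulting tree at a chosen distance threshold (the toy data and the threshold 2.0 below are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data: two loose groups of 2-D points
X = np.array([[1, 1], [1.5, 1], [1, 1.5],
              [5, 5], [5.5, 5], [5, 5.5]])

Z = linkage(X, method='single')  # agglomerative merge sequence (single link)
dendrogram(Z)                    # visualize the merge tree
plt.show()

# Cut the dendrogram at distance 2.0 to obtain flat cluster labels
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)                    # two clusters on this toy data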



Distance between two clusters: each cluster is a set of points, and the distance
between two clusters can be defined in the following ways.

Single Link:

Distance between clusters Ci and Cj is the minimum distance between any object in
Ci and any object in Cj.

Complete Link:

Distance between clusters Ci and Cj is the maximum distance between any object in
Ci and any object in Cj.

Average Link:

Distance between clusters Ci and Cj is the average distance between the objects in
Ci and the objects in Cj, averaged over all pairs.
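These three definitions correspond to the linkage parameter of scikit-learn's AgglomerativeClustering; a minimal sketch comparing them on the same illustrative toy data as above:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 1], [1.5, 1], [1, 1.5],
              [5, 5], [5.5, 5], [5, 5.5]])

for link in ('single', 'complete', 'average'):
    model = AgglomerativeClustering(n_clusters=2, linkage=link)
    # All three linkage criteria agree on well-separated data like this
    print(link, model.fit_predict(X))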

PART B
(PART B : TO BE COMPLETED BY STUDENTS)

Roll No: BE-C36 Name: Prathmesh Krishna Gaikwad


Class: BE-Comps Batch: C2
Date of Experiment: 26/09/2023 Date of Submission: 26/09/2023
Grade:

B.1 Software Code written by student:


1. Mall_Customers
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('Mall_Customers.csv')
df.head()
df.columns
X1 = df[['Age', 'Spending Score (1-100)']].values
inertia = []
for n in range(1, 15):
    algorithm = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300,
                       tol=0.0001, random_state=111, algorithm='elkan')
    algorithm.fit(X1)
    inertia.append(algorithm.inertia_)
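# Assumed addition (not in the original listing): plot the inertia values just
# computed so the elbow that justifies k = 4 below is visible
plt.figure(figsize=(10, 4))
plt.plot(range(1, 15), inertia, 'o-')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()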
algorithm = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300,
                   tol=0.0001, random_state=111, algorithm='elkan')
algorithm.fit(X1)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
# Mesh grid over the feature space, used to draw the cluster decision regions
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])

plt.figure(1, figsize=(15, 7))
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Pastel2, aspect='auto', origin='lower')

# Data points coloured by cluster label, centroids overlaid in red
plt.scatter(x='Age', y='Spending Score (1-100)', data=df, c=labels1, s=100)
plt.scatter(x=centroids1[:, 0], y=centroids1[:, 1], s=300, c='red', alpha=0.5)
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Age')
plt.show()

2. Iris
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/content/Iris.csv")
df.head()
df['Species'], categories = pd.factorize(df['Species'])
df.head()
df.describe()
df.isna().sum()
sns.scatterplot(data=df, x="SepalLengthCm", y="SepalWidthCm",hue="Species");
sns.scatterplot(data=df, x="PetalLengthCm", y="PetalWidthCm",hue="Species");

3. Housing
import pandas as pd
home_data = pd.read_csv('housing.csv', usecols = ['longitude', 'latitude', 'median_house_value'])
home_data.head()

import seaborn as sns


sns.scatterplot(data = home_data, x = 'longitude', y = 'latitude', hue = 'median_house_value')

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(home_data[['latitude', 'longitude']],
home_data[['median_house_value']], test_size=0.33, random_state=0)

from sklearn import preprocessing


# Note: preprocessing.normalize scales each row (each lat/long pair) to unit norm
X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)

from sklearn.cluster import KMeans


kmeans = KMeans(n_clusters = 3, random_state = 0, n_init='auto')

kmeans.fit(X_train_norm)

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = kmeans.labels_)


sns.boxplot(x = kmeans.labels_, y = y_train['median_house_value'])

from sklearn.metrics import silhouette_score


# Silhouette score ranges from -1 to 1; values near 1 indicate well-separated clusters
silhouette_score(X_train_norm, kmeans.labels_, metric='euclidean')

K = range(2, 8)
fits = []
score = []

for k in K:
    # Train the model for the current value of k on the training data
    model = KMeans(n_clusters=k, random_state=0, n_init='auto').fit(X_train_norm)
    # Append the fitted model to fits
    fits.append(model)
    # Append the silhouette score to score
    score.append(silhouette_score(X_train_norm, model.labels_, metric='euclidean'))

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[0].labels_)


sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[2].labels_)
sns.lineplot(x = K, y = score)
sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[3].labels_)
sns.boxplot(x = fits[3].labels_, y = y_train['median_house_value'])

B.2 Input and Output:


1. Mall_Customers
(Output: clustered scatter plot of Age vs. Spending Score with cluster regions and centroids.)

2. Iris
(Output: scatter plots of sepal and petal measurements coloured by species.)

3. Housing
(Output: scatter plots of clusters over longitude/latitude, box plots of median house value per cluster, and a silhouette-score line plot over k.)

B.3 Observations and learning:


K-means clustering is a method of vector quantization, originally from signal
processing, that aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean (cluster centre or
centroid), which serves as a prototype of the cluster.
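Formally, K-means seeks a partition S = {S_1, ..., S_k} with cluster means mu_i that minimizes the within-cluster sum of squares:

\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2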

B.4 Conclusion
Hence, we successfully learned and implemented K-means clustering.

B.5 Question of Curiosity (Handwritten any 3)


1. What is Agglomerative clustering? Explain in detail with algorithm.
2. Explain Divisive clustering (top-down approach).
3. What are the limitations while implementing K-means clustering?
4. Explain the steps involved in clustering the data using K-means clustering algorithm?
5. How are centroids calculated using K-means clustering algorithm?
6. What are the disadvantages of K-means?
7. How is K-means clustering applied to a pandas DataFrame in Python?
