
Clustering - Jupyter Notebook

This Jupyter Notebook explores k-means clustering on the iris dataset. It loads the data, visualizes the relationships between features, implements a basic k-means algorithm from scratch, and visualizes how the cluster centers change as the number of iterations increases.

Uploaded by

reema dsouza

9/8/2021 Completed 012 2021-08-23 to 2021-09-10 Part C Clustering - Jupyter Notebook

Clustering

k-means Clustering
https://en.wikipedia.org/wiki/K-means_clustering

UCI Machine Learning Repository


https://archive.ics.uci.edu/ml/datasets.php

In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style

In [2]: style.use('default')

localhost:8888/notebooks/Completed 012 2021-08-23 to 2021-09-10 Part C Clustering.ipynb 1/11



In [3]: iris_df = sns.load_dataset('iris')


iris_df

Out[3]:      sepal_length  sepal_width  petal_length  petal_width    species
        0             5.1          3.5           1.4          0.2     setosa
        1             4.9          3.0           1.4          0.2     setosa
        2             4.7          3.2           1.3          0.2     setosa
        3             4.6          3.1           1.5          0.2     setosa
        4             5.0          3.6           1.4          0.2     setosa
        ...           ...          ...           ...          ...        ...
        145           6.7          3.0           5.2          2.3  virginica
        146           6.3          2.5           5.0          1.9  virginica
        147           6.5          3.0           5.2          2.0  virginica
        148           6.2          3.4           5.4          2.3  virginica
        149           5.9          3.0           5.1          1.8  virginica

        150 rows × 5 columns

In [4]: iris_df['species'].value_counts()

Out[4]: virginica     50
        setosa        50
        versicolor    50
        Name: species, dtype: int64


In [5]: plt.figure(figsize = (7,7))


sns.scatterplot(data = iris_df, x = 'sepal_length', y = 'sepal_width', hue = 'spe

Out[5]: <AxesSubplot:xlabel='sepal_length', ylabel='sepal_width'>


In [6]: plt.figure(figsize = (7,7))


sns.scatterplot(data = iris_df, x = 'sepal_length', y = 'petal_width', hue = 'spe

Out[6]: <AxesSubplot:xlabel='sepal_length', ylabel='petal_width'>


In [7]: plt.figure(figsize = (5,5))


sns.scatterplot(data = iris_df, x = 'petal_length', y = 'petal_width', hue = 'spe

Out[7]: <AxesSubplot:xlabel='petal_length', ylabel='petal_width'>

Raw Coding the k Means Clustering Algorithm


In [8]: from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=0, num_iter = 100):
    # 1. Randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]

    iter = 1
    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)

        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])

        # 2c. Check for convergence
        print(num_iter, iter)
        iter +=1
        if iter > num_iter:
            break

        if np.all(centers == new_centers):
            break

        centers = new_centers

    return centers, labels

X = iris_df.iloc[:, :-1].to_numpy()
centers = []
labels = []
for i in [1, 2, 5, 10]:
    out_center, out_label = find_clusters(X, 3, num_iter = i, rseed=0)
    centers.append(out_center)
    labels.append(out_label)

1 1
2 1
2 2
5 1
5 2
5 3
5 4
5 5
10 1
10 2
10 3
10 4
10 5
10 6
10 7
10 8
10 9
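Each printed pair is (num_iter, current iteration). The last run stops at iteration 9 although 10 were allowed, which suggests the exact-equality check on the centers fired before the cap was reached. The following is a minimal NumPy-only sketch of the same loop, not the notebook's own code: it assumes a tolerance-based check (np.allclose) instead of exact equality, keeps the old center when a cluster is empty, and reports how many iterations were used. The names kmeans_sketch, tol, and X_demo, and the synthetic data, are hypothetical and chosen only for illustration.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, num_iter=100, tol=1e-8, rseed=0):
    """Lloyd-style k-means; returns (centers, labels, iterations_used)."""
    rng = np.random.RandomState(rseed)
    # Pick n_clusters distinct rows as the initial centers.
    centers = X[rng.permutation(X.shape[0])[:n_clusters]].astype(float)

    for iteration in range(1, num_iter + 1):
        # Assignment step: squared Euclidean distance to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)

        # Update step: mean of each cluster; keep the old center if a cluster is empty.
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)

        # Stop once the centers no longer move (within the tolerance).
        if np.allclose(centers, new_centers, atol=tol):
            break
        centers = new_centers

    return centers, labels, iteration

# Synthetic, well-separated blobs (illustrative data, not the iris set).
rng = np.random.RandomState(42)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
centers_demo, labels_demo, n_used = kmeans_sketch(X_demo, 3, num_iter=50)
print(n_used)
print(centers_demo.round(2))
```

A tolerance is used here because exact floating-point equality can be brittle; whether that matters for a given data set is an assumption to check rather than a given.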

In [9]: iris_df

Out[9]:      sepal_length  sepal_width  petal_length  petal_width    species
        0             5.1          3.5           1.4          0.2     setosa
        1             4.9          3.0           1.4          0.2     setosa
        2             4.7          3.2           1.3          0.2     setosa
        3             4.6          3.1           1.5          0.2     setosa
        4             5.0          3.6           1.4          0.2     setosa
        ...           ...          ...           ...          ...        ...
        145           6.7          3.0           5.2          2.3  virginica
        146           6.3          2.5           5.0          1.9  virginica
        147           6.5          3.0           5.2          2.0  virginica
        148           6.2          3.4           5.4          2.3  virginica
        149           5.9          3.0           5.1          1.8  virginica

        150 rows × 5 columns


In [10]: import matplotlib.gridspec as gridspec


fig2 = plt.figure(constrained_layout=True, figsize = (7,7))
spec2 = gridspec.GridSpec(ncols=2, nrows=2, figure=fig2)
f2_ax1 = fig2.add_subplot(spec2[0, 0])
sns.scatterplot(ax = f2_ax1, x = iris_df['sepal_length'], y = iris_df['petal_widt
f2_ax1.scatter(centers[0][:, 0], centers[0][:, -1], marker = '*', color = 'royalb
f2_ax1.set_title('Number of Iterations = 1')

f2_ax2 = fig2.add_subplot(spec2[0, 1])
sns.scatterplot(ax = f2_ax2, x = iris_df['sepal_length'], y = iris_df['petal_widt
f2_ax2.scatter(centers[1][:, 0], centers[1][:, -1], marker = '*', color = 'royalb
f2_ax2.set_title('Number of Iterations = 2')

f2_ax3 = fig2.add_subplot(spec2[1, 0])
sns.scatterplot(ax = f2_ax3, x = iris_df['sepal_length'], y = iris_df['petal_widt
f2_ax3.scatter(centers[2][:, 0], centers[2][:, -1], marker = '*', color = 'royalb
f2_ax3.set_title('Number of Iterations = 5')

f2_ax4 = fig2.add_subplot(spec2[1, 1])
sns.scatterplot(ax = f2_ax4, x = iris_df['sepal_length'], y = iris_df['petal_widt
f2_ax4.scatter(centers[3][:, 0], centers[3][:, -1], marker = '*', color = 'royalb
f2_ax4.set_title('Number of Iterations = 10')
fig2.suptitle('Clustering at various iterations')
plt.show()
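The four panels show the centers after 1, 2, 5, and 10 allowed iterations. One way to put a number on that progression is to track the inertia (the sum of squared distances from each point to its assigned center), which does not increase from one Lloyd iteration to the next. The sketch below is a NumPy-only illustration on synthetic data, assuming random-row initialization as in find_clusters; inertia_trace and X_demo are hypothetical names, not part of the notebook.

```python
import numpy as np

def inertia_trace(X, n_clusters, num_iter=10, rseed=0):
    """Run num_iter Lloyd iterations and record the inertia after each assignment step."""
    rng = np.random.RandomState(rseed)
    centers = X[rng.permutation(X.shape[0])[:n_clusters]].astype(float)
    trace = []
    for _ in range(num_iter):
        # Squared distance from every point to every current center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Inertia: sum of squared distances from each point to its assigned center.
        trace.append(dists[np.arange(X.shape[0]), labels].sum())
        # Move each center to the mean of its members (skip empty clusters).
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return np.array(trace)

# Synthetic blobs (illustrative data, not the iris set).
rng = np.random.RandomState(1)
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
                    for c in ([0, 0], [4, 0], [2, 3])])
trace = inertia_trace(X_demo, 3, num_iter=10)
print(trace.round(3))
```

Once the values in the trace stop changing, further iterations leave the centers where they are, which is the situation the later panels are meant to illustrate.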


Using sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans


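In [11]: from sklearn.cluster import KMeans

# algorithm='full' selects the classic Lloyd algorithm; newer scikit-learn releases name this option 'lloyd'.
# No random_state is set, so the cluster numbering can differ between runs.
kmc = KMeans(n_clusters=3, max_iter=600, algorithm = 'full')
X = iris_df.iloc[:, :-1]
kmc.fit(X)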

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\cluster\_kmeans.py:882: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  f"KMeans is known to have a memory leak on Windows "

Out[11]: KMeans(algorithm='full', max_iter=600, n_clusters=3)

In [12]: kmc.cluster_centers_

Out[12]: array([[5.006     , 3.428     , 1.462     , 0.246     ],
                [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
                [6.85      , 3.07368421, 5.74210526, 2.07105263]])

In [13]: kmc.labels_

Out[13]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
                2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
                2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1])

In [14]: pd.crosstab(kmc.labels_, iris_df['species'])

Out[14]: species  setosa  versicolor  virginica
         row_0
         0            50           0          0
         1             0          48         14
         2             0           2         36
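The crosstab can be condensed into a single agreement figure. The sketch below copies the counts from Out[14] and computes the majority species per cluster together with the cluster purity (the share of points that belong to their cluster's majority species). Purity is an illustrative choice here, not a metric the notebook itself uses, and the cluster IDs 0, 1, 2 are arbitrary labels that may be permuted on another run.

```python
import numpy as np

# Counts copied from the crosstab above: rows are cluster IDs, columns are species.
species = ["setosa", "versicolor", "virginica"]
counts = np.array([[50, 0, 0],
                   [0, 48, 14],
                   [0, 2, 36]])

# Majority species per cluster, and the share of points that match it.
majority = [species[j] for j in counts.argmax(axis=1)]
purity = counts.max(axis=1).sum() / counts.sum()

print(majority)
print(round(purity, 4))
```

With these counts, 134 of the 150 flowers fall in a cluster whose majority species matches their own; the 16 mismatches are all between versicolor and virginica.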

Metrics

https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation


In [15]: from sklearn.metrics import silhouette_score



cluster_df = pd.DataFrame(kmc.labels_, columns = ['Cluster ID'])
# cluster_df

full_cluster_df = pd.concat([X.reset_index(drop = True), cluster_df], axis = 1)
full_cluster_df

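# Note: full_cluster_df includes the 'Cluster ID' column, so the labels are also used as a feature here.
# Pass X alone to score the clustering on the four measurements only.
silhouette_score(full_cluster_df, kmc.labels_, metric='euclidean')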

Out[15]: 0.6128676734836785
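The call above passes full_cluster_df, which contains the Cluster ID column alongside the four measurements, so the labels contribute to the distances being scored. Appending the labels as a feature leaves within-cluster distances unchanged and enlarges between-cluster distances, which can only push the silhouette upward. The sketch below illustrates this with a from-definition silhouette in plain NumPy on synthetic data; silhouette_sketch, features, and labels_demo are hypothetical names, and this is not scikit-learn's implementation.

```python
import numpy as np

def silhouette_sketch(X, labels):
    """Mean silhouette coefficient with Euclidean distances, computed from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix (fine for small data sets).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton cluster: score stays 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = D[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Synthetic, partly overlapping blobs with known labels (illustrative data only).
rng = np.random.RandomState(0)
features = np.vstack([rng.normal(loc=c, scale=1.0, size=(25, 2))
                      for c in ([0, 0], [3, 0], [0, 3])])
labels_demo = np.repeat([0, 1, 2], 25)

score_features = silhouette_sketch(features, labels_demo)
# Append the label column as an extra feature, as full_cluster_df does.
score_with_labels = silhouette_sketch(np.column_stack([features, labels_demo]), labels_demo)
print(score_features, score_with_labels)
```

In the notebook's terms, the corresponding change would be to pass X (the four measurement columns) instead of full_cluster_df as the first argument of silhouette_score.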
