Labsheet2
Applications of Machine Learning Labsheet 2 - K-means Clustering of Drivers Data (10 Marks)
Name: Alpesh Oza
Roll No. : AA.SC.U3CSC2107022
The given data consist of 4000 drivers, each described by id, mean_dist_day and mean_over_speed_perc.
Get the scatter plot of the dataset.
Apply the K-means clustering algorithm with K = 3, 4, 5 and 6.
Plot the dataset as clusters.
Visually inspect the plots and infer: according to you, what is the apt value for K?
Step 1: Import the Required Libraries
We start by importing the necessary libraries:
pandas: For handling and manipulating the dataset.
numpy: For numerical operations.
matplotlib: For plotting scatter plots and cluster results.
KMeans from sklearn.cluster: For applying K-means clustering on the data.
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
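A minimal sketch of the loading step, assuming the data is stored as a CSV with the three columns shown in the preview that follows. The file name is not given here, so the five preview rows are inlined as a stand-in; with the real dataset, the path to the file would be passed to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for the real file (assumption): only the five rows shown in the
# labsheet preview. Replace io.StringIO(SAMPLE_CSV) with the actual CSV path.
SAMPLE_CSV = """id,mean_dist_day,mean_over_speed_perc
3423311935,71.24,28
3423313212,52.53,25
3423313724,64.54,27
3423311373,55.69,22
3423310999,54.58,25
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Quick sanity checks before clustering: size, column names, first rows.
print(df.shape)
print(df.head())

# Only the two behavioural columns are used as clustering features;
# the id column identifies a driver and carries no behavioural information.
features = df[["mean_dist_day", "mean_over_speed_perc"]]
```

Dropping id from the feature set matters because K-means works on distances, and a large arbitrary identifier would otherwise dominate them.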
           id  mean_dist_day  mean_over_speed_perc
0  3423311935          71.24                    28
1  3423313212          52.53                    25
2  3423313724          64.54                    27
3  3423311373          55.69                    22
4  3423310999          54.58                    25
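The scatter plot and the K-means runs for K = 3, 4, 5 and 6 could be sketched as below. This is an illustration under stated assumptions, not the original notebook cells: the points are synthetic stand-in data drawn around arbitrary, made-up centres, and n_init and random_state are choices made here for reproducibility. With the real dataset, X would be the two feature columns mean_dist_day and mean_over_speed_perc.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in data (assumption, NOT the real drivers file).
# The centres below are arbitrary and chosen only to give visible groups.
rng = np.random.default_rng(0)
centres = np.array([[50, 5], [50, 30], [180, 10], [180, 70]])
X = np.vstack([c + rng.normal(scale=[8, 3], size=(100, 2)) for c in centres])

k_values = [3, 4, 5, 6]
fig, axes = plt.subplots(1, len(k_values) + 1, figsize=(20, 4),
                         sharex=True, sharey=True)

# First panel: the raw scatter plot, before any clustering.
axes[0].scatter(X[:, 0], X[:, 1], s=5)
axes[0].set_title("Raw data")

# Remaining panels: one K-means fit per candidate K, coloured by cluster.
labels_by_k = {}
for ax, k in zip(axes[1:], k_values):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels_by_k[k] = model.fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels_by_k[k], s=5, cmap="viridis")
    # Mark the fitted centroids so over-split clusters are easy to spot.
    ax.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
               c="red", marker="x", s=80)
    ax.set_title(f"K = {k}")

for ax in axes:
    ax.set_xlabel("mean_dist_day")
axes[0].set_ylabel("mean_over_speed_perc")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```

Placing the panels side by side with shared axes makes it easier to see where a larger K merely splits a group that already looked compact at a smaller K.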
localhost:8888/doc/tree/Desktop/Labsheet2.ipynb 2/8
24/09/2024, 23:52 Labsheet2
Based on visual inspection of the cluster plots for K = 3, 4, 5 and 6, we conclude that K=4 is the most suitable value for clustering the drivers' data. It gives the clearest separation of clusters, with each group of drivers distinct in terms of driving behavior (mean distance per day and mean overspeed percentage).
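A visual choice of K can be cross-checked numerically. The sketch below is an optional addition, not part of the labsheet's required steps: it compares inertia (within-cluster sum of squares) and the silhouette score across the same candidate K values. It runs on synthetic stand-in data with arbitrary centres, so its numbers say nothing about the real drivers dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption, NOT the real drivers file).
rng = np.random.default_rng(0)
centres = np.array([[50, 5], [50, 30], [180, 10], [180, 70]])
X = np.vstack([c + rng.normal(scale=[8, 3], size=(100, 2)) for c in centres])

inertia = {}
silhouette = {}
for k in [3, 4, 5, 6]:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Inertia tends to shrink as K grows, so look for where the drop levels off.
    inertia[k] = model.inertia_
    # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
    silhouette[k] = silhouette_score(X, model.labels_)

for k in inertia:
    print(f"K={k}  inertia={inertia[k]:.1f}  silhouette={silhouette[k]:.3f}")
```

If the K with the highest silhouette score agrees with the K picked by eye, the visual conclusion is on firmer ground; if they disagree, the plots deserve a second look.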