24/09/2024, 23:52 Labsheet2
Applications of Machine Learning
Labsheet 2 - K-means Clustering of Drivers Data (10 Marks)
Name: Alpesh Oza
Roll No. : [Link].U3CSC2107022
The given data consists of 4000 drivers with id, mean_dist_day and mean_over_speed_perc.
Get the scatter plot of the dataset. Apply the K-means clustering algorithm with K=3, 4, 5 and 6. Plot the dataset as clusters. Visually inspect
the plots and infer. In your view, what is the apt value for K?
Step 1: Import the Required Libraries
We start by importing the necessary libraries:
pandas: For handling and manipulating the dataset.
numpy: For numerical operations.
matplotlib: For plotting scatter plots and cluster results.
KMeans from sklearn.cluster: For applying K-means clustering on the data.
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Step 2: Load the Dataset
We load the [Link] file, which contains 4000 drivers along with the following columns:
ID: Unique identifier for each driver.
mean_dist_day: Average distance driven per day.
mean_over_speed_perc: Percentage of time the driver spends overspeeding.
We then display the first few rows to understand the dataset structure.
In [2]: # Load the dataset
data = pd.read_csv('[Link]')
# Display the first few rows to understand the dataset
print(data.head())
           id  mean_dist_day  mean_over_speed_perc
0  3423311935          71.24                    28
1  3423313212          52.53                    25
2  3423313724          64.54                    27
3  3423311373          55.69                    22
4  3423310999          54.58                    25
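Beyond printing the first rows, a few quick checks can confirm that the columns were parsed as numbers and that no values are missing. The sketch below is illustrative only: it rebuilds the five rows shown above from an in-memory string, and it assumes a comma-separated layout, which the output above does not confirm. The names sample_csv and sample are hypothetical.

```python
import io

import pandas as pd

# The five rows printed above, rebuilt as an in-memory CSV.
# Assumption: the real file is comma-separated with this header row.
sample_csv = io.StringIO(
    "id,mean_dist_day,mean_over_speed_perc\n"
    "3423311935,71.24,28\n"
    "3423313212,52.53,25\n"
    "3423313724,64.54,27\n"
    "3423311373,55.69,22\n"
    "3423310999,54.58,25\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # numeric dtypes are expected for all three columns
print(sample.isna().sum())  # count of missing values per column
print(sample.describe())    # value ranges of the columns
```

Running the same four calls on the full DataFrame would show whether all 4000 rows loaded and how different the ranges of the two features are.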
Step 3: Plot the Scatter Plot of the Data
We visualize the dataset using a scatter plot where:
X-axis: Represents the mean distance driven per day.
Y-axis: Represents the mean percentage of time the driver spends overspeeding.
This plot helps us understand the distribution of the data before applying clustering.
In [3]: # Scatter plot of the dataset
plt.scatter(data['mean_dist_day'], data['mean_over_speed_perc'], color='blue')
plt.title('Scatter plot of Drivers Data')
plt.xlabel('Mean Distance per Day')
plt.ylabel('Mean Overspeed Percentage')
plt.show()
Step 4: Apply K-means Clustering with Different Values of K
We apply K-means clustering using different values of K (number of clusters). Specifically, we try K=3, 4, 5, and 6.
For each value of K, the K-means algorithm assigns drivers to clusters based on their average distance driven and overspeeding
percentage.
The resulting clusters are plotted using different colors to represent different groups.
In [4]: # Function to plot K-means clusters with explicit n_init parameter
def plot_clusters(k, data):
    kmeans = KMeans(n_clusters=k, n_init=10)  # Explicitly set n_init to 10
    data['cluster'] = kmeans.fit_predict(data[['mean_dist_day', 'mean_over_speed_perc']])
    # Plot the clusters
    plt.scatter(data['mean_dist_day'], data['mean_over_speed_perc'], c=data['cluster'], cmap='viridis')
    plt.title(f'K-means Clustering with K={k}')
    plt.xlabel('Mean Distance per Day')
    plt.ylabel('Mean Overspeed Percentage')
    plt.show()

# Apply K-means clustering with K=3, 4, 5, 6 and plot results
for k in [3, 4, 5, 6]:
    plot_clusters(k, data)
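To make the assignment of drivers to clusters more concrete, here is a minimal NumPy sketch of the standard K-means iteration (Lloyd's algorithm). It is not the scikit-learn implementation used above, which adds k-means++ initialisation and several restarts. The names nearest_centroid and lloyd_kmeans and the synthetic demo_points are assumptions made for illustration.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid (Euclidean distance) for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Toy K-means (Lloyd's algorithm); returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct points drawn from the data.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = nearest_centroid(points, centroids)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids

# Synthetic stand-in for (mean_dist_day, mean_over_speed_perc); not the real data.
demo_rng = np.random.default_rng(42)
group_a = demo_rng.normal(loc=[50.0, 10.0], scale=2.0, size=(100, 2))
group_b = demo_rng.normal(loc=[180.0, 60.0], scale=2.0, size=(100, 2))
demo_points = np.vstack([group_a, group_b])

demo_labels, demo_centroids = lloyd_kmeans(demo_points, k=2)
print(demo_centroids)
```

Because the distance is Euclidean on the raw values, the feature with the larger numeric range has more influence on the assignment step.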
Step 5: Determine the Optimal Value of K
After visually inspecting the clustering results for K=3, 4, 5, and 6, we infer the most suitable value of K from how well separated the
clusters are.
A good value of K gives compact clusters that are clearly separated from one another, with minimal overlap.
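Visual inspection can be complemented with numeric measures of separation. The sketch below is an addition to the lab sheet, not part of the original method: for each K it computes the inertia (within-cluster sum of squares) and the mean silhouette score, where a higher silhouette indicates better separated clusters. It runs on synthetic blobs, and score_k_values, demo_features and best_k are hypothetical names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_k_values(features, k_values, seed=0):
    """Return {k: (inertia, mean silhouette)} for each candidate k."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = (model.inertia_, silhouette_score(features, labels))
    return scores

# Synthetic stand-in: four well-separated blobs (not the drivers data).
rng = np.random.default_rng(0)
centers = np.array([[50.0, 10.0], [50.0, 60.0], [120.0, 10.0], [120.0, 60.0]])
demo_features = np.vstack([rng.normal(c, 3.0, size=(100, 2)) for c in centers])

demo_scores = score_k_values(demo_features, [3, 4, 5, 6])
for k, (inertia, silhouette) in demo_scores.items():
    print(f"K={k}: inertia={inertia:.1f}, silhouette={silhouette:.3f}")

# K with the highest mean silhouette on this synthetic data.
best_k = max(demo_scores, key=lambda k: demo_scores[k][1])
print("Best K by silhouette:", best_k)
```

On the real data, the same function could be called with data[['mean_dist_day', 'mean_over_speed_perc']] to check whether the scores agree with the visual impression.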
Step 6: Conclusion
Based on the visual inspection of the scatter plots, we conclude that K=4 is the optimal value for clustering the drivers' data. This value
provides the best separation of clusters, ensuring that each group of drivers is distinct based on their driving behavior (mean distance
per day and mean overspeed percentage).
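To describe what distinguishes each group once a value of K has been chosen, the cluster sizes and per-cluster feature means can be tabulated. The following is a minimal sketch on synthetic data, assuming the same two column names as the dataset; summarize_clusters and demo are hypothetical names, and the numbers it prints say nothing about the real drivers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def summarize_clusters(frame, feature_cols, k, seed=0):
    """Fit K-means and return the size and feature means of each cluster."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(frame[feature_cols])
    # Mean of every feature within each cluster label.
    summary = frame[feature_cols].groupby(labels).mean()
    # Number of rows (drivers) assigned to each cluster.
    summary.insert(0, "n_drivers", np.bincount(labels, minlength=k))
    summary.index.name = "cluster"
    return summary

# Synthetic stand-in with the dataset's two feature columns (not the real data).
rng = np.random.default_rng(1)
centers = [(50.0, 10.0), (50.0, 60.0), (120.0, 10.0), (120.0, 60.0)]
rows = np.vstack([rng.normal(c, 3.0, size=(50, 2)) for c in centers])
demo = pd.DataFrame(rows, columns=["mean_dist_day", "mean_over_speed_perc"])

demo_summary = summarize_clusters(demo, ["mean_dist_day", "mean_over_speed_perc"], k=4)
print(demo_summary)
```

Applied to the drivers data with K=4, such a table would show, for each cluster, the typical daily distance and overspeed percentage.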