
Labsheet2

The document outlines a lab exercise on K-means clustering applied to a dataset of 4000 drivers, focusing on their mean distance driven per day and mean overspeed percentage. It details the steps taken to import libraries, load the dataset, visualize it with scatter plots, and apply K-means clustering for different values of K (3, 4, 5, and 6). The conclusion drawn from the visual inspection of the clustering results is that K=4 is the optimal value for effectively separating the clusters.

Uploaded by

alpeshoza

24/09/2024, 23:52 Labsheet2

Applications of Machine Learning
Labsheet 2 - K-means Clustering of Drivers Data (10 Marks)
Name: Alpesh Oza
Roll No. : AA.SC.U3CSC2107022
The given data consists of 4000 drivers with id, mean_dist_day and mean_over_speed_perc.
Get the scatter plot of the dataset.
Apply the K-means clustering algorithm with K=3, 4, 5 and 6.
Plot the dataset as clusters.
Visually inspect the plots and infer: in your view, what is the apt value for K?
Step 1: Import the Required Libraries
We start by importing the necessary libraries:
pandas: For handling and manipulating the dataset.
numpy: For numerical operations.
matplotlib: For plotting scatter plots and cluster results.
KMeans from sklearn.cluster: For applying K-means clustering on the data.
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Step 2: Load the Dataset


We load the driver-data.csv file, which holds one row for each of the 4000 drivers and has the following columns:
id: Unique identifier for each driver.
mean_dist_day: Average distance driven per day.


mean_over_speed_perc: Percentage of time the driver spends overspeeding.
We then display the first few rows to understand the dataset structure.
In [2]: # Load the dataset
data = pd.read_csv('driver-data.csv')

# Display the first few rows to understand the dataset


print(data.head())

           id  mean_dist_day  mean_over_speed_perc
0  3423311935          71.24                    28
1  3423313212          52.53                    25
2  3423313724          64.54                    27
3  3423311373          55.69                    22
4  3423310999          54.58                    25
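Beyond looking at the first rows, a few structural checks can confirm that the two feature columns are numeric and complete before clustering. The sketch below is an illustration only: it re-enters the five rows printed above by hand as a stand-in called sample (a hypothetical name), so it does not read driver-data.csv and says nothing about the remaining rows.

```python
import pandas as pd

# The five rows printed above, re-entered by hand.
# This is a stand-in for illustration, not the full 4000-row file.
sample = pd.DataFrame({
    "id": [3423311935, 3423313212, 3423313724, 3423311373, 3423310999],
    "mean_dist_day": [71.24, 52.53, 64.54, 55.69, 54.58],
    "mean_over_speed_perc": [28, 25, 27, 22, 25],
})

# Number of rows and columns
print(sample.shape)

# Column types: both features should be numeric for K-means
print(sample.dtypes)

# Count of missing values per column; K-means cannot handle NaN
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Summary statistics of the two features used for clustering
features = sample[["mean_dist_day", "mean_over_speed_perc"]]
print(features.describe())
```

Running the same four checks on the data frame loaded in Step 2 would cover the whole dataset.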

Step 3: Plot the Scatter Plot of the Data


We visualize the dataset using a scatter plot where:
X-axis: Represents the mean distance driven per day.
Y-axis: Represents the mean percentage of time the driver spends overspeeding.
This plot helps us understand the distribution of the data before applying clustering.
In [3]: # Scatter plot of the dataset
plt.scatter(data['mean_dist_day'], data['mean_over_speed_perc'], color='blue')
plt.title('Scatter plot of Drivers Data')
plt.xlabel('Mean Distance per Day')
plt.ylabel('Mean Overspeed Percentage')
plt.show()


Step 4: Apply K-means Clustering with Different Values of K


We apply K-means clustering using different values of K (number of clusters). Specifically, we try K=3, 4, 5, and 6.
For each value of K, the K-means algorithm assigns drivers to clusters based on their average distance driven and overspeeding
percentage.
The resulting clusters are plotted using different colors to represent different groups.
In [4]: # Function to plot K-means clusters with explicit n_init parameter
def plot_clusters(k, data):
    # Fit K-means on the two features; n_init is set explicitly to 10
    kmeans = KMeans(n_clusters=k, n_init=10)
    data['cluster'] = kmeans.fit_predict(data[['mean_dist_day', 'mean_over_speed_perc']])

    # Plot the clusters, colouring each driver by its assigned cluster label
    plt.scatter(data['mean_dist_day'], data['mean_over_speed_perc'], c=data['cluster'], cmap='viridis')
    plt.title(f'K-means Clustering with K={k}')
    plt.xlabel('Mean Distance per Day')
    plt.ylabel('Mean Overspeed Percentage')
    plt.show()

# Apply K-means clustering with K=3, 4, 5, 6 and plot results
for k in [3, 4, 5, 6]:
    plot_clusters(k, data)
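To make the assignment of drivers to clusters described in Step 4 more concrete, the following is a minimal NumPy sketch of the basic K-means iteration (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. The functions nearest_center and lloyd_kmeans are hypothetical helpers written for this illustration; this is not scikit-learn's implementation, which adds smarter initialisation and multiple restarts. The toy array is synthetic and only mimics the two feature columns.

```python
import numpy as np

def nearest_center(points, centers):
    """Return, for each point, the index of the closest center (Euclidean)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; a teaching sketch, not sklearn's KMeans."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        labels = nearest_center(points, centers)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return nearest_center(points, centers), centers

# Synthetic stand-in for (mean_dist_day, mean_over_speed_perc); not the real data
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal([50, 5], 2.0, size=(100, 2)),
    rng.normal([180, 10], 2.0, size=(100, 2)),
])
labels, centers = lloyd_kmeans(toy, k=2)
print(centers)
```

Because the result depends on the random starting centers, scikit-learn repeats the procedure n_init times and keeps the run with the lowest within-cluster sum of squares, which is why n_init=10 is set explicitly in the cell above.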


Step 5: Determine the Optimal Value of K


After visually inspecting the clustering results for K=3, 4, 5, and 6, we infer the optimal value of K based on how well-separated the clusters are.
A good value for K yields compact clusters that match the visibly distinct groups in the scatter plot, without merging two groups into one or splitting a single group in two.
Step 6: Conclusion

Based on the visual inspection of the scatter plots, we conclude that K=4 is the optimal value for clustering the drivers' data. This value
provides the best separation of clusters, ensuring that each group of drivers is distinct based on their driving behavior (mean distance
per day and mean overspeed percentage).