Clustering Algorithms SciKit Learn
[55]: # Reading the dataset in and showing the head of the dataframe
# (these imports are reconstructed; the original import cell is not included
# in this excerpt)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

customers = pd.read_csv('customer.csv')
customers.head(10)
[55]: CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
5 6 Female 22 17 76
6 7 Female 35 18 6
7 8 Female 23 18 94
8 9 Male 64 19 3
9 10 Female 30 19 72
[56]: customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Gender 200 non-null object
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
[57]: customers.describe()
# Annual incomes of male and female customers (these two definitions and the
# histplot calls were cut off in the export; reconstructed to match the axis
# settings and text annotations below)
males_income = customers[customers['Gender']=='Male']['Annual Income (k$)']
females_income = customers[customers['Gender']=='Female']['Annual Income (k$)']
my_bins = range(10,150,10)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18,5))
# males histogram
sns.histplot(males_income, bins=my_bins, kde=False, color='#0066ff', ax=ax1, edgecolor="k", linewidth=2)
ax1.set_xticks(my_bins)
ax1.set_yticks(range(0,24,2))
ax1.set_ylim(0,22)
ax1.set_title('Males')
ax1.set_xlabel('Annual Income (k$)')
ax1.set_ylabel('Count')
ax1.text(85,19, "Mean income: {:.1f}k$".format(males_income.mean()))
ax1.text(85,18, "Median income: {:.1f}k$".format(males_income.median()))
ax1.text(85,17, "Std. deviation: {:.1f}k$".format(males_income.std()))
# females histogram
sns.histplot(females_income, bins=my_bins, kde=False, color='#cc66ff', ax=ax2, edgecolor="k", linewidth=2)
ax2.set_xticks(my_bins)
ax2.set_yticks(range(0,24,2))
ax2.set_ylim(0,22)
ax2.set_title('Females')
ax2.set_xlabel('Annual Income (k$)')
ax2.set_ylabel('Count')
ax2.text(85,19, "Mean income: {:.1f}k$".format(females_income.mean()))
ax2.text(85,18, "Median income: {:.1f}k$".format(females_income.median()))
ax2.text(85,17, "Std. deviation: {:.1f}k$".format(females_income.std()))
# boxplot
sns.boxplot(x='Gender', y='Annual Income (k$)', data=customers, ax=ax3)
ax3.set_title('Boxplot of annual income')
plt.show()
Both the mean and the median income of males are higher than those of females (62.2k$ vs 59.2k$). However, the standard deviation is similar for both groups. There is one outlier in the male group, with an annual income of about 140k$.
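These summary figures can be reproduced directly with a grouped aggregation; a minimal sketch, assuming the `customers` dataframe loaded above:

# Mean, median and standard deviation of annual income per gender
customers.groupby('Gender')['Annual Income (k$)'].agg(['mean', 'median', 'std'])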
males_spending = customers[customers['Gender']=='Male']['Spending Score (1-100)']
# females_spending definition reconstructed (cut off in the export; it is used
# in the females histogram below)
females_spending = customers[customers['Gender']=='Female']['Spending Score (1-100)']
spending_bins = range(0,105,5)
# males histogram
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18,5))
sns.histplot(males_spending, bins=spending_bins, kde=False, color='#0066ff', ax=ax1, edgecolor="k", linewidth=2)
ax1.set_xticks(spending_bins)
ax1.set_xlim(0,100)
ax1.set_yticks(range(0,17,1))
ax1.set_ylim(0,16)
ax1.set_title('Males')
ax1.set_ylabel('Count')
ax1.text(50,15, "Mean spending score: {:.1f}".format(males_spending.mean()))
ax1.text(50,14, "Median spending score: {:.1f}".format(males_spending.median()))
ax1.text(50,13, "Std. deviation score: {:.1f}".format(males_spending.std()))
# females histogram
sns.histplot(females_spending, bins=spending_bins, kde=False, color='#cc66ff', ax=ax2, edgecolor="k", linewidth=2)
ax2.set_xticks(spending_bins)
ax2.set_xlim(0,100)
ax2.set_yticks(range(0,17,1))
ax2.set_ylim(0,16)
ax2.set_title('Females')
ax2.set_ylabel('Count')
ax2.text(50,15, "Mean spending score: {:.1f}".format(females_spending.mean()))
ax2.text(50,14, "Median spending score: {:.1f}".format(females_spending.median()))
ax2.text(50,13, "Std. deviation score: {:.1f}".format(females_spending.std()))  # reconstructed to mirror the males panel
# boxplot
sns.boxplot(x='Gender', y='Spending Score (1-100)', data=customers, ax=ax3)
ax3.set_title('Boxplot of spending score')
plt.show()
[61]: age_bins = range(15,75,5)
medians_by_age_group = customers.groupby(["Gender", pd.cut(customers['Age'], age_bins)]).median()
medians_by_age_group.index = medians_by_age_group.index.set_names(['Gender', 'Age_group'])
medians_by_age_group.reset_index(inplace=True)
# Figure setup and barplot call reconstructed (cut off in the export); the
# keyword arguments below are from the original
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(x='Age_group', y='Annual Income (k$)', hue='Gender',
            data=medians_by_age_group,
            palette=['#cc66ff','#0066ff'],
            alpha=0.7, edgecolor='k',
            ax=ax)
ax.set_title('Median annual income of male and female customers')
ax.set_xlabel('Age group')
plt.show()
It is clear from the bar chart above that the wealthiest customers are 25-45 years old. The largest gaps between women and men are in the 25-30 age group, where men are richer, and in the 50-55 group, where it is the other way around!
(Figure: scatter plots of annual income vs. spending score for each gender; the code cell was cut off in the export.)
For both sexes, there is no significant correlation between the annual income and the spending score of customers.
(Figure: scatter plots of age vs. spending score for each gender; the code cell was cut off in the export.)
There are weak negative correlations (|r| < 0.5) between age and spending score for both sexes.
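Both correlation claims can be checked numerically; a minimal sketch, again using the `customers` dataframe:

# Pairwise Pearson correlations of the numeric features, computed per gender
for g in ['Male', 'Female']:
    subset = customers[customers['Gender'] == g]
    print(g)
    print(subset[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].corr())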
[67]: # Scaling the features; X is assumed to hold the selected feature columns
# (its definition, the StandardScaler import, and the scaler instantiation
# were cut off in the export)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(X)
scaler
[67]: StandardScaler()
In order to find an appropriate number of clusters, the elbow method or the silhouette score can be used. I will show both methods, computing them for numbers of clusters between 2 and 10, for this project. Generally, the rule is to choose the number of clusters where we see a kink or "an elbow" in the inertia graph, or the highest silhouette score:
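For reference, both diagnostics can also be computed directly with scikit-learn instead of the yellowbrick visualizer used below; a minimal sketch, assuming `X` is the feature matrix used throughout:

# Inertia (for the elbow method) and silhouette score for k = 2..9
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=1).fit(X)
    print("k={}: inertia={:.1f}, silhouette={:.3f}".format(
        k, km.inertia_, silhouette_score(X, km.labels_)))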
[68]: # Importing the elbow visualizer
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')   # reconstructed: suppress KMeans fit warnings

model = KMeans(random_state=1)
# Elbow plot using the default metric (distortion)
visualizer = KElbowVisualizer(model, k=(2,10))
visualizer.fit(X)
visualizer.show()
plt.show()
# Silhouette version; this second visualizer definition was cut off in the export
visualizer = KElbowVisualizer(model, k=(2,10), metric='silhouette')
visualizer.fit(X)
visualizer.show()
plt.show()
Both graphs above suggest that k=5 is a reasonable choice for the number of clusters!
[70]: # Fitting K-Means with the chosen k (the input lines of cells [70] and [71]
# were cut off in the export and are reconstructed here)
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

[70]: KMeans(n_clusters=5)

[71]: kmeans.labels_
[71]: array([2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 1, 0, 4, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0], dtype=int32)
2.2 Visualization of Clusters:
[73]: fig1, ax = plt.subplots(figsize=(9, 6))
# Scatter of the K-Means clusters (the plot call was cut off in the export;
# reconstructed on the income/spending plane)
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=customers, hue=kmeans.labels_, palette='Set1', ax=ax)
ax.set_title('K-Means clusters')
plt.show()
# Cluster sizes (the line building `size` was cut off in the export)
size = customers.assign(Cluster=kmeans.labels_).groupby('Cluster').size().to_frame()
size.columns = ["KM_size"]
size
[74]: KM_size
Cluster
0 39
1 37
2 23
3 23
4 78
# 3-D plotly visualization of the clusters (the plotly imports and the
# construction of `data` and `layout` were cut off in the export)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
2.3 DBSCAN:
[76]: from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score   # used in the loop below
from itertools import product

Epsilon (eps) determines the radius of a neighborhood: if a neighborhood of that radius includes enough points, we call it a dense area. min_samples determines the minimum number of data points we want in a neighborhood to define a cluster.
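As a minimal illustration of the two parameters before the grid search below (the eps value here is arbitrary, chosen only for demonstration):

# A single DBSCAN fit; the label -1 marks noise points that fall in no dense area
db = DBSCAN(eps=10, min_samples=5).fit(X)
print(np.unique(db.labels_, return_counts=True))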
[79]: # DBSCAN_params is assumed to be a list of (eps, min_samples) pairs built
# with itertools.product; its definition was cut off in the export
num_of_clusters = []
sil_score = []
for p in DBSCAN_params:
    DBS_clustering = DBSCAN(eps=p[0], min_samples=p[1]).fit(X)
    num_of_clusters.append(len(np.unique(DBS_clustering.labels_)))
    sil_score.append(silhouette_score(X, DBS_clustering.labels_))
The heatmap below shows how many clusters were generated by the DBSCAN algorithm for the respective parameter combinations.
# pivot_1 is assumed to be a pivot table of num_of_clusters over the
# (eps, min_samples) grid; its construction was cut off in the export
fig, ax = plt.subplots(figsize=(16,6))
sns.heatmap(pivot_1, annot=True, annot_kws={"size": 16}, cmap="YlGnBu", ax=ax)
ax.set_title('Number of clusters')
plt.show()
Although the number of clusters varies from 4 to 17, most of the combinations give 4-7 clusters. To decide which combination to choose, I will use the silhouette score and plot it as a heatmap again:
# This pivot is assumed to hold the silhouette scores over the same grid (its
# construction was cut off in the export; it may have had a different name)
fig, ax = plt.subplots(figsize=(18,6))
sns.heatmap(pivot_1, annot=True, annot_kws={"size": 10}, cmap="YlGnBu", ax=ax)
ax.set_title('Silhouette score')
plt.show()
As is clear from the heatmap above, the global maximum is 0.26, reached for eps=12.5 and min_samples=4.
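Instead of reading the maximum off the heatmap, the best combination can also be picked programmatically from the lists filled in the loop above; a minimal sketch:

# The i-th silhouette score corresponds to the i-th (eps, min_samples) pair
best_idx = int(np.argmax(sil_score))
best_eps, best_min_samples = DBSCAN_params[best_idx]
print("Best silhouette: {:.2f} at eps={}, min_samples={}".format(
    sil_score[best_idx], best_eps, best_min_samples))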
[83]: # DBSCAN_size counts the points per cluster for the chosen parameters
# (eps=12.5, min_samples=4); its construction was cut off in the export
DBSCAN_size
Cluster
-1 18
0 112
1 8
2 34
3 24
4 4
The dataframe above shows that the DBSCAN model has created 5 clusters plus an outlier cluster (-1).
# The figure setup and the first scatterplot call were cut off in the export
# and are reconstructed here (DBSCAN_clustered is assumed to hold the features
# plus a 'Cluster' column of DBSCAN labels)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=DBSCAN_clustered[DBSCAN_clustered['Cluster'] != -1],
                hue='Cluster', ax=axes[0], palette='Set1', legend='full', s=45)
sns.scatterplot(x='Age', y='Spending Score (1-100)',
                data=DBSCAN_clustered[DBSCAN_clustered['Cluster'] != -1],
                hue='Cluster', ax=axes[1], palette='Set1', legend='full', s=45)
axes[0].legend()
axes[1].legend()
plt.setp(axes[0].get_legend().get_texts(), fontsize='10')
plt.setp(axes[1].get_legend().get_texts(), fontsize='10')
plt.show()
From the visualization above, we can see that plotting ‘Spending Score’ vs. ‘Annual Income’ gives a better clustering result than ‘Spending Score’ vs. ‘Age’.
[86]: X = customers[['Annual Income (k$)','Spending Score (1-100)']]
[89]: # Calculating the distance matrix based on the euclidean distance between datapoints
# (the imports, the scaled feature matrix `feature_mtx`, and the linkage call
# were cut off in the export; the reconstruction below assumes 'complete' linkage)
from sklearn.metrics.pairwise import euclidean_distances
from scipy.cluster import hierarchy
dist_matrix = euclidean_distances(feature_mtx, feature_mtx)
print(dist_matrix)
Z_using_dist_matrix = hierarchy.linkage(dist_matrix, 'complete')
# Define the leaf label function to include 'Age', 'Annual Income' and 'Gender'
def llf(id):
    age = int(customers['Age'][id])
    income = int(customers['Annual Income (k$)'][id])
    return '[%d %d %s]' % (age, income, customers['Gender'][id])
# Vertical dendrogram
fig = plt.figure(figsize=(18, 50))
dendrogram = hierarchy.dendrogram(Z_using_dist_matrix, leaf_label_func=llf, orientation='right')
plt.tick_params(axis='y', labelsize=8)
plt.title('Dendrogram', fontsize=20)
plt.xlabel('Euclidean Distance', fontsize=15)
plt.ylabel('Customers', fontsize=15)
plt.show()
Now we can use the ‘AgglomerativeClustering’ function from the scikit-learn library to cluster the dataset. AgglomerativeClustering performs hierarchical clustering using a bottom-up approach. The linkage criterion determines the metric used for the merge strategy (a short comparison sketch follows the list):
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing
approach and in this sense is similar to the k-means objective function but tackled with an
agglomerative hierarchical approach.
• Maximum or complete linkage minimizes the maximum distance between observations of pairs
of clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of
clusters.
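To see how the choice of linkage affects the result on this dataset, the three criteria can be compared on the same features; a minimal sketch, assuming the two-column `X` defined in cell [86] (the silhouette score is just one possible yardstick here):

# Fit agglomerative clustering with each linkage and compare silhouette scores
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
for link in ['ward', 'complete', 'average']:
    lab = AgglomerativeClustering(n_clusters=5, linkage=link).fit_predict(X)
    print('{:>8}: silhouette = {:.3f}'.format(link, silhouette_score(X, lab)))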
[92]: # The aggCluster definition was cut off in the export; with the precomputed
# distance matrix, 5 clusters and complete linkage are assumed here (older
# scikit-learn versions take affinity='precomputed' instead of metric=)
from sklearn.cluster import AgglomerativeClustering
aggCluster = AgglomerativeClustering(n_clusters=5, metric='precomputed', linkage='complete')
labels = aggCluster.fit_predict(dist_matrix)
labels
[92]: array([1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4,
1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 2,
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 3, 0, 3, 0, 3,
2, 3, 0, 3, 0, 3, 0, 3, 0, 3, 2, 3, 0, 3, 2, 3, 0, 3, 0, 3, 0, 3,
0, 3, 0, 3, 0, 3, 2, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3,
0, 3, 0, 3, 0, 3, 0, 3, 0, 0, 0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0,
0, 0])
[93]: customers['cluster_'] = labels   # reconstructed: the input cell was cut off in the export
customers.head()

[93]: CustomerID Gender Age Annual Income (k$) Spending Score (1-100) \
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
cluster_
0 1
1 4
2 1
3 4
4 1
2.5 Visualizing Clusters:
[94]: # Visualize the clusters
# (the scatter calls were cut off in the export; a per-cluster loop is
# reconstructed so that plt.legend() has labelled artists)
plt.figure(figsize=(12, 8))
for c in sorted(customers['cluster_'].unique()):
    members = customers[customers['cluster_'] == c]
    plt.scatter(members['Annual Income (k$)'], members['Spending Score (1-100)'],
                label='Cluster {}'.format(c))
plt.title('Hierarchical Clustering')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()
# Counting customers per cluster and gender (the line building clusters_Stat
# and the first output row were cut off in the export; the Female count for
# cluster 0 is recovered by arithmetic: the label array above puts 37 members
# in cluster 0, 21 of them male)
clusters_Stat = customers.groupby(['cluster_', 'Gender'])['cluster_'].count()
clusters_Stat

cluster_  Gender
0         Female    16
          Male      21
1         Female    14
          Male       9
2         Female    51
          Male      34
3         Female    19
          Male      15
4         Female    12
          Male       9
Name: cluster_, dtype: int64