
Data Science Project VI - Ipynb - Colaboratory

The document describes a customer segmentation clustering project using a dataset containing customer data. The goal is to identify patterns in customer behavior and group customers into clusters to gain insights. The dataset includes variables for customer ID, gender, age, income, and spending score. Data visualization and preprocessing steps are shown, including encoding categorical variables and selecting features for clustering. Hypotheses about relationships between age, gender, and spending are explored through visualizations of the data.

Uploaded by

Jack Frozt

02/09/22 13.41 Data Science Project VI.ipynb - Colaboratory

Dataset Description
Customer Segmentation Clustering

The goal of this project is to find patterns in customer behavior and to group customers into
several clusters that can be used as insights.

Dataset notes

CustomerID = Unique identifier of each customer
Gender = Customer's gender
Age = Customer's age
Annual Income = Customer's annual income (in k$)
Spending Score = Score assigned to the customer based on how they spend their money (1-100)

Import Library

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.cluster import AgglomerativeClustering

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive

df = pd.read_csv(r'/content/drive/MyDrive/digital skola/Dataset18_Clustering_Customer.csv')
df.head()

https://colab.research.google.com/drive/17aNfpDllpoi6wxjLL4xJekIWDpBsMnqb#scrollTo=ort78Scb66TI&printMode=true

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

df.shape

(200, 5)

df.isnull().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

df.describe()

CustomerID Age Annual Income (k$) Spending Score (1-100)

count 200.000000 200.000000 200.000000 200.000000

mean 100.500000 38.850000 60.560000 50.200000

std 57.879185 13.969007 26.264721 25.823522

min 1.000000 18.000000 15.000000 1.000000

25% 50.750000 28.750000 41.500000 34.750000

50% 100.500000 36.000000 61.500000 50.000000

75% 150.250000 49.000000 78.000000 73.000000

max 200.000000 70.000000 137.000000 99.000000

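# Gender is a string column, so only the numeric columns are correlated.
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)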


<matplotlib.axes._subplots.AxesSubplot at 0x7f6b9768b450>

Initial hypotheses

The younger the customer, the more often they spend; the older the customer, the less often they shop.
The higher the income, the more often the customer shops.
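
Before plotting, these hypotheses can be checked roughly with a correlation coefficient. The sketch below is only an illustration: it uses the five rows shown by df.head() rather than the full dataset, and the helper name hypothesis_correlations is hypothetical. A negative age correlation would be consistent with the first hypothesis, a positive income correlation with the second.

```python
import pandas as pd

# First five rows shown by df.head(); a tiny sample used only for illustration.
sample = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

def hypothesis_correlations(frame):
    """Return Pearson correlations of age and income with the spending score."""
    corr = frame.corr()  # every column in the sample is numeric
    target = "Spending Score (1-100)"
    return {
        "age_vs_spending": corr.loc["Age", target],
        "income_vs_spending": corr.loc["Annual Income (k$)", target],
    }

result = hypothesis_correlations(sample)
print(result)
```

On the real notebook the same function could be called on the numeric columns of df; five rows are far too few to confirm or reject anything.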

Data visualization

sns.set(style="whitegrid")

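# distplot is deprecated in seaborn; histplot with kde=True draws the same histogram with a density curve.
sns.histplot(df['Age'], kde=True, color = "blue", bins=20)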

plt.title("Age Distribution Plot", fontsize=14)

plt.xlabel("Age", fontsize=14)

plt.ylabel("count", fontsize=14)

plt.show()

plt.figure(figsize=(14,8))

sns.countplot(x=df['Age'])

plt.title("Age countplot", fontsize=14)

plt.xlabel("Age", fontsize=14)

plt.ylabel("count", fontsize=14)

plt.show()


sns.boxplot(x=df['Age'])

plt.title("Age box plot",fontsize=14)

plt.xlabel("Age", fontsize=14)

plt.show()

sns.violinplot(y="Age", x ="Gender", data = df)

plt.title("Age Violin plot with Gender", fontsize=14)

plt.xlabel("Gender", fontsize=14)

plt.ylabel("Age", fontsize=14)

plt.show()


Spending score

sns.set(style="whitegrid")

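# distplot is deprecated in seaborn; histplot with kde=True draws the same histogram with a density curve.
sns.histplot(df['Spending Score (1-100)'], kde=True, color = "blue", bins=20)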

plt.title("Spending Score (1-100) Distribution Plot", fontsize=14)

plt.xlabel("Spending Score (1-100)", fontsize=14)

plt.ylabel("count", fontsize=14)

plt.show()

sns.boxplot(x=df['Spending Score (1-100)'])

plt.title("Spending Score (1-100) box plot",fontsize=14)

plt.xlabel("Spending Score (1-100)", fontsize=14)

plt.show()

plt.figure(figsize=(26,8))


sns.countplot(x=df['Spending Score (1-100)'])

plt.title("Spending Score (1-100) countplot", fontsize=14)

plt.xlabel("Spending Score (1-100)", fontsize=14)

plt.ylabel("count", fontsize=14)

plt.show()

sns.violinplot(y="Spending Score (1-100)", x ="Gender", data = df)
plt.title("Spending Score Violin plot with Gender", fontsize=14)

plt.xlabel("Gender", fontsize=14)

plt.ylabel("Spending Score", fontsize=14)

plt.show()

Women aged 30-40 spend the most, with an average spending score of 40-50.


Men aged 25-35 have spending scores between 40 and 60.
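
Observations like these are read off the violin plots, and a grouped average is one way to back them up with numbers. The sketch below is a minimal illustration under stated assumptions: it uses only the ten rows printed in the notebook (the first and last five customers), and the age-band edges and labels are hypothetical choices, not values from the notebook.

```python
import pandas as pd

# Ten rows printed in the notebook (first and last five customers); illustration only.
rows = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female",
               "Female", "Female", "Male", "Male", "Male"],
    "Age": [19, 21, 20, 23, 31, 35, 45, 32, 32, 30],
    "Spending Score (1-100)": [39, 81, 6, 77, 40, 79, 28, 74, 18, 83],
})

# Hypothetical age bands; the edges are an assumption, not taken from the notebook.
bins = [17, 25, 35, 45, 70]
labels = ["18-25", "26-35", "36-45", "46-70"]
rows["Age band"] = pd.cut(rows["Age"], bins=bins, labels=labels)

# Mean spending score per gender and age band (empty bands are dropped).
summary = (
    rows.groupby(["Gender", "Age band"], observed=True)["Spending Score (1-100)"]
    .mean()
)
print(summary)
```

Running the same groupby on the full df would show whether the 30-40 and 25-35 ranges quoted above hold for all 200 customers.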


plt.figure(figsize=(18,8))

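# Both lines are cut off at the page margin in the source. The palette list (which starts with 'blue')
# is left out so seaborn's default colours are used, and fontsize=14 is assumed to match the other plots.
sns.scatterplot(x=df['Age'], y=df['Spending Score (1-100)'], hue=df['Gender'])
plt.title("Scatter plot distribution of gender based on Age and spending score",fontsize=14)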
plt.xlabel("Age", fontsize=14)

plt.ylabel("Spending Score", fontsize=14)

plt.show()

Data preprocessing

df.set_index('CustomerID',inplace=True)

df


            Gender  Age  Annual Income (k$)  Spending Score (1-100)
CustomerID
1             Male   19                  15                      39
2             Male   21                  15                      81
3           Female   20                  16                       6
4           Female   23                  16                      77
5           Female   31                  17                      40
...            ...  ...                 ...                     ...
196         Female   35                 120                      79
197         Female   45                 126                      28
198           Male   32                 126                      74
199           Male   32                 137                      18
200           Male   30                 137                      83

200 rows × 4 columns

df_ss = df.copy()

from sklearn.preprocessing import LabelEncoder



le=LabelEncoder()

df_ss['Gender'] = le.fit_transform(df_ss['Gender'])

df_ss.head()

Gender Age Annual Income (k$) Spending Score (1-100)

CustomerID

1 1 19 15 39

2 1 21 15 81

3 0 20 16 6

4 0 23 16 77

5 0 31 17 40
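
The output above shows Male encoded as 1 and Female as 0. LabelEncoder assigns codes in sorted order of the class labels, which the following minimal sketch makes explicit. The input list repeats the Gender values of the first five customers, and the mapping variable is only an illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Gender values of the first five customers, as printed above.
genders = ["Male", "Male", "Female", "Female", "Female"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ holds the labels in sorted order; the position of a label is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))
```

Keeping the fitted encoder around allows le.inverse_transform to turn the codes back into the original labels when the clusters are interpreted.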

Age_Spend = df_ss[['Age','Spending Score (1-100)']].iloc[:,:].values

inertia_list=[]

for i in range(2,9):

  kmeans_us = KMeans(n_clusters=i,n_init=10,max_iter=100, random_state=0)

  kmeans_us.fit(Age_Spend)

  inertia_list.append(kmeans_us.inertia_)

plt.figure(figsize=(15,8))

plt.plot(range(2,9),inertia_list)

plt.xlabel("Num of clusters")

plt.ylabel("Distortion")

plt.show()
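
The quantity plotted on the y-axis is KMeans.inertia_, the sum of squared distances from each sample to its closest cluster centre. A minimal NumPy sketch of that calculation is given below. The toy points are hypothetical and not taken from the customer data, and the inertia helper is written here for illustration only.

```python
import numpy as np

def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy data: two tight groups far apart (not the customer dataset).
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# One cluster: every point is assigned to the overall mean.
one_label = np.zeros(len(points), dtype=int)
one_centroid = points.mean(axis=0, keepdims=True)
inertia_k1 = inertia(points, one_label, one_centroid)

# Two clusters: each group gets its own centroid.
two_labels = np.array([0, 0, 1, 1])
two_centroids = np.array([points[two_labels == k].mean(axis=0) for k in range(2)])
inertia_k2 = inertia(points, two_labels, two_centroids)

print(inertia_k1, inertia_k2)
```

The sharp drop from one cluster to two is the kind of bend the elbow plot is meant to reveal. Beyond the natural number of groups, adding clusters lowers the inertia only slightly.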


!pip install scikit-learn-extra

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/p


Collecting scikit-learn-extra

Downloading scikit_learn_extra-0.2.0-cp37-cp37m-manylinux2010_x86_64.whl (1.7 MB)

|████████████████████████████████| 1.7 MB 5.2 MB/s

Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-package


Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.7/dist-
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-package
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-
Installing collected packages: scikit-learn-extra

Successfully installed scikit-learn-extra-0.2.0

from sklearn_extra.cluster import KMedoids
labels_kmedoid = KMedoids(n_clusters=4).fit_predict(Age_Spend)
labels_kmedoid

array([1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 1, 2,

0, 2, 0, 2, 0, 1, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 3, 2, 3, 1,

0, 1, 3, 1, 1, 1, 3, 1, 1, 3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 1,

3, 3, 1, 1, 3, 3, 3, 3, 3, 1, 3, 1, 1, 3, 3, 1, 3, 3, 1, 3, 3, 1,

1, 3, 3, 1, 3, 3, 1, 1, 3, 1, 3, 1, 1, 3, 3, 1, 3, 1, 3, 3, 3, 3,

3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 2, 1, 2, 3, 2, 0, 2, 0, 2,

1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 3, 2, 0, 2, 0, 2, 0, 2,

0, 2, 0, 2, 0, 2, 3, 2, 0, 2, 0, 2, 0, 2, 0, 1, 0, 2, 0, 2, 0, 2,


0, 2, 0, 2, 0, 2, 0, 2, 3, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,

0, 2])
labels_kmean_pp = KMeans(init='k-means++',n_clusters=4).fit_predict(Age_Spend)

labels_kmean_pp

array([3, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 3, 0, 1, 3, 1,

0, 1, 0, 1, 0, 3, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 2, 3,

0, 3, 2, 3, 3, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3,

2, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 3, 2, 2, 3,

3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2,

2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 1, 3, 1, 2, 1, 0, 1, 0, 1,

3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1,

0, 1, 0, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 3, 0, 1, 0, 1, 0, 1,

0, 1, 0, 1, 0, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,

0, 1], dtype=int32)

import scipy.cluster.hierarchy as shc

from matplotlib import pyplot

pyplot.figure(figsize=(14,8))

pyplot.title("Dendrograms")

dend = shc.dendrogram(shc.linkage(Age_Spend, method="complete"))

import scipy.cluster.hierarchy as shc

from matplotlib import pyplot

pyplot.figure(figsize=(14,8))


pyplot.title("Dendrograms")

dend1 = shc.dendrogram(shc.linkage(Age_Spend, method="single"))

import scipy.cluster.hierarchy as shc

from matplotlib import pyplot

pyplot.figure(figsize=(14,8))

pyplot.title("Dendrograms")

dend2 = shc.dendrogram(shc.linkage(Age_Spend, method="average"))


import scipy.cluster.hierarchy as shc
from matplotlib import pyplot

pyplot.figure(figsize=(14,8))
pyplot.title("Dendrograms")
dend3 = shc.dendrogram(shc.linkage(Age_Spend, method="ward"))
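
A dendrogram only shows the merge order. To get cluster labels from the same linkage matrix, the tree has to be cut at some level. The notebook uses AgglomerativeClustering for that step. As an alternative, the sketch below uses SciPy's fcluster on hypothetical toy points (not the customer data).

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Hypothetical toy points: two well-separated groups (not the customer data).
toy = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
                [9.0, 9.0], [9.5, 9.2], [9.2, 8.8]])

# Same linkage call as in the notebook, here with Ward linkage.
Z = shc.linkage(toy, method="ward")

# Cut the tree so that at most 2 flat clusters remain.
flat_labels = shc.fcluster(Z, t=2, criterion="maxclust")
print(flat_labels)
```

With Age_Spend in place of the toy array and t=4, this would give a flat labelling comparable to the four-cluster agglomerative results, although fcluster numbers its clusters from 1.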

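# Both lines are cut off at the page margin in the source; the endings are completed to match the KMeans and KMedoids calls.
labels_cluster_hierarchical_Ward = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(Age_Spend)
labels_cluster_hierarchical_Complete = AgglomerativeClustering(n_clusters=4, linkage="complete").fit_predict(Age_Spend)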

labels_cluster_hierarchical_Ward

array([0, 3, 2, 3, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 0, 0, 3, 0, 3,

2, 3, 2, 3, 0, 1, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 1, 3, 0, 1,

0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,

1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,

1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,

1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 3, 0, 3, 0, 3, 2, 3, 2, 3,

0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 3, 2, 3, 0, 3, 2, 3, 2, 3, 2, 3,

2, 3, 2, 3, 2, 3, 1, 3, 2, 3, 0, 3, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3,


2, 3, 2, 3, 0, 3, 2, 3, 0, 3, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 3,

2, 3])

labels_cluster_hierarchical_Complete

array([2, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 0, 1, 2, 1,

0, 1, 0, 1, 0, 2, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 1, 0, 2,

0, 1, 3, 2, 2, 2, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2,

3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2,

2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3,

3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 2, 1, 0, 1, 0, 1, 0, 1,

2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,

0, 1, 0, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1,

0, 1, 0, 1, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,

0, 1])

from sklearn.metrics import silhouette_score

from sklearn.metrics import davies_bouldin_score

# The print calls are cut off at the page margin in the source; the label variables are completed from the names defined above.
print("Silhouette score of K-Medoids: ", silhouette_score(Age_Spend, labels_kmedoid), "\n")
print("Silhouette score of K-Means++: ", silhouette_score(Age_Spend, labels_kmean_pp), "\n")
print("Silhouette score of Agglo Hierarchical Ward: ", silhouette_score(Age_Spend, labels_cluster_hierarchical_Ward), "\n")
print("Silhouette score of Agglo Hierarchical Complete: ", silhouette_score(Age_Spend, labels_cluster_hierarchical_Complete), "\n")

print("Davies Bouldin score of K-Medoids: ", davies_bouldin_score(Age_Spend, labels_kmedoid))
print("Davies Bouldin score of K-Means++: ", davies_bouldin_score(Age_Spend, labels_kmean_pp))
print("Davies Bouldin score of Agglo Hierarchical Ward: ", davies_bouldin_score(Age_Spend, labels_cluster_hierarchical_Ward))
print("Davies Bouldin score of Agglo Hierarchical Complete: ", davies_bouldin_score(Age_Spend, labels_cluster_hierarchical_Complete))

Silhouette score of K-Medoids: 0.49888640369265486

Silhouette score of K-Means++: 0.49973941540141753

Silhouette score of Agglo Hierarchical Ward: 0.4602496389565028

Silhouette score of Agglo Hierarchical Complete: 0.49294328457852726

Davies Bouldin score of K-Medoids: 0.6886500456702762

Davies Bouldin score of K-Means++: 0.6869328339833629

Davies Bouldin score of Agglo Hierarchical Ward: 0.8629286547656256

Davies Bouldin score of Agglo Hierarchical Complete: 0.6912818137334898
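
The two metrics point in opposite directions: a higher silhouette score is better, while a lower Davies-Bouldin score is better. The short sketch below ranks the four methods using only the scores printed above. The dictionary and variable names are chosen here for illustration.

```python
# Scores copied from the notebook output above (4 clusters, Age vs. Spending Score).
silhouette = {
    "K-Medoids": 0.49888640369265486,
    "K-Means++": 0.49973941540141753,
    "Agglo Hierarchical Ward": 0.4602496389565028,
    "Agglo Hierarchical Complete": 0.49294328457852726,
}
davies_bouldin = {
    "K-Medoids": 0.6886500456702762,
    "K-Means++": 0.6869328339833629,
    "Agglo Hierarchical Ward": 0.8629286547656256,
    "Agglo Hierarchical Complete": 0.6912818137334898,
}

# Silhouette: higher is better. Davies-Bouldin: lower is better.
best_by_silhouette = max(silhouette, key=silhouette.get)
best_by_davies_bouldin = min(davies_bouldin, key=davies_bouldin.get)

print(best_by_silhouette, best_by_davies_bouldin)
```

Both metrics favour K-Means++, though only by a narrow margin over K-Medoids and complete linkage. This is consistent with the notebook plotting the K-Means++ labels next.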

plt.figure(figsize=(15,8))

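# The four calls are cut off at the page margin in the source after the start of the colour name.
# The colours 'pink', 'orange', 'green', 'red' and the "Cluster N" labels are assumed completions.
plt.scatter(Age_Spend[labels_kmean_pp == 0,0], Age_Spend[labels_kmean_pp == 0,1], c = 'pink', label = 'Cluster 0')
plt.scatter(Age_Spend[labels_kmean_pp == 1,0], Age_Spend[labels_kmean_pp == 1,1], c = 'orange', label = 'Cluster 1')
plt.scatter(Age_Spend[labels_kmean_pp == 2,0], Age_Spend[labels_kmean_pp == 2,1], c = 'green', label = 'Cluster 2')
plt.scatter(Age_Spend[labels_kmean_pp == 3,0], Age_Spend[labels_kmean_pp == 3,1], c = 'red', label = 'Cluster 3')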
plt.legend()


plt.title('Customer Segmentation using Age and spending score', fontsize=14)

plt.xlabel('Age', fontsize=14)

plt.ylabel('Spending Score', fontsize=14)

plt.show()
