Low Code AIML USL Project CreditCardCustomerSegmentation Vijay Borade Aug23
BUSINESS REPORT
Contents
Context
What is Customer Segmentation
Machine Learning for Customer Segmentation
Advantages of Customer Segmentation
Exploring the Customer Dataset and Its Features
Pre-processing Dataset
Implementing K-means Clustering in Python
Pre-processing Dataset
Loading Dataset
Creating a Copy of the Data
Checking Missing Values
Checking Duplicate Values
Exploratory Data Analysis
Univariate Analysis
Bivariate Analysis
Data Processing
Finding the Optimal Number of Clusters
Optimal Value of K = 5
Cluster Visualization
Checking the Silhouette Scores
Visualizing the Silhouette Scores for Different Numbers of Clusters
Hierarchical Clustering
Executive Summary
AllLife Bank wants to focus on its credit card customer base in the next financial year.
The bank has been advised by its marketing research team that its market penetration
can be improved. Based on this input, the Marketing team proposes to run
personalized campaigns to target new customers as well as upsell to existing customers.
Another insight from the market research was that customers perceive the support
services of the bank poorly. Based on this, the Operations team wants to upgrade the
service delivery model to ensure that customer queries are resolved faster. The Head of
Marketing and the Head of Delivery both decide to reach out to the Data Science team for
help.
What is customer segmentation
There are many machine learning algorithms, each suitable for a specific type of problem. One
very common machine learning algorithm that’s suitable for customer segmentation
problems is the k-means clustering algorithm. There are other clustering algorithms as well,
such as DBSCAN, agglomerative clustering, and BIRCH.
Advantages of customer segmentation
Implementing customer segmentation leads to plenty of new business opportunities.
You can do a lot of optimization in:
• budgeting,
• product design,
• promotion,
• marketing,
• customer satisfaction.
Pre-Processing Dataset
Before feeding the data to the k-means clustering algorithm, we need to pre-process
the dataset. Let’s implement the necessary pre-processing for the customer dataset.
Unsupervised Learning
Unsupervised machine learning algorithms can group data points based on similar
attributes in the dataset, without needing labeled outputs. One of the main types of
unsupervised models is clustering models.
Note that supervised learning, by contrast, learns from labeled examples (previous
experience) to predict an output.
Clustering algorithms
• K-Means Clustering
• Agglomerative Hierarchical Clustering
• Expectation-Maximization (EM) Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Mean-Shift Clustering
In the following section, we’re going to analyze the customer segmentation problem
using the k-means clustering algorithm and machine learning. However, before that,
let’s quickly discuss why we’re using the k-means clustering algorithm.
The algorithm discovers groups (clusters) in the data, where the number of clusters is
represented by the value K. The algorithm works iteratively to assign each data point to
one of the K clusters, based on the features provided. All of this makes k-means quite
suitable for the customer segmentation problem.
Data points are grouped according to feature similarity. The output of the K-means
clustering algorithm is the centroid of each of the K clusters and a cluster label for
each data point.
At the end of the implementation, we get a set of clusters along with the cluster to
which each customer belongs.
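The two outputs described above can be illustrated with a minimal sketch. It assumes scikit-learn's KMeans and uses synthetic two-dimensional points as a stand-in for customer features; the data and variable names are illustrative and are not taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for customer features: two well-separated groups.
# (Assumption: the real project uses the bank's customer data instead.)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Fit k-means with K = 2 and collect its two outputs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)          # one cluster label per data point
centroids = kmeans.cluster_centers_     # one centroid per cluster

print("Centroids:\n", centroids)
print("Points per cluster:", np.bincount(labels))
```

Each row of centroids is the mean of the points assigned to that cluster, and labels records which cluster each point belongs to.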
K Means Clustering
Pre-processing dataset
First, we need to import the required Python libraries.
We’ve imported the pandas, NumPy, sklearn, plotly, and matplotlib libraries. pandas
and NumPy are used for data wrangling and manipulation, sklearn is used for
modelling, and plotly along with matplotlib is used to plot graphs and images.
After importing the libraries, our next step is to load the data into a pandas DataFrame.
For this, we use one of the pandas reader functions (for example, read_csv or read_excel).
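A minimal sketch of the loading step follows. The report does not show the file name or the column names, so the ones used here are assumptions for illustration; an in-memory CSV replaces the real file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file. With an actual file,
# the call would look like pd.read_csv("<path to file>") instead.
csv_text = """Customer_Key,Avg_Credit_Limit,Total_Credit_Cards,Total_visits_bank,Total_visits_online,Total_calls_made
87073,100000,2,1,1,0
38414,50000,3,0,10,9
17341,50000,7,1,3,4
"""

# Load the data into a DataFrame.
data = pd.read_csv(io.StringIO(csv_text))

# Work on a copy so the originally loaded data stays untouched.
df = data.copy()

print(df.shape)
```

Working on a copy keeps the loaded data available for later comparison with the processed version.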
Overview of Dataset
The initial steps to get an overview of any dataset are to:
• observe the first few rows of the dataset, to check whether the dataset has been
loaded properly or not
• get information about the number of rows and columns in the dataset
• find out the data types of the columns to ensure that data is stored in the preferred
format and the value of each property is as expected.
• check the statistical summary of the dataset to get an overview of the numerical
columns of the data
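The overview steps listed above can be sketched with standard pandas calls. The small DataFrame and its column names are assumptions used only to keep the example self-contained. The sketch also includes the missing-value and duplicate checks named in the contents.

```python
import pandas as pd

# Hypothetical stand-in for the loaded customer data.
df = pd.DataFrame({
    "Avg_Credit_Limit": [100000, 50000, 50000, 30000, 30000],
    "Total_Credit_Cards": [2, 3, 7, 5, 5],
    "Total_calls_made": [0, 9, 4, 4, 4],
})

# 1. First few rows, to confirm the data loaded properly.
print(df.head())

# 2. Number of rows and columns.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# 3. Data type of each column.
print(df.dtypes)

# 4. Statistical summary of the numerical columns.
summary = df.describe().T
print(summary)

# Additional checks: missing values per column and fully duplicated rows.
missing_per_column = df.isnull().sum()
n_duplicates = df.duplicated().sum()
print(missing_per_column)
print(n_duplicates)
```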
The functions below need to be defined to carry out the exploratory data analysis.
Univariate analysis
Data Processing
Outlier Detection
Scaling
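A minimal sketch of these two steps follows. The report does not state which methods it used, so this assumes the common 1.5 × IQR rule for flagging outliers and scikit-learn's StandardScaler for scaling. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features; the last credit limit is deliberately extreme.
df = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 12000, 15000, 18000, 20000, 200000],
    "Total_Credit_Cards": [2, 3, 4, 5, 6, 8],
})

# Outlier detection with the 1.5 * IQR rule (an assumption, see above).
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()
print(outlier_counts)

# Scaling: give every feature zero mean and unit variance so that no single
# feature dominates the distance computations used in clustering.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.round(2))
```

Scaling matters here because both k-means and hierarchical clustering rely on distances, and a feature measured in tens of thousands would otherwise outweigh one measured in single digits.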
Optimal Value of K = 5
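One common way to arrive at such a value is to fit k-means for a range of K and compare the within-cluster sum of squares (the elbow method) alongside the silhouette score. The sketch below assumes scikit-learn and runs on synthetic blobs, so it illustrates the procedure only and does not reproduce the report's result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data for illustration only (not the report's dataset).
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.6, random_state=1)

inertias = {}      # within-cluster sum of squares, used for the elbow plot
silhouettes = {}   # average silhouette score per K

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

# The K with the highest silhouette score is one candidate for the optimum;
# the elbow in the inertia curve is usually checked alongside it.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("K with highest silhouette score:", best_k)
```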
Cluster Visualization
Hierarchical Clustering
Checking Dendrograms
We see that the cophenetic correlation is maximum with Euclidean distance and average
linkage.
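A comparison of this kind can be sketched with SciPy as follows. The candidate distance metrics and linkage methods are assumptions, and random data stands in for the scaled customer features, so the best combination printed here need not match the one reported above.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Random stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))

# Candidate metrics and linkage methods (assumed lists for illustration).
distance_metrics = ["euclidean", "chebyshev", "cityblock"]
linkage_methods = ["single", "complete", "average", "weighted"]

results = {}
for metric in distance_metrics:
    pairwise = pdist(X, metric=metric)            # condensed distance matrix
    for method in linkage_methods:
        Z = linkage(pairwise, method=method)      # hierarchical merge tree
        c, _ = cophenet(Z, pairwise)              # cophenetic correlation
        results[(metric, method)] = c

best = max(results, key=results.get)
print("Highest cophenetic correlation:", best, round(results[best], 3))
```

The cophenetic correlation measures how faithfully the dendrogram preserves the original pairwise distances, so values closer to 1 indicate a better fit.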
Let's view the dendrograms for the different linkage methods.
1. Dendrogram Visualization:
Plot the Dendrogram: Visualize the hierarchical structure using a dendrogram to
understand how clusters merge at different linkage distances.
2. Cutting the Dendrogram:
Select the Number of Clusters: Decide on the number of clusters by cutting the
dendrogram at an appropriate height or distance.
3. Assigning Cluster Labels:
Cut and Assign Labels: Use the chosen number of clusters to cut the dendrogram and
assign cluster labels to your data points.
4. Cluster Profiling:
Compute Cluster Profiles: Calculate statistics (such as means, medians, standard
deviations) for each feature within each identified cluster.
Visualize Profiles: Create visualizations (like box plots, bar plots) to compare attributes
across different clusters.
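The four steps above can be sketched with SciPy and pandas. The data, the column names, and the choice of two clusters are assumptions for illustration. The dendrogram is computed with no_plot=True to keep the example free of plotting code; dropping that argument draws it with matplotlib.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical data with two clearly different customer groups.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "Avg_Credit_Limit": np.concatenate([rng.normal(20, 2, 30), rng.normal(100, 5, 30)]),
    "Total_Credit_Cards": np.concatenate([rng.normal(3, 0.5, 30), rng.normal(8, 0.5, 30)]),
})

# Standardize the features before computing distances.
scaled = (df - df.mean()) / df.std(ddof=0)

# 1. Build the hierarchy (Euclidean distance, average linkage).
Z = linkage(scaled.to_numpy(), method="average", metric="euclidean")
tree = dendrogram(Z, no_plot=True)  # structure only; no figure is drawn

# 2-3. Cut the tree into a chosen number of clusters and assign labels.
labels = fcluster(Z, t=2, criterion="maxclust")
df["HC_Cluster"] = labels

# 4. Profile each cluster with summary statistics per feature.
profile = df.groupby("HC_Cluster").agg(["mean", "median", "std"])
print(profile.round(2))
```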
You can also mention any differences or similarities you obtained in the cluster profiles from
both the clustering techniques.
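One way to quantify such similarities is to cross-tabulate the labels from both techniques and compute the adjusted Rand index. The sketch below assumes scikit-learn and uses synthetic data with three clear groups; it is not the report's comparison.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[6.0, 6.0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[0.0, 6.0], scale=0.5, size=(40, 2)),
])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

# Cross-tabulation shows how the clusters of one method map onto the other.
table = pd.crosstab(km_labels, hc_labels, rownames=["KMeans"], colnames=["Hierarchical"])
print(table)

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean
# agreement no better than chance. Label numbering does not affect it.
agreement = adjusted_rand_score(km_labels, hc_labels)
print("Adjusted Rand index:", round(agreement, 3))
```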
K-means Clustering
• Algorithm: Iterative algorithm that partitions data into K clusters based on the mean
(centroid) of data points.
• Number of Clusters (K): Requires specifying the number of clusters (K) beforehand.
• Scalability: Scales well for large datasets and is computationally efficient.
• Cluster Shape: Assumes clusters are spherical and evenly sized, which might not fit well
for complex or non-spherical clusters.
• Result Sensitivity: Sensitive to initial centroid selection, often converging to local
optima, which can lead to different results on multiple runs.
• Interpretability: Offers simpler interpretability due to straightforward cluster assignment.
Hierarchical Clustering
• Agglomerative vs Divisive: Builds a hierarchy of clusters either by merging
(agglomerative), starting from individual points and continuing until all data points
belong to one cluster, or by dividing (divisive), starting from a single cluster and
splitting until each data point stands alone.
• Hierarchy Visualization: Produces dendrograms that display the merging/dividing process
and allow choosing the number of clusters post hoc.
• Number of Clusters: Doesn’t require specifying the number of clusters beforehand but
needs a method to determine where to cut the hierarchy.
• Cluster Shape: Can handle clusters of various shapes and sizes, depending on the linkage
method used.
• Computation: Can be more computationally expensive, especially for large datasets.
• Interpretability: Offers more insights into the hierarchical relationships between
clusters due to its tree-like structure.
Choosing Between Them
• Data Structure: Consider the nature of your data; for example, K-means might perform
better on well-separated spherical clusters, while hierarchical clustering might handle
more complex relationships.
• Number of Clusters: If you have a specific number of clusters in mind, K-means might be
more suitable. Otherwise, hierarchical clustering offers flexibility in choosing the
number post hoc.
• Interpretability vs Performance: If interpretability and visual representation of
hierarchy are crucial, hierarchical clustering might be more beneficial. For computational
efficiency and simplicity in assignment, K-means might be preferred.
In practice, it's often beneficial to try both methods and evaluate them based on clustering
quality, domain knowledge, and the specific objectives of your analysis.
Let's create some plots on the original data to understand the customer distribution among
the clusters.