Data Mining Project Anshul
Clustering Analysis
Problem Statement
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They have collected a sample that summarizes the activities of users during the past few months. You are given the task of identifying the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).
Is-null Function
After applying the isnull() function, we can conclude that there are no missing values present in the dataset.
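A minimal sketch of this missing-value check, assuming a pandas DataFrame; the columns here are stand-ins for the bank's credit card variables, not the actual data:

```python
import pandas as pd

# Hypothetical sample standing in for the credit card dataset
df = pd.DataFrame({
    "spending": [12.3, 15.6, 14.1],
    "advance_payments": [13.0, 14.2, 13.8],
})

# Count missing values per column; a sum of 0 for a column means no NaNs
missing_counts = df.isnull().sum()
print(missing_counts)
```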
There is a strong positive correlation between several of the variables.
Normalization of the data is necessary in this case because the variances of the variables are not aligned: some variables have a large variance while others have a small one. If scaling is not done, a variable with a large range could dominate the analysis.
Also, we know that both K-Means and Hierarchical clustering are distance-based algorithms.
If we look at the data, we will note that some of the variables are on very different scales; for example, probability_of_full_payment and spending differ by several orders of magnitude.
Hence, before analysing the data further, we need to scale the variables so that everything is standardized.
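A minimal sketch of this standardization step using scikit-learn's StandardScaler; the two-feature matrix below is illustrative, not the project data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two features on very different scales (illustrative only)
X = np.array([[100.0, 0.1],
              [200.0, 0.2],
              [300.0, 0.3]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization each column has mean ~0 and unit variance,
# so no single feature dominates the distance computations
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```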
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using a Dendrogram and briefly describe them.
We apply the Ward method to the scaled data; the result is represented with the help of a dendrogram.
This dendrogram shows all the clusters formed by Ward's method.
To find the optimal number of clusters we use truncate_mode = 'lastp'.
Following common practice, we set p = 10.
As per the above dendrogram, we can see that 2 clusters are formed when cutting across the horizontal lines.
To map these cluster labels back into our dataset, we can use the fcluster method.
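The steps above (Ward linkage, truncated dendrogram, flat-cluster mapping) can be sketched with SciPy; the two synthetic blobs below are an assumption standing in for the scaled customer data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(42)
# Two well-separated blobs as a stand-in for the scaled data
X = np.vstack([rng.normal(0, 0.5, (20, 3)),
               rng.normal(5, 0.5, (20, 3))])

Z = linkage(X, method="ward")  # Ward's minimum-variance method

# Plotting step (requires matplotlib), showing only the last 10 merges:
# dendrogram(Z, truncate_mode="lastp", p=10)

# Cut the tree into 2 flat clusters and attach a label to each row
labels = fcluster(Z, t=2, criterion="maxclust")
print(set(labels))
```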
df1 and df2 Clusters Comparison
1. Cluster 1 has a higher spending habit than Cluster 2 when we look at the mean and standard deviation.
2. For the variable Advance_payments, Cluster 1 performs better than Cluster 2.
3. For the variable Max_Spent_in_Single_Shopping, customers from Cluster 1 spend more.
4. For Credit_Limit there is only a marginal difference.
Suggestions
1. We can increase the credit limit for both customer groups, as there is little difference between the two clusters.
2. As customers from Cluster 2 lag slightly on Advance Payments, we can introduce schemes such as awarding more credit points.
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on
the finalized clusters.
K-Means Clustering, n_clusters = 3
We apply K-Means clustering with an arbitrarily chosen n_clusters = 3 and check the data; the cluster output takes the values 0, 1 and 2.
K-Means Clustering, n_clusters = 2
We apply K-Means clustering with an arbitrarily chosen n_clusters = 2 and check the data; the cluster output takes the values 0 and 1.
Now, to find the optimum number of clusters, we use the elbow method together with WSS.
WSS (Within-cluster Sum of Squares) is obtained by taking the distance between every observation and the centroid of its cluster, squaring it, and summing over all observations.
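A sketch of computing WSS over a range of k values, assuming scikit-learn's KMeans (whose inertia_ attribute is exactly this sum) and three synthetic blobs in place of the project data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(5, 0.5, (30, 2)),
               rng.normal(10, 0.5, (30, 2))])

# WSS for k = 1..6: squared distance of every point to its cluster centroid
wss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

# WSS shrinks as k grows; the "elbow" is where the drop flattens out
print([round(w, 1) for w in wss])
```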
K-Elbow Graph
Checking the silhouette scores for n_clusters = 2, one of the records gets a negative score.
Checking the silhouette scores for n_clusters = 3, no records show negative values.
We therefore take the optimum number of clusters to be 3, because no negative silhouette values appear in the records.
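This per-record silhouette check can be sketched with scikit-learn's silhouette_score and silhouette_samples; the three blobs below are assumptions, not the project data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (25, 2)),
               rng.normal(4, 0.4, (25, 2)),
               rng.normal(8, 0.4, (25, 2))])

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

avg_score = silhouette_score(X, labels)    # overall average, in [-1, 1]
per_point = silhouette_samples(X, labels)  # one score per observation
n_negative = int((per_point < 0).sum())    # count of badly-placed points

print(round(avg_score, 3), n_negative)
```

A negative per-point score means that observation sits closer to a neighbouring cluster than to its own, which is why the cluster count with zero negatives was preferred above.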
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
Cluster == 0
Cluster == 1
Cluster == 2
Comparing all three clusters with each other, here are the observations.
Observations for the above Clusters
Customers who fall under Cluster 2 (K-label = 2) are Premium/Elite customers, because all the parameters (variables) are comparatively higher than in the other clusters.
Max spent in single shopping is similar for Cluster 1 and Cluster 0, meaning the average spend on a single shopping trip is comparable; for Cluster 2 it is higher.
The minimum amount paid by the customer for monthly purchases is higher for Cluster 0.
Spending and purchases are highest in Cluster 2, followed by Cluster 1 and then Cluster 0. The same trend holds for the parameter advance_payments.
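Cluster profiles like the ones above are typically built by averaging every variable per cluster label; a sketch with a hypothetical labelled frame (column names and values are assumptions):

```python
import pandas as pd

# Hypothetical clustered data: K-Means labels attached to the original frame
df = pd.DataFrame({
    "spending": [10, 12, 55, 60, 30, 32],
    "advance_payments": [1, 2, 9, 10, 5, 6],
    "Clus_kmeans": [0, 0, 2, 2, 1, 1],
})

# Mean of every variable per cluster: the basis of the profile tables
profile = df.groupby("Clus_kmeans").mean()
print(profile)
```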
Cluster == 2: Premium/Elite customers (high spending)
Cluster == 1: Normal customers (medium spending)
Cluster == 0: Low-spending customers
The sample below is obtained using the head command, which shows the first 5 records.
The sample below is obtained using the tail command, which shows the last 5 records.
Rows and Columns
The dataset contains 3,000 rows and 10 columns.
The variables below show the mean, median and other statistics because they are of int and float data types.
Also, we can see that Duration has a minimum value of -1, which is odd.
The means of Commission and Sales vary a lot.
The command below shows the mean, median and other statistics for all variables, since we have included the "all" parameter.
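The difference between the two summaries can be sketched with pandas describe(); the frame below is a small stand-in and its column names are assumptions about the insurance dataset:

```python
import pandas as pd

# Small stand-in for the insurance dataset (names are assumptions)
df = pd.DataFrame({
    "Age": [34, 45, 29, 51],
    "Sales": [20.0, 15.5, 60.0, 33.0],
    "Claimed": ["No", "Yes", "No", "Yes"],
})

# Default describe() covers only the numeric columns...
numeric_stats = df.describe()
# ...while include="all" adds the categorical ones (count, unique, top, freq)
all_stats = df.describe(include="all")

print(all_stats.columns.tolist())
```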
Data Type Summary
As we can see, the variables are categorized into different data types such as int, float and object.
There are 10 variables, divided between numeric and categorical.
The numeric variables are Age, Commission, Duration and Sales; the others are categorical.
Our target variable is Claimed.
Is-null Function
After applying the isnull() function, we conclude that there are no missing values present in the dataset.
The categorical variables are: Agency_Code, Type, Claimed, Channel, Product Name and Destination.
Pair-Plot Analysis
As per the above graph, we can see some degree of relationship between the variables.
Heatmap Analysis
All the numeric variables above are positively correlated with each other.
From the heatmap we can say that the correlation between Sales and Commission is stronger than that between the other variables.
Proportion of 1s and 0s
As per the table below, we can say that there is no class imbalance, as both classes occur in reasonable proportions.
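A sketch of this class-proportion check, assuming the target is a pandas Series named Claimed (the values below are illustrative, not the real distribution):

```python
import pandas as pd

# Assumed target column; values are illustrative only
claimed = pd.Series(["No", "Yes", "No", "No", "Yes",
                     "No", "Yes", "No", "No", "Yes"])

# Relative frequency of each class; similar proportions => no severe imbalance
proportions = claimed.value_counts(normalize=True)
print(proportions)
```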
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
The “Claimed” variable is separated out as the target vector for the training and test sets.
Loading the libraries and splitting the data into training and test sets.
Performing a dimension check on the training and test sets.
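These steps can be sketched with scikit-learn's train_test_split on synthetic stand-in data; the 70/30 ratio is an assumption, not stated in the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # stand-in predictors
y = rng.integers(0, 2, size=100)  # stand-in "Claimed" target

# Assumed 70/30 split; stratify keeps class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)  # the dimension check
```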
CART Metrics (Training Data): Accuracy, Classification Report, Confusion Matrix
CART Metrics (Test Data): Accuracy, Classification Report, Confusion Matrix
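A sketch of how these CART metrics are produced, using a synthetic dataset in place of the insurance data (max_depth = 4 is an assumed pruning choice, not the report's actual setting):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic binary-classification data standing in for the insurance set
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# CART = a decision tree; max_depth limits overfitting on the training set
cart = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, cart.predict(X_train))
test_acc = accuracy_score(y_test, cart.predict(X_test))
cm = confusion_matrix(y_test, cart.predict(X_test))
print(round(train_acc, 3), round(test_acc, 3))
print(cm)
```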
Random Forest Metrics (Training Data): Accuracy, Classification Report, Confusion Matrix
Random Forest Metrics (Test Data): Accuracy, Classification Report, Confusion Matrix
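The Random Forest evaluation can be sketched the same way; n_estimators = 100 is an assumed hyperparameter and the data is again synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = make_classification(n_samples=300, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# An ensemble of decision trees; averaging reduces variance vs a single CART
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

test_acc = accuracy_score(y_test, rf.predict(X_test))
auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
print(round(test_acc, 3), round(auc, 3))
```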
This is a good model, as the test and train metrics are almost identical.
NN Metrics (Training Data): Accuracy, Classification Report, Confusion Matrix
NN Metrics (Test Data): Accuracy, Classification Report, Confusion Matrix
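The ANN step can be sketched with scikit-learn's MLPClassifier; the hidden-layer size is an assumption, and the data is standardized first because neural networks are scale-sensitive:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=6, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)

# Neural networks are scale-sensitive, so standardize before training
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# A small multilayer perceptron standing in for the report's ANN
nn = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=2)
nn.fit(X_train_s, y_train)

test_acc = accuracy_score(y_test, nn.predict(X_test_s))
print(round(test_acc, 3))
```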
This is a good model, as the test and train metrics are almost identical.
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
Summary of the Models
From the above models and the summary of the data, we can conclude that Random Forest is the best model, since it has higher values of Accuracy, Precision, F1-Score, Recall and AUC compared to the other models.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
Just like Policy Bazaar, which is now a hit in the market for buying and claiming insurance online, we have also seen here that an online experience helps customers, leading to more conversions and aiding profit booking.
Looking at the Agency_Code JZI, this agency is at the bottom in terms of sales; it needs good marketing campaigns and strategies.
The company needs to focus more on SEO-based growth marketing.
Businesses/companies need to motivate or hire new agencies to improve their sales and marketing.
As per our data we have seen 80% accuracy, so we could cross-sell insurance products based on the Claimed data pattern.
As per our data and insights, more claims processing happened through Airlines, but sales are higher when Agencies are involved.
We need to increase awareness of how customers can claim insurance and what the terms and conditions are; this will help the company reduce claim cycles and earn good ratings.
We also need to look at insurance fraud and how it can be eliminated; for that we would need to analyse past data.