
Project – Data Mining

Clustering Analysis
Problem Statement
A leading bank wants to develop a customer segmentation so that it can give promotional offers to its
customers. The bank collected a sample that summarizes the activities of users during the past few
months. You are given the task of identifying the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).

Introduction of Data Set


As we have already imported the necessary Python libraries, we will load the data set and
present an initial understanding of the data.

Rows and Columns


The data set contains 210 Rows and 7 Columns

Description of data set


This is done using the describe command in Python.
We have used this command to obtain the mean, standard deviation and quartile (IQR) ranges.
Data Type Summary
As per the table above, we can see that the variables present in the data set are of float data type.

Is-null Function
After performing the isnull check we conclude that there are no missing values present in
the data set.

Checking for Duplicate values


There are no duplicate values in the data set that need to be treated.
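Below is a minimal sketch of these initial steps in pandas; the file name bank_customers.csv is an assumption for illustration, as the original file name is not shown in the report.

import pandas as pd

df = pd.read_csv("bank_customers.csv")   # assumed file name
print(df.shape)                          # expected (210, 7): rows and columns
print(df.describe().T)                   # mean, standard deviation and quartiles
print(df.dtypes)                         # data types (all float in this data set)
print(df.isnull().sum())                 # missing-value check per column
print(df.duplicated().sum())             # duplicate-row check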
Performing Univariate/Bivariate Analysis

Analysis of Variable – Spending


There are no outliers in this variable.
Spending is positively skewed, with a skewness value of 0.39.

Analysis of Variable – advance_payments


There are no outliers in this variable.
Advance_payments is positively skewed, with a skewness value of 0.38.
Analysis of Variable – Probability_of_full_payment
There are outliers present in this variable.
Probability_of_full_payment is negatively skewed, with a skewness value of -0.53.

Analysis of Variable – current_balance


There are no outliers present in this variable.
Current_balance is positively skewed, with a skewness value of 0.52.
Analysis of Variable – credit_limit
There are no outliers present in this variable.
Credit_limit is positively skewed, with a skewness value of 0.13.

Analysis of Variable – min_payment_amt


There are outliers present in this variable.
min_payment_amt is positively skewed, with a skewness value of 0.40.
Analysis of Variable – max_spent_in_single_shopping
There are no outliers present in this variable.
max_spent_in_single_shopping is positively skewed, with a skewness value of 0.56.
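A minimal sketch of the univariate check applied to each variable above, namely a box plot for outliers and the skewness value, assuming df is the DataFrame loaded earlier:

import matplotlib.pyplot as plt
import seaborn as sns

for col in df.columns:
    sns.boxplot(x=df[col])                              # visual outlier check
    plt.title(f"{col} | skew = {df[col].skew():.2f}")   # skewness value
    plt.show()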
Pair-Plot with different variables

Highlights of Pair Plot Map


 The variable probability_of_full_payment is left-skewed, while the remaining variables are
right-skewed
 None of the variables are normally distributed
 There is good correlation between pairs such as spending & advance_payments, spending &
credit_limit, spending & max_spent_in_single_shopping, and spending & current_balance
Heatmap

There is a strong positive correlation between the following pairs of variables:

 Max_spent_in_single_shopping and Current balance


 Spending and advance payments
 Advance payments and current balance
 Credit limit and Spending
 Spending and Current balance
 Credit limit and Advance Payments
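A minimal sketch of the pair plot and correlation heatmap used for the bivariate/multivariate analysis above, assuming df is the DataFrame loaded earlier:

import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, diag_kind="kde")                    # bivariate relationships
plt.show()

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # correlation heatmap
plt.show()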
1.2 Do you think scaling is necessary for clustering in this case? Justify

Normalization of the data is necessary in this case because the variables do not have comparable
variances: some variables have a large variance and some have a small one. If scaling is not done,
a variable with a large range could dominate the analysis and significantly bias the results.
Also, we know that both K-Means and hierarchical clustering are distance-based algorithms.

If we look at the data, we will note that the variables are on different scales; for example,
probability_of_full_payment is a fraction (close to 1), whereas spending takes much larger values.
Hence, to analyze the data further we need to scale the variables so that everything is
standardized.

We will be using Z-score technique to scale the data.
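A minimal sketch of z-score scaling, assuming scikit-learn's StandardScaler is used (the scipy zscore function would give an equivalent result):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.head())   # scaled data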

Scaled Data below


Outliers in the Data Set

min_payment_amt and probability_of_full_payment have outliers.

After Outlier Treatment
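The exact outlier treatment used in the original notebook is not shown; a common approach is IQR-based capping of the two affected variables, sketched below under that assumption:

def cap_outliers(series):
    # Cap values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in ["min_payment_amt", "probability_of_full_payment"]:
    scaled_df[col] = cap_outliers(scaled_df[col])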

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them

We apply the Ward linkage method to the scaled data; the result is represented with the help
of a dendrogram.

This dendrogram represents all the clusters that are formed using Ward's method.
To find the optimal number of clusters we use truncate_mode = 'lastp'.
Following common practice, we set p = 10.

As per the above dendrogram, we can see that 2 clusters are formed when we cut across the
horizontal lines.

To map these clusters back onto our data set we use the fcluster method.
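A minimal sketch of Ward-linkage clustering, the truncated dendrogram, and mapping the labels back with fcluster, assuming scaled_df is the scaled data from above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

wardlink = linkage(scaled_df, method="ward")

dendrogram(wardlink, truncate_mode="lastp", p=10)   # show the last 10 merged clusters
plt.show()

# Cut the tree into 2 clusters and attach the labels to the original data
df["H_clusters"] = fcluster(wardlink, t=2, criterion="maxclust")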
df1 and df2 Clusters Comparison

Outcomes from the above Cluster Analysis

1. Cluster 1 has a higher spending habit than Cluster 2 when we look at the mean and standard
deviation
2. For the variable advance_payments, Cluster 1 performs better than Cluster 2
3. For the variable max_spent_in_single_shopping, customers from Cluster 1 are spending more
4. For credit_limit there is only a marginal difference between the clusters

Suggestions

1. We can increase the credit limit for both customer groups, as there is little difference
between the two clusters
2. As customers from Cluster 2 lag slightly on advance payments, we can introduce schemes
such as awarding more credit points
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on
the finalized clusters.

K-Means Clustering
n_cluster =3
We apply K-Means clustering; for now we arbitrarily set n_clusters = 3 and check the data,
and we can see that the cluster labels take the values 0, 1 and 2.

K-Means Clustering
n_cluster =2
We apply K-Means clustering; for now we arbitrarily set n_clusters = 2 and check the data,
and we can see that the cluster labels take the values 0 and 1.
Now, to find the optimum number of clusters, we use the elbow method and WSS.
WSS (within-cluster sum of squares) is computed by taking the distance between every observation
and the centroid of its cluster, squaring it, and summing the squared distances.
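A minimal sketch of the elbow method: fit K-Means for a range of k values and record the WSS, which scikit-learn exposes as inertia_ (the random_state value is an arbitrary choice here):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)   # within-cluster sum of squares for k clusters

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS")
plt.show()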

K-Elbow Graph

The optimum number of clusters can be 2 or 3.


As we can see from the graph, there is a significant drop in WSS from 1 cluster to 3 clusters,
after which the decrease is only marginal.
For now, we will consider both 2 and 3 clusters.

When checking the silhouette score for n_clusters = 2, one of the records has a negative
silhouette width.
When checking the silhouette score for n_clusters = 3, no records have negative values.

We therefore consider the optimum number of clusters to be 3, because there are no negative
silhouette widths in that solution.
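A minimal sketch of this silhouette check for k = 2 and k = 3: the average silhouette score plus a count of records with a negative silhouette width:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

for k in (2, 3):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    widths = silhouette_samples(scaled_df, labels)
    print(k, round(silhouette_score(scaled_df, labels), 3),
          "negative widths:", (widths < 0).sum())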

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.

Cluster == 0

Cluster == 1
Cluster ==2

Comparing all three clusters with each other, here are the observations.
Observations for the above Clusters

Brief Summary about the Clusters

Customers who come under Cluster 2 (K-Means label = 2) are premium/elite customers, because all
the parameters (variables) are comparatively higher than in the other clusters.
The maximum amount spent in a single shopping trip is similar for Cluster 1 and Cluster 0, but it
is higher for Cluster 2.

The minimum amount paid by customers for monthly purchases is highest for Cluster 0.

The credit limit is lowest for Cluster 0 when compared to Clusters 1 and 2.

Spending (purchases) is highest in Cluster 2, followed by Cluster 1 and then Cluster 0. The same
trend holds for the advance_payments variable.
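A minimal sketch of this cluster profiling: attach the 3-cluster K-Means labels to the unscaled data and compare the mean of every variable per cluster:

from sklearn.cluster import KMeans

df["Kmeans_labels"] = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)
print(df.groupby("Kmeans_labels").mean().round(2))   # per-cluster profile
print(df["Kmeans_labels"].value_counts())            # cluster sizes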

Constructing a Tabular Format for easier analysis and Understanding

We have segregated the Clusters as per below

Cluster == 2 (Premium/Elite Customers – High Spending)

 These customers have a higher spending limit
 Banks can give them more offers, coupons and good credit card offers to further increase their
purchases
 Banks can also introduce loans based on credit limit or current balance
 The reason for the high spending could be higher income groups

Cluster == 1 (Normal Customers – Medium Spending)

 They have an above-average spending habit
 Banks can offer them loyalty rewards based on their current balance

Cluster == 0 (Low Spender Customers)

 We have observed a lower credit score here
 The average spending is very low
 Banks should conduct an analysis as to why these customers are spending less
CART-RF-ANN
Problem Statement
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model
which predicts the claim status and provide recommendations to management.
Use CART, RF & ANN and compare the models' performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis)
Introduction of Data Set
After importing the libraries and the insurance file, here is a sample overview of the data.
There are 10 variables with different data types.

The sample below is obtained using the head command, which shows the first 5 records.

The sample below is obtained using the tail command, which shows the last 5 records.
Rows and Columns
The data set contains 3000 Rows and 10 Columns

Description of data set


This is done using the describe command in Python. We have used this command to obtain the
mean, standard deviation and quartile (IQR) ranges.

The statistics below (mean, median and others) are shown only for the int and float variables.

Also, we can see that Duration has a minimum value of -1, which is odd, and that the means of
Commission and Sales vary a lot.

The command below shows these statistics for all variables, as we have passed the
include='all' parameter to describe.
Data Type Summary
As we can see, the variables are categorized into different data types: int, float and object.
The 10 variables are split between numeric and categorical.
The numeric variables are Age, Commission, Duration and Sales; the rest are categorical.
Our target variable is Claimed.

Is-null Function
After performing the isnull check we conclude that there are no missing values present in the
data set.

Checking for Duplicate values


There are 139 duplicate rows in the data set.
Since there is no unique identifier such as a customer ID, these rows may represent different
customers with identical attributes, hence no treatment is needed.
Getting the Unique Counts for Categorical Variables
Performing Univariate/Bivariate Analysis

Analysis of Variable – Age


The box plot shows that there are many outliers in the variable

Analysis of Variable – Commission


The box plot shows that there are many outliers in the variable
Analysis of Variable – Duration
The box plot shows that there are many outliers in the variable
Analysis of the variable – Sales
The box plot shows that there are many outliers in the variable
Observations-
All the variables stated above have outliers in them, but Sales and Commission have
outliers which might be genuine and might have business impact.
Outlier treatment will be done in the next steps.
Categorical Variable Analysis

Agency_Code

Type
Claimed

Channel
Product Name

Destination
Pair-Plot Analysis

As per the above graph, we can see some degree of relationship between the variables.
Heatmap Analysis

All the numeric variables above are positively correlated with each other.
From the heatmap we can say that the correlation between Sales and Commission is stronger than
that between the other variables.

Converting Categorical Values to Numerical Values


Conversion Check

Proportion of 1s and 0s

As per the table below, there is no class imbalance, as we have reasonable proportions in
both classes.
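A minimal sketch of the categorical-to-numeric conversion and the class-balance check; the file name insurance_part2_data.csv is an assumption for illustration:

import pandas as pd

ins_df = pd.read_csv("insurance_part2_data.csv")        # assumed file name

for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes     # label-encode object columns

print(ins_df.dtypes)                                    # conversion check
print(ins_df["Claimed"].value_counts(normalize=True))   # proportion of 1s and 0s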

2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
The “Claimed” variable is separated out as the target vector before splitting into the training and test sets.

Plotted Graph showing Data before Scaling


After Scaling the data

Loading the libraries and splitting the data into training and test sets
Performing a dimension check for the training and test sets
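A minimal sketch of separating the target, scaling the predictors, and splitting the data 70/30 (the ratio matching the row counts reported later; the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = ins_df.drop("Claimed", axis=1)
y = ins_df["Claimed"]

X_scaled = StandardScaler().fit_transform(X)   # scaling, as shown in the graph above

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)   # dimension check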

Building a Decision Tree Classifier


We initialize the model and then fit it on the data.
The grid search helps us find the best estimator and parameters;
this helps us generate the tree while tuning constraints such as the minimum number of samples
per leaf.
Doing Variable Importance – DTCL
From the table below we can see that Agency_Code is the most important variable and
is the root node.
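A minimal sketch of the CART model with a grid search; the parameter grid shown here is illustrative, not the exact grid used in the original notebook:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [10, 25, 50],
    "min_samples_split": [30, 60, 100],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
best_tree = grid.best_estimator_

# Variable importance: Agency_Code comes out as the most important feature
print(pd.Series(best_tree.feature_importances_, index=X.columns)
      .sort_values(ascending=False))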

Getting the Predicted Classes and Probs


Building a Random Forest Classifier
Here we perform a grid search to find the optimal values for the hyper-parameters.

Doing Variable Importance – RFCL


As we can see, here also Agency_Code is the most important variable (the root node).
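A minimal sketch of the Random Forest with a grid search over common hyper-parameters; the grid values here are assumptions, not the ones from the original notebook:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = {
    "n_estimators": [101, 301],
    "max_features": [3, 4, 5],
    "min_samples_leaf": [10, 25],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=3)
rf_search.fit(X_train, y_train)
best_rf = rf_search.best_estimator_

print(pd.Series(best_rf.feature_importances_, index=X.columns)
      .sort_values(ascending=False))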

Predicting and Training the Data


Building a Neural Network Classifier
For this we need to scale the data, which was already done above and is also shown in the graph.
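A minimal sketch of the neural-network classifier on the already-scaled data; the hidden-layer size and iteration limit are assumptions:

from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000, random_state=1)
nn_model.fit(X_train, y_train)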

2.3 Performance Metrics: Comment on and check the performance of predictions on the train
and test sets using accuracy, confusion matrices, ROC curve plots and ROC_AUC scores, and
classification reports for each model.
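A minimal sketch of how these metrics can be computed for any of the fitted models above (best_tree, best_rf, nn_model); the helper name evaluate is hypothetical:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

def evaluate(model, X_part, y_part, label):
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]
    print(label, "accuracy:", round(accuracy_score(y_part, pred), 3))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
    fpr, tpr, _ = roc_curve(y_part, prob)
    plt.plot(fpr, tpr, label=f"{label} AUC = {roc_auc_score(y_part, prob):.2f}")

evaluate(best_tree, X_train, y_train, "CART train")
evaluate(best_tree, X_test, y_test, "CART test")
plt.legend()
plt.show()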

CART Performance Metrics

AUC and ROC for the Test Data for CART

AUC and ROC for the Train Data for CART


CART Confusion Matrix and Classification Report for the Test data

Data Accuracy

Classification
Report

CART Metrics

Confusion
Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 536
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 113
Customers who have claimed and whom the model also predicts as claiming (true positives) = 164
Customers who have not claimed but whom the model predicts as claiming (false positives) = 87
CART Confusion Matrix and Classification Report for the Training data

Data Accuracy

Classification Report

CART Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 1256
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 268
Customers who have claimed and whom the model also predicts as claiming (true positives) = 379
Customers who have not claimed but whom the model predicts as claiming (false positives) = 195
CART Model Conclusion
Overall, this is a good model, as the test and train results are very similar.

Parameters   Train Data   Test Data
AUC          81%          79%
Accuracy     77%          77%
Precision    66%          65%
F1-Score     62%          62%

Random Forest Model Performance Metrics

AUC and ROC for the Test Data for RF

AUC and ROC for the Train Data for RF


Random Forest Confusion Matrix and Classification Report for the Test data

Data Accuracy

Classification Report

Random Forest
Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 549
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 129
Customers who have claimed and whom the model also predicts as claiming (true positives) = 148
Customers who have not claimed but whom the model predicts as claiming (false positives) = 74
Random Forest Confusion Matrix and Classification Report for the Train data

Data Accuracy

Classification Report

Random Forest
Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 1268
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 264
Customers who have claimed and whom the model also predicts as claiming (true positives) = 383
Customers who have not claimed but whom the model predicts as claiming (false positives) = 155
RF Model Conclusion

Parameters   Train Data   Test Data
AUC          86%          81%
Accuracy     80%          77%
Precision    71%          67%
F1-Score     65%          59%

This is a good model, as the test and train values are fairly similar.

Neural Networks Model Performance Metrics

AUC and ROC for the Test Data for NN

AUC and ROC for the Train Data for NN


NN Confusion Matrix and Classification Report for the Train data

Data Accuracy

Classification
Report

NN Metrics

Confusion
Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 1298
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 315
Customers who have claimed and whom the model also predicts as claiming (true positives) = 332
Customers who have not claimed but whom the model predicts as claiming (false positives) = 155
NN Confusion Matrix and Classification Report for the Test data

Data Accuracy

Classification Report

NN Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 563
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 138
Customers who have claimed and whom the model also predicts as claiming (true positives) = 139
Customers who have not claimed but whom the model predicts as claiming (false positives) = 70
NN Model Conclusion

Parameters   Train Data   Test Data
AUC          81%          80%
Accuracy     77%          76%
Precision    68%          67%
F1-Score     59%          57%

This is a good model, as the test and train values are very similar.
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
Summary of the Models

ROC Curve for Training Data

ROC Curve for Test Data

From the above models and summary of the data we can conclude that Random Forest is
the best model, since it has higher values of Accuracy, Precision, F1-Score, Recall and AUC
compared to the other models.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations

Insurance claims depend on many factors, such as the customer's previous ailments, accidents,
weather conditions, behaviour patterns and vehicle types.

Just like Policy Bazaar, which is now a hit in the market for online insurance buying and claims,
we have also seen here that an online experience helps customers, which leads to more
conversions and aids profit booking.

Looking at Agency_Code JZI, this agency is at the bottom in terms of sales; it needs good
marketing campaigns and strategies, with more focus on SEO-based growth marketing.

The business needs to motivate existing agencies or hire new ones to improve sales and
marketing.

As per our data we have seen about 80% accuracy, so we can cross-sell insurance products based
on the claimed-data pattern.

As per our data and insights, more claims were processed through the Airlines channel, but
sales are higher when agencies are involved.

We need to increase awareness of how customers can claim insurance and what the terms and
conditions are; this will help the company reduce claim cycles and earn good ratings.
We also need to look at insurance fraud and how it can be eliminated; for that we would need to
analyze past data.
