
Project – Data Mining

Clustering Analysis
Problem Statement
A leading bank wants to develop a customer segmentation so that it can give promotional offers to its
customers. The bank collected a sample that summarizes the activities of users during the past few
months. You are given the task of identifying the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).

Introduction of Data Set


As we have already imported the necessary Python libraries, we will load the data set and
present an initial understanding of the data.

Rows and Columns


The data set contains 210 Rows and 7 Columns

Description of data set


This is done using the describe command in Python.
We have used this command to obtain the mean, standard deviation and quartile (IQR) ranges.
Data Type Summary
As per the table above, we can see that the variables present in the data set are of float data type.

Is-null Function
After performing the isnull check we conclude that there are no missing values present in
the data set.

Checking for Duplicate values


There are no duplicate values in the data set that need to be treated.
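Below is a minimal sketch of these initial steps in pandas; the file name bank_customers.csv is an assumption for illustration, as the original file name is not shown in the report.

import pandas as pd

df = pd.read_csv("bank_customers.csv")   # assumed file name
print(df.shape)                          # expected (210, 7): rows and columns
print(df.describe().T)                   # mean, standard deviation and quartiles
print(df.dtypes)                         # data types (all float in this data set)
print(df.isnull().sum())                 # missing-value check per column
print(df.duplicated().sum())             # duplicate-row check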
Performing Univariate/Bivariate Analysis

Analysis of Variable – Spending


There are no outliers in this variable.
Spending is positively skewed, with a skewness value of 0.39.

Analysis of Variable – advance_payments


There are no outliers in this variable.
Advance_payments is positively skewed, with a skewness value of 0.38.
Analysis of Variable – Probability_of_full_payment
There are outliers present in this variable.
Probability_of_full_payment is negatively skewed, with a skewness value of -0.53.

Analysis of Variable – current_balance


There are no outliers present in this variable.
Current_balance is positively skewed, with a skewness value of 0.52.
Analysis of Variable – credit_limit
There are no outliers present in this variable.
Credit_limit is positively skewed, with a skewness value of 0.13.

Analysis of Variable – min_payment_amt


There are outliers present in this variable.
min_payment_amt is positively skewed, with a skewness value of 0.40.
Analysis of Variable – max_spent_in_single_shopping
There are no outliers present in this variable.
max_spent_in_single_shopping is positively skewed, with a skewness value of 0.56.
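A minimal sketch of the univariate check applied to each variable above, namely a box plot for outliers and the skewness value, assuming df is the DataFrame loaded earlier:

import matplotlib.pyplot as plt
import seaborn as sns

for col in df.columns:
    sns.boxplot(x=df[col])                              # visual outlier check
    plt.title(f"{col} | skew = {df[col].skew():.2f}")   # skewness value
    plt.show()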
Pair-Plot with different variables

Highlights of Pair Plot Map


 The variable probability_of_full_payment is left-skewed, while the remaining variables are
right-skewed
 None of the variables are normally distributed
 There is good correlation between pairs such as spending & advance_payments, spending &
credit_limit, spending & max_spent_in_single_shopping, and spending & current_balance
Heatmap

There is a strong positive correlation between the following pairs of variables:

 Max_spent_in_single_shopping and Current balance


 Spending and advance payments
 Advance payments and current balance
 Credit limit and Spending
 Spending and Current balance
 Credit limit and Advance Payments
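A minimal sketch of the pair plot and correlation heatmap used for the bivariate/multivariate analysis above, assuming df is the DataFrame loaded earlier:

import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, diag_kind="kde")                    # bivariate relationships
plt.show()

sns.heatmap(df.corr(), annot=True, cmap="coolwarm")  # correlation heatmap
plt.show()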
1.2 Do you think scaling is necessary for clustering in this case? Justify

Normalization of the data is necessary in this case because the variables do not have comparable
variances: some variables have a large variance and some have a small one. If scaling is not done,
a variable with a large range could dominate the analysis and significantly bias the results.
Also, we know that both K-Means and hierarchical clustering are distance-based algorithms.

If we look at the data, we will note that the variables are on different scales; for example,
probability_of_full_payment is a fraction (close to 1), whereas spending takes much larger values.
Hence, to analyze the data further we need to scale the variables so that everything is
standardized.

We will be using Z-score technique to scale the data.
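A minimal sketch of z-score scaling, assuming scikit-learn's StandardScaler is used (the scipy zscore function would give an equivalent result):

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.head())   # scaled data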

Scaled Data below


Outliers in the Data Set

min_payment_amt and probability_of_full_payment have outliers.

After Outlier Treatment
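The exact outlier treatment used in the original notebook is not shown; a common approach is IQR-based capping of the two affected variables, sketched below under that assumption:

def cap_outliers(series):
    # Cap values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in ["min_payment_amt", "probability_of_full_payment"]:
    scaled_df[col] = cap_outliers(scaled_df[col])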

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them

We apply the Ward linkage method to the scaled data; the result is represented with the help
of a dendrogram.

This dendrogram represents all the clusters that are formed using Ward's method.
To find the optimal number of clusters we use truncate_mode = 'lastp'.
Following common practice, we set p = 10.

As per the above dendrogram, we can see that 2 clusters are formed when we cut across the
horizontal lines.

To map these clusters back onto our data set we use the fcluster method.
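A minimal sketch of Ward-linkage clustering, the truncated dendrogram, and mapping the labels back with fcluster, assuming scaled_df is the scaled data from above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

wardlink = linkage(scaled_df, method="ward")

dendrogram(wardlink, truncate_mode="lastp", p=10)   # show the last 10 merged clusters
plt.show()

# Cut the tree into 2 clusters and attach the labels to the original data
df["H_clusters"] = fcluster(wardlink, t=2, criterion="maxclust")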
df1 and df2 Clusters Comparison

Outcomes from the above Cluster Analysis

1. Cluster 1 has a higher spending habit than Cluster 2 when we look at the mean and standard
deviation
2. For the variable advance_payments, Cluster 1 performs better than Cluster 2
3. For the variable max_spent_in_single_shopping, customers from Cluster 1 are spending more
4. For credit_limit there is only a marginal difference between the clusters

Suggestions

1. We can increase the credit limit for both customer groups, as there is little difference
between the two clusters
2. As customers from Cluster 2 lag slightly on advance payments, we can introduce schemes
such as awarding more credit points
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on
the finalized clusters.

K-Means Clustering
n_cluster =3
We apply K-Means clustering; for now we arbitrarily set n_clusters = 3 and check the data,
and we can see that the cluster labels take the values 0, 1 and 2.

K-Means Clustering
n_cluster =2
We apply K-Means clustering; for now we arbitrarily set n_clusters = 2 and check the data,
and we can see that the cluster labels take the values 0 and 1.
Now, to find the optimum number of clusters, we use the elbow method and WSS.
WSS (within-cluster sum of squares) is computed by taking the distance between every observation
and the centroid of its cluster, squaring it, and summing the squared distances.
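A minimal sketch of the elbow method: fit K-Means for a range of k values and record the WSS, which scikit-learn exposes as inertia_ (the random_state value is an arbitrary choice here):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)   # within-cluster sum of squares for k clusters

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS")
plt.show()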

K-Elbow Graph

The optimum number of clusters can be 2 or 3.


As we can see from the graph, there is a significant drop in WSS from 1 cluster to 3 clusters,
after which the decrease is only marginal.
For now, we will consider both 2 and 3 clusters.

When checking the silhouette score for n_clusters = 2, one of the records has a negative
silhouette width.
When checking the silhouette score for n_clusters = 3, no records have negative values.

We therefore consider the optimum number of clusters to be 3, because there are no negative
silhouette widths in that solution.
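A minimal sketch of this silhouette check for k = 2 and k = 3: the average silhouette score plus a count of records with a negative silhouette width:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

for k in (2, 3):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled_df)
    widths = silhouette_samples(scaled_df, labels)
    print(k, round(silhouette_score(scaled_df, labels), 3),
          "negative widths:", (widths < 0).sum())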

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.

Cluster == 0

Cluster == 1
Cluster ==2

Comparing all three clusters with each other, here are the observations.
Observations for the above Clusters

Brief Summary about the Clusters

Customers who come under Cluster 2 (K-Means label = 2) are premium/elite customers, because all
the parameters (variables) are comparatively higher than in the other clusters.
The maximum amount spent in a single shopping trip is similar for Cluster 1 and Cluster 0, but it
is higher for Cluster 2.

The minimum amount paid by customers for monthly purchases is highest for Cluster 0.

The credit limit is lowest for Cluster 0 when compared to Clusters 1 and 2.

Spending (purchases) is highest in Cluster 2, followed by Cluster 1 and then Cluster 0. The same
trend holds for the advance_payments variable.
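A minimal sketch of this cluster profiling: attach the 3-cluster K-Means labels to the unscaled data and compare the mean of every variable per cluster:

from sklearn.cluster import KMeans

df["Kmeans_labels"] = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)
print(df.groupby("Kmeans_labels").mean().round(2))   # per-cluster profile
print(df["Kmeans_labels"].value_counts())            # cluster sizes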

Constructing a Tabular Format for easier analysis and Understanding

We have segregated the Clusters as per below

Cluster == 2 (Premium/Elite Customers – High Spending)

 These customers have a higher spending limit
 Banks can give them more offers, coupons and good credit card offers to further increase their
purchases
 Banks can also introduce loans based on credit limit or current balance
 The reason for the high spending could be higher income groups

Cluster == 1 (Normal Customers – Medium Spending)

 They have an above-average spending habit
 Banks can offer them loyalty rewards based on their current balance

Cluster == 0 (Low Spender Customers)

 We have observed a lower credit score here
 The average spending is very low
 Banks should conduct an analysis as to why these customers are spending less
CART-RF-ANN
Problem Statement
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model
which predicts the claim status and provide recommendations to management.
Use CART, RF & ANN and compare the models' performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis)
Introduction of Data Set
After importing the libraries and the insurance file, here is a sample overview of the data.
There are 10 variables with different data types.

The sample below is obtained using the head command, which shows the first 5 records.

The sample below is obtained using the tail command, which shows the last 5 records.
Rows and Columns
The data set contains 3000 Rows and 10 Columns

Description of data set


This is done using the describe command in Python. We have used this command to obtain the
mean, standard deviation and quartile (IQR) ranges.

The statistics below (mean, median and others) are shown only for the int and float variables.

Also, we can see that Duration has a minimum value of -1, which is odd, and that the means of
Commission and Sales vary a lot.

The command below shows these statistics for all variables, as we have passed the
include='all' parameter to describe.
Data Type Summary
As we can see, the variables are categorized into different data types: int, float and object.
The 10 variables are split between numeric and categorical.
The numeric variables are Age, Commission, Duration and Sales; the rest are categorical.
Our target variable is Claimed.

Is-null Function
After performing the isnull check we conclude that there are no missing values present in the
data set.

Checking for Duplicate values


There are 139 duplicate rows in the data set.
Since there is no unique identifier such as a customer ID, these rows may represent different
customers with identical attributes, hence no treatment is needed.
Getting the Unique Counts for Categorical Variables
Performing Univariate/Bivariate Analysis

Analysis of Variable – Age


The box plot shows that there are many outliers in the variable

Analysis of Variable – Commission


The box plot shows that there are many outliers in the variable
Analysis of Variable – Duration
The box plot shows that there are many outliers in the variable
Analysis of the variable – Sales
The box plot shows that there are many outliers in the variable
Observations-
All the variables stated above have outliers in them, but Sales and Commission have
outliers which might be genuine and might have business impact.
Outlier treatment will be done in the next steps.
Categorical Variable Analysis

Agency_Code

Type
Claimed

Channel
Product Name

Destination
Pair-Plot Analysis

As per the above graph, we can see some degree of relationship between the variables.
Heatmap Analysis

All the numeric variables above are positively correlated with each other.
From the heatmap we can say that the correlation between Sales and Commission is stronger than
that between the other variables.

Converting Categorical Values to Numerical Values


Conversion Check

Proportion of 1s and 0s

As per the table below, there is no class imbalance, as we have reasonable proportions in
both classes.
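A minimal sketch of the categorical-to-numeric conversion and the class-balance check; the file name insurance_part2_data.csv is an assumption for illustration:

import pandas as pd

ins_df = pd.read_csv("insurance_part2_data.csv")        # assumed file name

for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes     # label-encode object columns

print(ins_df.dtypes)                                    # conversion check
print(ins_df["Claimed"].value_counts(normalize=True))   # proportion of 1s and 0s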

2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
The “Claimed” variable is separated out as the target vector before splitting into the training and test sets.

Plotted Graph showing Data before Scaling


After Scaling the data

Loading the libraries and splitting the data into training and test sets
Performing a dimension check for the training and test sets
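A minimal sketch of separating the target, scaling the predictors, and splitting the data 70/30 (the ratio matching the row counts reported later; the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = ins_df.drop("Claimed", axis=1)
y = ins_df["Claimed"]

X_scaled = StandardScaler().fit_transform(X)   # scaling, as shown in the graph above

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)   # dimension check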

Building a Decision Tree Classifier


We initialize the model and then fit it on the data.
The grid search helps us find the best estimator and parameters;
this helps us generate the tree while tuning constraints such as the minimum number of samples
per leaf.
Doing Variable Importance – DTCL
From the table below we can see that Agency_Code is the most important variable and
is the root node.
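A minimal sketch of the CART model with a grid search; the parameter grid shown here is illustrative, not the exact grid used in the original notebook:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [10, 25, 50],
    "min_samples_split": [30, 60, 100],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
best_tree = grid.best_estimator_

# Variable importance: Agency_Code comes out as the most important feature
print(pd.Series(best_tree.feature_importances_, index=X.columns)
      .sort_values(ascending=False))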

Getting the Predicted Classes and Probs


Building a Random Forest Classifier
Here we perform a grid search to find the optimal values for the hyper-parameters.

Doing Variable Importance – RFCL


As we can see, here also Agency_Code is the most important variable (the root node).
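A minimal sketch of the Random Forest with a grid search over common hyper-parameters; the grid values here are assumptions, not the ones from the original notebook:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = {
    "n_estimators": [101, 301],
    "max_features": [3, 4, 5],
    "min_samples_leaf": [10, 25],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=3)
rf_search.fit(X_train, y_train)
best_rf = rf_search.best_estimator_

print(pd.Series(best_rf.feature_importances_, index=X.columns)
      .sort_values(ascending=False))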

Predicting and Training the Data


Building a Neural Network Classifier
For this we need to scale the data, which was already done above and is also shown in the graph.
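A minimal sketch of the neural-network classifier on the already-scaled data; the hidden-layer size and iteration limit are assumptions:

from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000, random_state=1)
nn_model.fit(X_train, y_train)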

2.3 Performance Metrics: Comment on and check the performance of predictions on the train
and test sets using accuracy, confusion matrices, ROC curve plots and ROC_AUC scores, and
classification reports for each model.
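A minimal sketch of how these metrics can be computed for any of the fitted models above (best_tree, best_rf, nn_model); the helper name evaluate is hypothetical:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

def evaluate(model, X_part, y_part, label):
    pred = model.predict(X_part)
    prob = model.predict_proba(X_part)[:, 1]
    print(label, "accuracy:", round(accuracy_score(y_part, pred), 3))
    print(confusion_matrix(y_part, pred))
    print(classification_report(y_part, pred))
    fpr, tpr, _ = roc_curve(y_part, prob)
    plt.plot(fpr, tpr, label=f"{label} AUC = {roc_auc_score(y_part, prob):.2f}")

evaluate(best_tree, X_train, y_train, "CART train")
evaluate(best_tree, X_test, y_test, "CART test")
plt.legend()
plt.show()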

CART Performance Metrics

AUC and ROC for the Test Data for CART

AUC and ROC for the Train Data for CART


CART Confusion Matrix and Classification Report for the Test data

Data Accuracy

Classification
Report

CART Metrics

Confusion
Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 536
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 113
Customers who have claimed and whom the model also predicts as claiming (true positives) = 164
Customers who have not claimed but whom the model predicts as claiming (false positives) = 87
CART Confusion Matrix and Classification Report for the Training data

Data Accuracy

Classification Report

CART Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 1256
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 268
Customers who have claimed and whom the model also predicts as claiming (true positives) = 379
Customers who have not claimed but whom the model predicts as claiming (false positives) = 195
CART Model Conclusion
Overall, this is a good model, as the test and train results are very similar.

Parameters   Train Data   Test Data
AUC          81%          79%
Accuracy     77%          77%
Precision    66%          65%
F1-Score     62%          62%

Random Forest Model Performance Metrics

AUC and ROC for the Test Data for RF

AUC and ROC for the Train Data for RF


Random Forest Confusion Matrix and Classification Report for the Test data

Data Accuracy

Classification Report

Random Forest
Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 549
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 129
Customers who have claimed and whom the model also predicts as claiming (true positives) = 148
Customers who have not claimed but whom the model predicts as claiming (false positives) = 74
Random Forest Confusion Matrix and Classification Report for the Train data

Data Accuracy

Classification Report

Random Forest
Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 1268
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 264
Customers who have claimed and whom the model also predicts as claiming (true positives) = 383
Customers who have not claimed but whom the model predicts as claiming (false positives) = 155
RF Model Conclusion

Parameters   Train Data   Test Data
AUC          86%          81%
Accuracy     80%          77%
Precision    71%          67%
F1-Score     65%          59%

This is a good model, as the test and train values are fairly similar.

Neural Networks Model Performance Metrics

AUC and ROC for the Test Data for NN

AUC and ROC for the Train Data for NN


NN Confusion Matrix and Classification Report for the Train data

Data Accuracy

Classification
Report

NN Metrics

Confusion
Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 1298
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 315
Customers who have claimed and whom the model also predicts as claiming (true positives) = 332
Customers who have not claimed but whom the model predicts as claiming (false positives) = 155
NN Confusion Matrix and Classification Report for the Test data

Data Accuracy

Classification Report

NN Metrics

Confusion Matrix

Customers who have not claimed and whom the model also predicts as not claiming (true negatives) = 563
Customers who have claimed but whom the model predicts as not claiming (false negatives) = 138
Customers who have claimed and whom the model also predicts as claiming (true positives) = 139
Customers who have not claimed but whom the model predicts as claiming (false positives) = 70
NN Model Conclusion

Parameters   Train Data   Test Data
AUC          81%          80%
Accuracy     77%          76%
Precision    68%          67%
F1-Score     59%          57%

This is a good model, as the test and train values are very similar.
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
Summary of the Models

ROC Curve for Training Data

ROC Curve for Test Data

From the above models and summary of the data we can conclude that Random Forest is
the best model, since it has higher values of Accuracy, Precision, F1-Score, Recall and AUC
compared to the other models.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations

Insurance claims depend on many factors, such as the customer's previous ailments, accidents,
weather conditions, behaviour patterns and vehicle types.

Just like Policy Bazaar, which is now a hit in the market for online insurance buying and claims,
we have also seen here that an online experience helps customers, which leads to more
conversions and aids profit booking.

Looking at Agency_Code JZI, this agency is at the bottom in terms of sales; it needs good
marketing campaigns and strategies, with more focus on SEO-based growth marketing.

The business needs to motivate existing agencies or hire new ones to improve sales and
marketing.

As per our data we have seen about 80% accuracy, so we can cross-sell insurance products based
on the claimed-data pattern.

As per our data and insights, more claims were processed through the Airlines channel, but
sales are higher when agencies are involved.

We need to increase awareness of how customers can claim insurance and what the terms and
conditions are; this will help the company reduce claim cycles and earn good ratings.
We also need to look at insurance fraud and how it can be eliminated; for that we would need to
analyze past data.
