Project Report - Data Mining
1. Problem 1: Clustering
   1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2. Problem 2: CART-RF-ANN
   2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
   2.2 Data Split: Split the data into test and train; build the classification models CART, Random Forest, and Artificial Neural Network
   2.4 Final Model: Compare all the models and write an inference on which model is best/optimized
   2.5 Inference: Based on the whole analysis, what are the business insights and recommendations
Problem 1: Clustering
A leading bank wants to develop a customer
segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of
users during the past few months. You are given the task
to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).
From the head and tail of the data (the first and last records), we can say the data looks healthy; nothing unusual shows up in the initial records.
The dataset consists of 210 rows and 7 columns, so we have 7 different attributes, all with the same datatype (float).
There are no null entries and no duplicate records.
As all columns are numeric, the summary statistics are presented for each of them: count, mean, std, min, 25%, 50%, 75%, and max.
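A minimal sketch of these initial checks, assuming the data has been read with pandas (the filename credit_card_data.csv is an assumption):

```python
import pandas as pd

# Assumed filename; adjust to the actual dataset file
df = pd.read_csv("credit_card_data.csv")

print(df.head())              # first few records
print(df.tail())              # last few records
print(df.shape)               # expected: (210, 7)
df.info()                     # all 7 attributes should be float
print(df.isnull().sum())      # null counts per column
print(df.duplicated().sum())  # number of duplicate rows
print(df.describe().T)        # count, mean, std, min, 25%, 50%, 75%, max
```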
Univariate Analysis
From the above plots, only min_payment_amt and probability_of_full_payment have outliers.
I am also checking the lower limit, upper limit, IQR, and the percentage of outliers present for each variable (see the sketch after the variable list below).
The limits and outlier percentages were computed for: spending, advance_payments, probability_of_full_payment, current_balance, credit_limit, min_payment_amt, and max_spent_in_single_shopping.
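A sketch of the IQR-based limit and outlier-percentage calculation applied to each of these variables (the 1.5 × IQR rule is assumed, continuing from the DataFrame df above):

```python
# Lower/upper limits via the 1.5 * IQR rule, plus the percentage of outliers
def outlier_summary(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    pct_outliers = ((series < lower) | (series > upper)).mean() * 100
    return lower, upper, iqr, pct_outliers

for col in df.columns:
    lower, upper, iqr, pct = outlier_summary(df[col])
    print(f"{col}: lower={lower:.3f}, upper={upper:.3f}, IQR={iqr:.3f}, outliers={pct:.2f}%")
```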
Here we have outliers in two columns, min_payment_amt and probability_of_full_payment (above the upper limit), as seen with both the box plots and the IQR calculation.
Multivariate analysis
Check for multicollinearity
Here we can see both negative and positive correlations (see the sketch after the list). The strong positive correlations are between:
- spending and advance_payments
- spending and current_balance
- spending and credit_limit
- advance_payments and current_balance
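A sketch of how this multicollinearity check can be produced with a pairplot and a correlation heatmap (standard seaborn/matplotlib calls, continuing from df above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise scatter plots for the bivariate/multivariate view
sns.pairplot(df, diag_kind="kde")
plt.show()

# Correlation matrix to check multicollinearity
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```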
For now, we are not dropping the outlier values; instead of dropping them, we will replace them with their respective medians. The mean gets affected by outliers, so the median is the better option for treating them.
Only two variables have outliers, so treating them (rather than dropping rows) is the best option and we will not lose the other relevant information, which also seems important.
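A sketch of the median-based treatment, assuming values outside the 1.5 × IQR limits are replaced with the column median:

```python
import numpy as np

# Replace values outside the IQR limits with the column median
for col in ["min_payment_amt", "probability_of_full_payment"]:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    median = df[col].median()
    df[col] = np.where((df[col] < lower) | (df[col] > upper), median, df[col])
```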
Now most of the outliers have been treated, and our data is good to go for further analysis.
1.2- Do you think scaling is necessary for clustering in this case?
Justify.
Scaling needs to be done because the variables have values on very different scales. Scaling brings all values into the same range, which is more convenient for further analysis and leaves the data in a cleaner, more consistent form.
Standardization converts the data in our frequency distribution so that each variable has a mean of 0 and a standard deviation of 1. This normalization removes the effect of differing magnitudes and ensures that good quality clusters are generated, which can improve the efficiency of clustering algorithms. So it becomes an essential step before clustering, as Euclidean distance is very sensitive to differences in scale and all dimensions should be equally important.
Here I am using the z-score to standardize the data to the same relative scale of roughly -3 to +3.
Data before and after scaling:
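A sketch of the standardization step (scipy's zscore is shown; sklearn's StandardScaler gives the same result):

```python
from scipy.stats import zscore

# Standardize every column to mean 0 and standard deviation 1
scaled_df = df.apply(zscore)
print(scaled_df.describe().T)   # means ~0, std ~1, values roughly within -3 to +3
```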
1.3 Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
Here I am using all three approaches (sketched after the list below):
1. Linkage method
2. Ward linkage method
3. Agglomerative clustering
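A sketch of the three approaches on the scaled data; the linkage used in the first approach is an assumption (average linkage shown), and 3 clusters are cut from the tree as discussed below:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering

# 1. Linkage method (average linkage assumed) with its dendrogram
link_avg = linkage(scaled_df, method="average")
dendrogram(link_avg, truncate_mode="lastp", p=10)
plt.show()

# 2. Ward linkage method with its dendrogram
link_ward = linkage(scaled_df, method="ward")
dendrogram(link_ward, truncate_mode="lastp", p=10)
plt.show()

# Cut the Ward tree into 3 clusters (labels 1, 2, 3)
df["clusters_linkage"] = fcluster(link_ward, 3, criterion="maxclust")

# 3. Agglomerative clustering from sklearn (labels start at 0)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
df["clusters_agglo"] = agg.fit_predict(scaled_df)

# Profile the clusters on the original (unscaled) variables
print(df.groupby("clusters_linkage").mean())
```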
Here I have shown the results for all the approaches; we can see there is not much difference. As we know, when different approaches are used, minor variations occur.
For cluster grouping based on the dendrograms, 3 clusters look good. This gives us a solution based on spending (high, medium, low).
We have cluster 1 as the highest spending, cluster 2 as medium spending, and cluster 3 as the lowest spending with the linkage methods, and cluster 0 as the lowest spending with Agglomerative clustering.
1.4 Apply K-Means clustering on scaled data and determine optimum
clusters. Apply elbow curve and silhouette score. Explain the results
properly. Interpret and write inferences on the finalized clusters.
Silhouette score.
From the above elbow curve and the silhouette scores, 3 to 4 is the optimal number of clusters.
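A sketch of the elbow curve (within-cluster sum of squares) and silhouette-score computation behind this choice (the random_state is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow curve: inertia (within-cluster sum of squares) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)
plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# Silhouette scores for k = 2..6 (higher is better)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(scaled_df)
    print(k, round(silhouette_score(scaled_df, labels), 3))
```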
Here I am going with 3-group clustering via K-Means, as it makes sense based on the spending pattern (high, medium, low).
3 group cluster via hierarchical clustering
Cluster Profile:
Group1: Highest Spending
Group2: Medium Spending
Group3: Lowest Spending.
Promotional strategies for different clusters:
Group1: Highest Spending Group
• Group 1 customers spend more money, and their advance payments are high compared to the other two clusters, so these people are the main target.
• As the advance payments are also high, increase the credit limit and give loans on their credit cards, as they are customers with good payment records.
• Giving reward points might attract them and increase purchases.
• Also, providing discounted offers on the next transaction for a one-time full payment will be beneficial, as max_spent_in_single_shopping is high.
Group2: Medium Spending Group
• These are potential target customers who are paying their bills, making purchases, and maintaining a good credit score, so here we can increase the credit limit.
• Also, providing some discounts/offers will increase purchases.
• Of the 3 clusters, this group also has the 2nd highest advance payments, so here too we can recommend giving loans on their credit cards.
-----------------------------------------------------------------------------------------
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim
frequency. The management decides to collect data from the past few
years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF &
ANN and compare the models' performances in train and test sets.
Attribute Information:
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
2.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).
• There are a total of 3000 rows and 10 columns.
• No null entries are present.
• Age, Commission, Duration, and Sales have numeric datatypes; the rest have the object datatype.
• There are 9 independent variables and 1 target variable (Claimed).
Here, we have negative entries, which we can say might be wrong entries.
Getting unique values for the categorical variables (a sketch of these checks follows):
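A sketch of these checks, assuming the insurance data has been read into a DataFrame named ins_df (the filename is an assumption):

```python
import pandas as pd

# Assumed filename; adjust to the actual dataset file
ins_df = pd.read_csv("insurance_data.csv")

# Numeric columns that should not be negative
num_cols = ["Age", "Commission", "Duration", "Sales"]
print((ins_df[num_cols] < 0).sum())     # count of negative entries per column

# Unique values for each categorical (object) variable
for col in ins_df.select_dtypes(include="object").columns:
    print(col, ":", ins_df[col].unique())

# Duplicate-row check
print("duplicates:", ins_df.duplicated().sum())
```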
Check for duplicates:
Here there are outliers in all the numeric variables, as Sales and Commission can have extreme values. The Random Forest and CART models can handle this, so we are not treating the outliers now; we will treat them while building the ANN model.
Categorical Variables:
Checking pairwise distribution of the continuous variables:
Checking for Correlations:
Here we can say there is no strong correlation between most of the variables; only Sales and Commission have a correlation of 0.77. As sales increase, commission also increases.
Again checking info(): all object datatypes have been converted to a numeric datatype (int).
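A sketch of one common way to do this conversion, using pandas category codes (label encoding would work the same way):

```python
# Convert every object column to integer category codes
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes

ins_df.info()   # all columns should now show an integer dtype
```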
All the required libraries have been imported.
Extracting the target column into separate vectors for the training set and the test set.
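A sketch of the target extraction and split; the 70:30 ratio and random_state are assumptions:

```python
from sklearn.model_selection import train_test_split

X = ins_df.drop("Claimed", axis=1)   # 9 independent variables
y = ins_df["Claimed"]                # target: claim status

# Assumed 70:30 split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
```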
Data after scaling:
Generating a Tree:
http://webgraphviz.com/
Looking at the feature importances above, the model depends mostly on Agency_Code (63.41%) and Sales (22%).
Model Evaluation
AUC and ROC for the training data
AUC and ROC for the Testing Data
Confusion Matrix for training data-dtcl
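A sketch of the evaluation calls behind these figures for the CART model dtcl; the same pattern is repeated on the test set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_auc_score, roc_curve,
                             confusion_matrix, classification_report)

# AUC and ROC for the training data
train_probs = dtcl.predict_proba(X_train)[:, 1]
print("train AUC:", roc_auc_score(y_train, train_probs))
fpr, tpr, _ = roc_curve(y_train, train_probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()

# Confusion matrix, accuracy, precision, recall, F1 for the training data
train_pred = dtcl.predict(X_train)
print(confusion_matrix(y_train, train_pred))
print(classification_report(y_train, train_pred))

# Repeat with X_test / y_test for the test-set figures
```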
Cart Conclusion:
Train Data:
AUC:82%
Accuracy:79%
Precision:70%
F1-score:60%
Test Data:
AUC:80%
Accuracy:77%
Precision:67%
F1-score:58%
The training and test set results are very close, and with the overall measures reasonably high, this is a good model. Agency_Code is the most important variable for predicting whether insurance is claimed.
Predicting the Training and Testing data:
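A sketch of how the Random Forest model might be fit before these predictions; the hyperparameters are placeholders (the report's tuned values, e.g. from a grid search, are not shown here):

```python
from sklearn.ensemble import RandomForestClassifier

# Placeholder hyperparameters; the tuned values may differ
rfcl = RandomForestClassifier(n_estimators=300, max_depth=6,
                              min_samples_leaf=10, random_state=1)
rfcl.fit(X_train, y_train)

ytrain_predict_rf = rfcl.predict(X_train)
ytest_predict_rf = rfcl.predict(X_test)
```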
RF Model Performance Evaluation on Training data:
RF Model Performance Evaluation on Test data:
Random Forest Conclusion
Train Data:
AUC:86%
Accuracy:80%
Precision:71%
F1-score:65%
Test Data:
AUC:82%
Accuracy:77%
Precision:66%
F1-score:59%
The training and test set results are very close, and with the overall measures reasonably high, this is a good model. Agency_Code is the most important variable for predicting whether insurance is claimed.
Predicting the Training and Testing data:
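A sketch of the ANN step; the features are scaled first (as noted earlier, outliers are treated for the ANN), and sklearn's MLPClassifier with placeholder settings is assumed:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Neural networks are sensitive to feature magnitude, so scale first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Placeholder architecture/settings; the tuned values may differ
nncl = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000, random_state=1)
nncl.fit(X_train_scaled, y_train)

ytrain_predict_nn = nncl.predict(X_train_scaled)
ytest_predict_nn = nncl.predict(X_test_scaled)
```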
NN Model Performance Evaluation on Test data:
Neural Network Conclusion:
Train Data:
AUC:82%
Accuracy:78%
Precision:68%
F1-score:59%
Test Data:
AUC:80%
Accuracy:77%
Precision:67%
F1-score:57%
The training and test set results are very close, and with the overall measures reasonably high, this is a good model.
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
Summarizing the figures reported in the three model conclusions above:

Metric       CART (Train/Test)   RF (Train/Test)    NN (Train/Test)
AUC          82% / 80%           86% / 82%          82% / 80%
Accuracy     79% / 77%           80% / 77%          78% / 77%
Precision    70% / 67%           71% / 66%          68% / 67%
F1-score     60% / 58%           65% / 59%          59% / 57%

All three models generalize well (the train and test figures are close). The Random Forest has the highest AUC on both the train (86%) and test (82%) data and matches or beats the other two models on accuracy and F1-score, so Random Forest is the best/most optimized model here, with CART a close second.
ROC Curve for the 3 models on the Test data:
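A sketch of how the three test-set ROC curves can be drawn on one chart, reusing the dtcl, rfcl, and nncl objects from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

models = {"CART": (dtcl, X_test),
          "Random Forest": (rfcl, X_test),
          "Neural Network": (nncl, X_test_scaled)}

for name, (model, features) in models.items():
    probs = model.predict_proba(features)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```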