Project Report - Data Mining

The document describes a customer segmentation problem for a bank based on credit card usage data. It covers: 1) exploratory data analysis, including univariate, bivariate, and multivariate analysis; 2) hierarchical and K-means clustering to identify the optimal number of customer segments, with three clusters identified representing high-, medium-, and low-spending customers; and 3) promotional strategies tailored to each segment, such as increasing credit limits and offering loans or discounts to high-spending customers, and cross-selling products to low-spending customers.


Table of Contents

1. Problem 1: Clustering
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bivariate, and Multivariate analysis)
1.2 Do you think scaling is necessary for clustering in this case? Justify.
1.3 Apply hierarchical clustering to scaled data. Identify the optimum number of clusters using a Dendrogram and briefly describe them.
1.4 Apply K-Means clustering on scaled data and determine the optimum clusters. Apply the elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.

2. Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bivariate, and Multivariate analysis)
2.2 Data Split: Split the data into test and train; build classification models CART, Random Forest, and Artificial Neural Network.
2.3 Performance Metrics: Check and comment on the performance of predictions on the train and test sets using accuracy, confusion matrix, ROC curve with ROC_AUC score, and classification reports for each model.
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?

Problem 1: Clustering
A leading bank wants to develop a customer
segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of
users during the past few months. You are given the task
to identify the segments based on credit card usage.

1.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).

Looking at the top and bottom of the dataset with the head and tail functions, we can say the data is healthy; the initial and final records look good.

The dataset consists of 210 rows and 7 columns. All 7 attributes have the same datatype, float.
There are no null entries and no duplicate records.

As all columns are numeric, a descriptive summary of each is presented: count, mean, std, min, 25%, 50%, 75%, and max.
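A minimal sketch of these initial steps (the actual CSV file name is not given in the report, so "bank_customer_data.csv" below is a placeholder):

    import pandas as pd

    # Placeholder file name; the report does not state the actual file.
    df = pd.read_csv("bank_customer_data.csv")

    print(df.head())              # top records look healthy
    print(df.tail())              # bottom records look healthy
    print(df.shape)               # expected: (210, 7)
    df.info()                     # all 7 attributes are float
    print(df.isnull().sum())      # no null entries
    print(df.duplicated().sum())  # no duplicate records
    print(df.describe().T)        # count, mean, std, min, quartiles, max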

Univariate Analysis

From the above plots, only min_payment_amt and probability_of_full_payment have outliers.

I am also checking the lower limit, the upper limit, the IQR, and the percentage of outliers present for each variable.
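A sketch of that IQR check, assuming the DataFrame df from the load step and the usual 1.5×IQR fences:

    def outlier_summary(df):
        # Report IQR-based lower/upper limits and the percentage of outliers per column.
        for col in df.columns:
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
            pct = ((df[col] < lower) | (df[col] > upper)).mean() * 100
            print(f"{col}: lower={lower:.3f}, upper={upper:.3f}, IQR={iqr:.3f}, outliers={pct:.1f}%")

    outlier_summary(df)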

Boxplots and IQR limits were examined for each variable: spending, advance_payments, probability_of_full_payment, current_balance, credit_limit, min_payment_amt, and max_spent_in_single_shopping.
Here we have outliers in two columns, min_payment_amt and probability_of_full_payment, as seen with both the box plots and the IQR calculation.

(Considering the amount in dollars)


credit_limit average is around 3.258 (in 10,000s)
max_spent_in_single_shopping average is around 5.408 (in 1,000s)
advance_payments average is around 14.559 (in 100s)
spending average is around 14.847 (in 1,000s)
probability_of_full_payment average is around 0.87 (87%)
current_balance average is around 5.628 (in 1,000s)
min_payment_amt average is around 3.700 (in 100s)

Outliers in min_payment_amt (upper): 1%
Outliers in probability_of_full_payment (lower): 1%

The distribution is right-skewed for all variables except probability_of_full_payment, which has a left tail.

Multivariate analysis
Check for multicollinearity

Here we can see both negative and positive correlations. The strong positive correlations are between:
- spending and advance_payments
- spending and current_balance
- spending and credit_limit
- advance_payments and current_balance

For now we are not dropping the outlier values; instead we will replace them with their respective column medians, since the mean is affected by outliers and the median is therefore the better choice for treatment.
Only two variables have outliers, so treating them is the best option and we will not lose other relevant information, which also seems important.
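A sketch of the median treatment under the same 1.5×IQR fences, touching only the two affected columns:

    import numpy as np

    for col in ["min_payment_amt", "probability_of_full_payment"]:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        # Replace values outside the fences with the column median.
        df[col] = np.where((df[col] < lower) | (df[col] > upper), df[col].median(), df[col])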

Now that the outliers have been treated, our data is good to go for further analysis.

1.2- Do you think scaling is necessary for clustering in this case?
Justify.
Scaling is needed because the variables have very different ranges of values. Scaling brings all values onto the same scale, which makes further analysis more convenient; after scaling the data is cleaner and better organized.
The standard normal transformation converts the data so that each variable has a mean of 0 and a standard deviation of 1. Normalization ensures that good-quality clusters are generated, which can improve the efficiency of the clustering algorithms. It is an essential step before clustering because Euclidean distance is very sensitive to differences in scale, and all dimensions should be treated as equally important.
Here I am using the z-score to standardize the data to roughly the same scale, -3 to +3.
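A sketch of the z-score standardization (scipy's zscore is used here; sklearn's StandardScaler would give the same result):

    from scipy.stats import zscore

    # Standardize every column to mean 0 and standard deviation 1.
    scaled_df = df.apply(zscore)
    print(scaled_df.describe().T[["mean", "std", "min", "max"]])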
Data before and after scaling:

Data looks much better after scaling.

1.3 Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
Here I am using all three approaches:
1. Linkage Method

2. Ward Linkage Method:

3. Agglomerative Clustering:

Here I have shown the results for all the approaches, and we can see there is not much difference between them; different approaches produce only minor variations.
Based on the dendrograms, 3 clusters looks like a good grouping. It gives us a solution based on spending (high, medium, low).
With the linkage methods, cluster 1 is the highest-spending group, cluster 2 is medium spending, and cluster 3 is the lowest spending; in agglomerative clustering, cluster 0 is the lowest-spending group.
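A sketch of the three approaches on the scaled data; the linkage method and plot parameters are illustrative, since the exact settings are not shown in the report:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
    from sklearn.cluster import AgglomerativeClustering

    # 1. Linkage method (average linkage shown as one common choice).
    link_avg = linkage(scaled_df, method="average")
    # 2. Ward linkage method.
    link_ward = linkage(scaled_df, method="ward")

    # Dendrogram for the ward linkage, truncated to the last 10 merges for readability.
    dendrogram(link_ward, truncate_mode="lastp", p=10)
    plt.title("Ward linkage dendrogram")
    plt.show()

    # Cut the tree into 3 clusters (labels 1 to 3).
    df["clusters_ward"] = fcluster(link_ward, 3, criterion="maxclust")

    # 3. Agglomerative clustering (labels 0 to 2).
    df["clusters_agg"] = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(scaled_df)

    # Profile the clusters by their means to label the high/medium/low spending groups.
    print(df.groupby("clusters_ward").mean())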

1.4 Apply K-Means clustering on scaled data and determine optimum
clusters. Apply elbow curve and silhouette score. Explain the results
properly. Interpret and write inferences on the finalized clusters.

After 3-4 clusters there is only a minimal drop in the values (the elbow of the curve).

Silhouette score.

From the above graph and the silhouette scores, 3-4 is the optimal number of clusters.

Here I am going with 3 clusters via K-means, as it makes sense based on the spending pattern (high, medium, low).
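A sketch of the K-means run with the elbow curve and silhouette scores; the range of k and the random seed are assumptions:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    inertia, sil = [], {}
    for k in range(2, 11):
        km = KMeans(n_clusters=k, random_state=1)
        labels = km.fit_predict(scaled_df)
        inertia.append(km.inertia_)
        sil[k] = silhouette_score(scaled_df, labels)

    # Elbow curve: the within-cluster sum of squares flattens after 3-4 clusters.
    plt.plot(range(2, 11), inertia, marker="o")
    plt.xlabel("number of clusters")
    plt.ylabel("within-cluster sum of squares")
    plt.show()

    print(sil)  # silhouette scores also point to 3-4 clusters

    # Final model with 3 clusters.
    df["clusters_kmeans"] = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)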

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
3-group clustering via K-means

3-group clustering via hierarchical clustering

Cluster Profile:
Group1: Highest Spending
Group2: Medium Spending
Group3: Lowest Spending.
Promotional strategies for different clusters:
Group1: Highest Spending Group
• Group 1 customers spend the most and also make the highest advance payments compared to the other two clusters, so these people are the main target.
• As their advance payments are high, increase the credit limit and give loans on their credit cards, as they are customers with good payment records.
• Giving reward points might attract them and increase purchases.
• Providing discounted offers on the next transaction for one-time full payment will also be beneficial, as max_spent_in_single_shopping is high.

Group2: Medium Spending Group
• These are potential target customers who are paying their bills, making purchases, and maintaining a good credit score. So here too we can increase the credit limit.
• Providing some discounts/offers will also increase purchases.
• Among the three clusters this group has the second-highest advance payments, so here as well we can recommend giving loans on their credit cards.

Group3: Lowest Spending Group
• Offers/discounts should be provided for early payment.
• A gentle reminder regarding their payments should be given.
• Also look for opportunities to cross-sell products to these customers, so as to increase purchases.

-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim
frequency. The management decides to collect data from the past few
years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF &
ANN and compare the models' performances in train and test sets.
Attribute Information:
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)

2.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).

• There are a total of 3000 rows and 10 columns.
• No null entries are present.
• Age, Commission, Duration, and Sales have numeric datatypes; the rest have object datatype.
• There are 9 independent variables and 1 target variable (Claimed).

Here we have negative entries, which might be wrong entries.
Getting unique values for the categorical variables:

Check for duplicates:

Although 139 duplicate records are shown, these could belong to different customers; since there is no customer ID or any other unique identifier, I am not dropping them.
Univariate Analysis:

Here there are outliers in all variables, as Sales and Commission can have extreme values. The Random Forest and CART models can handle this, so the outliers are not treated now; we will treat them when building the ANN model.

All 4 variables are positively skewed.

Categorical Variables:

Checking pairwise distribution of the continuous variables:

Checking for Correlations:

Here we can say that there is no strong correlation between most of the variables. Only Sales and Commission have a correlation of 0.77: as sales increase, the commission also increases.

Converting all objects to categorical codes:

Checking info() again, all object datatypes have been converted to numeric (int) codes.
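A sketch of this conversion, assuming the insurance data has been read into a DataFrame named ins_df (the name is an assumption):

    import pandas as pd

    # Convert every object column to integer category codes.
    for col in ins_df.select_dtypes(include="object").columns:
        ins_df[col] = pd.Categorical(ins_df[col]).codes

    ins_df.info()                                          # all columns are now numeric
    print(ins_df["Claimed"].value_counts(normalize=True))  # roughly 69% not claimed, 30% claimed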

Checked the proportion of 0s and 1s (0 = No, 1 = Yes).
So here we have about 69% of the data not claimed and about 30% claimed.

After converting the datatypes to numeric, the skewness of all the variables is checked again.
2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network.
2.3 Performance Metrics: Comment and Check the performance of
Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot
ROC curve and get ROC_AUC score, classification reports for each model.
(Answers for both the questions are given together)

All the required libraries have been imported.
Extracting the target column into separate vectors for the training set and the test set.

Data before scaling:

Data after scaling:

Splitting data into training and test set:
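A sketch of the target extraction, scaling, and split described above; the 70/30 ratio and the random seed are assumptions, since the exact values are not shown in the report:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X = ins_df.drop("Claimed", axis=1)    # the 9 predictor columns
    y = ins_df["Claimed"]                 # target: claim status

    # Scale the predictors (needed mainly for the neural network).
    X_scaled = StandardScaler().fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.30, random_state=1)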

Building a Decision Tree:


Checking for different parameters:

Generating a Tree:

http://webgraphviz.com/
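A sketch of fitting the CART model (dtcl is the estimator name used in the report) and exporting the tree for viewing at webgraphviz.com; the parameter grid shown is illustrative, not the exact grid used:

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    # Illustrative parameter grid; the report's exact values are not shown.
    param_grid = {"max_depth": [4, 6, 8],
                  "min_samples_leaf": [20, 50],
                  "min_samples_split": [60, 150]}
    grid = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=1),
                        param_grid, cv=3)
    grid.fit(X_train, y_train)
    dtcl = grid.best_estimator_

    # Write the tree in DOT format; paste the file contents into webgraphviz.com to view it.
    export_graphviz(dtcl, out_file="tree.dot",
                    feature_names=list(X.columns), filled=True)

    # Feature importances: Agency_Code and Sales dominate.
    print(dict(zip(X.columns, dtcl.feature_importances_)))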

Variable Importance dtcl:

Looking at the feature importances above, the model depends most heavily on Agency_Code (63.41%) and Sales (about 22%).

Predicting on Training and testing data

Getting the Predicted Classes and Probs

Model Evaluation
AUC and ROC for the training data

AUC and ROC for the Testing Data

Confusion Matrix for training data-dtcl

Confusion Matrix for test data-dtcl
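A sketch of these evaluation steps applied to dtcl; the same pattern is repeated for the other two models:

    import matplotlib.pyplot as plt
    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix, roc_auc_score, roc_curve)

    for name, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = dtcl.predict(X_)
        prob = dtcl.predict_proba(X_)[:, 1]      # probability of class 1 (claimed)
        fpr, tpr, _ = roc_curve(y_, prob)
        plt.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_, prob):.2f})")
        print(name, "accuracy:", accuracy_score(y_, pred))
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))

    plt.plot([0, 1], [0, 1], linestyle="--")
    plt.legend()
    plt.show()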

CART Conclusion:
Train Data:
AUC:82%
Accuracy:79%
Precision:70%
F1-score:60%
Test Data:
AUC:80%
Accuracy:77%
Precision:67%
F1-score:58%
The training and test set results are quite similar, and with the overall measures reasonably high, this is a good model. Agency_Code is the most important variable for predicting insurance claims.

Building a Random Forest Classifier:
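A sketch of the Random Forest build; the hyperparameters are illustrative assumptions:

    from sklearn.ensemble import RandomForestClassifier

    rfcl = RandomForestClassifier(n_estimators=300, max_depth=6,
                                  min_samples_leaf=10, random_state=1)
    rfcl.fit(X_train, y_train)

    # Evaluated with the same accuracy / confusion matrix / ROC pattern as the CART model.
    print("train accuracy:", rfcl.score(X_train, y_train))
    print("test accuracy:", rfcl.score(X_test, y_test))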

Predicting the Training and Testing data:

RF Model Performance Evaluation on Training data:

RF Model Performance Evaluation on Test data:

Random Forest Conclusion
Train Data:
AUC:86%
Accuracy:80%
Precision:71%
F1-score:65%
Test Data:
AUC:82%
Accuracy:77%
Precision:66%
F1-score:59%
The training and test set results are quite similar, and with the overall measures reasonably high, this is a good model. Agency_Code is the most important variable for predicting insurance claims.

Building a Neural Network Classifier:
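A sketch of the neural network build using scikit-learn's MLPClassifier on the scaled data; the layer size, tolerance, and iteration limit are assumptions:

    from sklearn.neural_network import MLPClassifier

    nncl = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000,
                         tol=0.01, random_state=1)
    nncl.fit(X_train, y_train)

    # Evaluated with the same accuracy / confusion matrix / ROC pattern as the other models.
    print("train accuracy:", nncl.score(X_train, y_train))
    print("test accuracy:", nncl.score(X_test, y_test))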

Predicting the Training and Testing data:

NN Model Performance Evaluation on Test data:

Neural Network Conclusion:
Train Data:
AUC:82%
Accuracy:78%
Precision:68%
F1-score:59%
Test Data:
AUC:80%
Accuracy:77%
Precision:67%
F1-score:57%

The training and test set results are quite similar, and with the overall measures reasonably high, this is a good model.

2.4 Final Model: Compare all the models and write an inference which
model is best/optimized.

ROC Curve for the 3 models on the Training data

ROC Curve for the 3 models on the Test data:
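A sketch of overlaying the test-set ROC curves of the three fitted models for comparison:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_auc_score, roc_curve

    for name, model in [("CART", dtcl), ("Random Forest", rfcl), ("Neural Network", nncl)]:
        prob = model.predict_proba(X_test)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, prob)
        plt.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_test, prob):.2f})")

    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    plt.legend()
    plt.show()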

Here I am selecting the Random Forest model, as it has better accuracy, precision, recall, and F1-score than the CART and Neural Network models, as can be seen from the table above and from the ROC curves.

2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?
The main objective of the project was to develop a model that predicts claim status for an insurance firm providing tour insurance that is facing higher claim frequency.
As per the data, 90% of the insurance is sold through the online channel, and almost all of the offline business has a claim associated with it. The JZI agency needs to pick up sales, as it is at the bottom; it should run a promotional marketing campaign, or we should evaluate a tie-up with an alternate agency, and reward points or discounts can also be provided accordingly.
Since our model has an accuracy of approximately 80%, insurance can be cross-sold at the time of selling or purchasing airline tickets, based on the predicted claim pattern, to increase profit.
We can also say that more claims are processed through airlines than through travel agencies, even though sales are higher at travel agencies.
Further goals: increase customer satisfaction and reduce claim-handling costs.

