Project Report - Data Mining
1. Problem 1: Clustering
   1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2. Problem 2: CART-RF-ANN
   2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
   2.2 Data Split: Split the data into test and train; build the classification models CART, Random Forest, and Artificial Neural Network
   2.4 Final Model: Compare all the models and write an inference on which model is best/optimized
   2.5 Inference: Based on the whole analysis, what are the business insights and recommendations
Problem 1: Clustering
A leading bank wants to develop a customer
segmentation to give promotional offers to its customers.
They collected a sample that summarizes the activities of
users during the past few months. You are given the task
to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).
From the head and tail of the data (the first and last records), we can say the data looks healthy; nothing unusual shows up in the initial records.
The dataset consists of 210 rows and 7 columns, so we have 7 different attributes, all with the same datatype (float).
There are no null entries and no duplicate records.
As all columns are numeric, the summary statistics are presented for each of them: count, mean, std, min, 25%, 50%, 75%, and max.
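A minimal sketch of these initial checks, assuming the data has been read with pandas (the filename credit_card_data.csv is an assumption):

```python
import pandas as pd

# Assumed filename; adjust to the actual dataset file
df = pd.read_csv("credit_card_data.csv")

print(df.head())              # first few records
print(df.tail())              # last few records
print(df.shape)               # expected: (210, 7)
df.info()                     # all 7 attributes should be float
print(df.isnull().sum())      # null counts per column
print(df.duplicated().sum())  # number of duplicate rows
print(df.describe().T)        # count, mean, std, min, 25%, 50%, 75%, max
```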
Univariate Analysis
From the above plots, only min_payment_amt and probability_of_full_payment have outliers.
I am also checking the lower limit, upper limit, IQR, and the percentage of outliers present for each variable (see the sketch after the variable list below).
The limits and outlier percentages were computed for: spending, advance_payments, probability_of_full_payment, current_balance, credit_limit, min_payment_amt, and max_spent_in_single_shopping.
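A sketch of the IQR-based limit and outlier-percentage calculation applied to each of these variables (the 1.5 × IQR rule is assumed, continuing from the DataFrame df above):

```python
# Lower/upper limits via the 1.5 * IQR rule, plus the percentage of outliers
def outlier_summary(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    pct_outliers = ((series < lower) | (series > upper)).mean() * 100
    return lower, upper, iqr, pct_outliers

for col in df.columns:
    lower, upper, iqr, pct = outlier_summary(df[col])
    print(f"{col}: lower={lower:.3f}, upper={upper:.3f}, IQR={iqr:.3f}, outliers={pct:.2f}%")
```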
Here we have outliers in two columns, min_payment_amt and probability_of_full_payment (above the upper limit), as seen with both the box plots and the IQR calculation.
Multivariate analysis
Check for multicollinearity
Here we can see both negative and positive correlations (see the sketch after the list). The strong positive correlations are between:
- spending and advance_payments
- spending and current_balance
- spending and credit_limit
- advance_payments and current_balance
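A sketch of how this multicollinearity check can be produced with a pairplot and a correlation heatmap (standard seaborn/matplotlib calls, continuing from df above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise scatter plots for the bivariate/multivariate view
sns.pairplot(df, diag_kind="kde")
plt.show()

# Correlation matrix to check multicollinearity
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```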
For now, we are not dropping the outlier values; instead of dropping them, we will replace them with their respective medians. The mean gets affected by outliers, so the median is the better option for treating them.
Only two variables have outliers, so treating them (rather than dropping rows) is the best option and we will not lose the other relevant information, which also seems important.
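A sketch of the median-based treatment, assuming values outside the 1.5 × IQR limits are replaced with the column median:

```python
import numpy as np

# Replace values outside the IQR limits with the column median
for col in ["min_payment_amt", "probability_of_full_payment"]:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    median = df[col].median()
    df[col] = np.where((df[col] < lower) | (df[col] > upper), median, df[col])
```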
Now most of the outliers have been treated, and our data is good to go for further analysis.
1.2- Do you think scaling is necessary for clustering in this case?
Justify.
Scaling needs to be done because the variables have values on very different scales. Scaling brings all values into the same range, which is more convenient for further analysis and leaves the data in a cleaner, more consistent form.
Standardization converts the data in our frequency distribution so that each variable has a mean of 0 and a standard deviation of 1. This normalization removes the effect of differing magnitudes and ensures that good quality clusters are generated, which can improve the efficiency of clustering algorithms. So it becomes an essential step before clustering, as Euclidean distance is very sensitive to differences in scale and all dimensions should be equally important.
Here I am using the z-score to standardize the data to the same relative scale of roughly -3 to +3.
Data before and after scaling:
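A sketch of the standardization step (scipy's zscore is shown; sklearn's StandardScaler gives the same result):

```python
from scipy.stats import zscore

# Standardize every column to mean 0 and standard deviation 1
scaled_df = df.apply(zscore)
print(scaled_df.describe().T)   # means ~0, std ~1, values roughly within -3 to +3
```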
1.3 Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
Here I am using all three approaches (sketched after the list below):
1. Linkage method
2. Ward linkage method
3. Agglomerative clustering
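A sketch of the three approaches on the scaled data; the linkage used in the first approach is an assumption (average linkage shown), and 3 clusters are cut from the tree as discussed below:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering

# 1. Linkage method (average linkage assumed) with its dendrogram
link_avg = linkage(scaled_df, method="average")
dendrogram(link_avg, truncate_mode="lastp", p=10)
plt.show()

# 2. Ward linkage method with its dendrogram
link_ward = linkage(scaled_df, method="ward")
dendrogram(link_ward, truncate_mode="lastp", p=10)
plt.show()

# Cut the Ward tree into 3 clusters (labels 1, 2, 3)
df["clusters_linkage"] = fcluster(link_ward, 3, criterion="maxclust")

# 3. Agglomerative clustering from sklearn (labels start at 0)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
df["clusters_agglo"] = agg.fit_predict(scaled_df)

# Profile the clusters on the original (unscaled) variables
print(df.groupby("clusters_linkage").mean())
```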
Here I have shown the results for all the approaches; we can see there is not much difference. As we know, when different approaches are used, minor variations occur.
For cluster grouping based on the dendrograms, 3 clusters look good. This gives us a solution based on spending (high, medium, low).
We have cluster 1 as the highest spending, cluster 2 as medium spending, and cluster 3 as the lowest spending with the linkage methods, and cluster 0 as the lowest spending with Agglomerative clustering.
1.4 Apply K-Means clustering on scaled data and determine optimum
clusters. Apply elbow curve and silhouette score. Explain the results
properly. Interpret and write inferences on the finalized clusters.
Silhouette score.
From the above elbow curve and the silhouette scores, 3 to 4 is the optimal number of clusters.
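A sketch of the elbow curve (within-cluster sum of squares) and silhouette-score computation behind this choice (the random_state is an assumption):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow curve: inertia (within-cluster sum of squares) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)
plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# Silhouette scores for k = 2..6 (higher is better)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(scaled_df)
    print(k, round(silhouette_score(scaled_df, labels), 3))
```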
Here I am going with 3-group clustering via K-Means, as it makes sense based on the spending pattern (high, medium, low).
3 group cluster via hierarchical clustering
Cluster Profile:
Group1: Highest Spending
Group2: Medium Spending
Group3: Lowest Spending.
Promotional strategies for different clusters:
Group1: Highest Spending Group
• Group 1 customers spend more money, and their advance payments are high compared to the other two clusters, so these people are the main target.
• As the advance payments are also high, increase the credit limit and give loans on their credit cards, as they are customers with good payment records.
• Giving reward points might attract them and increase purchases.
• Also, providing discounted offers on the next transaction for a one-time full payment will be beneficial, as max_spent_in_single_shopping is high.
Group2: Medium Spending Group
• These are potential target customers who are paying their bills, making purchases, and maintaining a good credit score, so here we can increase the credit limit.
• Also, providing some discounts/offers will increase purchases.
• Of the 3 clusters, this group also has the 2nd highest advance payments, so here too we can recommend giving loans on their credit cards.
-----------------------------------------------------------------------------------------
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim
frequency. The management decides to collect data from the past few
years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF &
ANN and compare the models' performances in train and test sets.
Attribute Information:
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
2.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).
• There are a total of 3000 rows and 10 columns.
• No null entries are present.
• Age, Commission, Duration, and Sales have numeric datatypes; the rest have the object datatype.
• There are 9 independent variables and 1 target variable (Claimed).
Here, we have negative entries, which we can say might be wrong entries.
Getting unique values for the categorical variables (a sketch of these checks follows):
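A sketch of these checks, assuming the insurance data has been read into a DataFrame named ins_df (the filename is an assumption):

```python
import pandas as pd

# Assumed filename; adjust to the actual dataset file
ins_df = pd.read_csv("insurance_data.csv")

# Numeric columns that should not be negative
num_cols = ["Age", "Commission", "Duration", "Sales"]
print((ins_df[num_cols] < 0).sum())     # count of negative entries per column

# Unique values for each categorical (object) variable
for col in ins_df.select_dtypes(include="object").columns:
    print(col, ":", ins_df[col].unique())

# Duplicate-row check
print("duplicates:", ins_df.duplicated().sum())
```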
Check for duplicates:
Here there are outliers in all the numeric variables, as Sales and Commission can have extreme values. The Random Forest and CART models can handle this, so we are not treating the outliers now; we will treat them while building the ANN model.
Categorical Variables:
Checking pairwise distribution of the continuous variables:
Checking for Correlations:
Here we can say there is no strong correlation between most of the variables; only Sales and Commission have a correlation of 0.77. As sales increase, commission also increases.
Again checking info(): all object datatypes have been converted to a numeric datatype (int).
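A sketch of one common way to do this conversion, using pandas category codes (label encoding would work the same way):

```python
# Convert every object column to integer category codes
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes

ins_df.info()   # all columns should now show an integer dtype
```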
All the required libraries have been imported.
Extracting the target column into separate vectors for the training set and the test set.
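A sketch of the target extraction and split; the 70:30 ratio and random_state are assumptions:

```python
from sklearn.model_selection import train_test_split

X = ins_df.drop("Claimed", axis=1)   # 9 independent variables
y = ins_df["Claimed"]                # target: claim status

# Assumed 70:30 split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
```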
Data after scaling:
Generating a Tree:
http://webgraphviz.com/
Looking at the feature importances above, the model depends mostly on Agency_Code (63.41%) and Sales (22%).
Model Evaluation
AUC and ROC for the training data
AUC and ROC for the Testing Data
Confusion Matrix for training data-dtcl
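A sketch of the evaluation calls behind these figures for the CART model dtcl; the same pattern is repeated on the test set:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_auc_score, roc_curve,
                             confusion_matrix, classification_report)

# AUC and ROC for the training data
train_probs = dtcl.predict_proba(X_train)[:, 1]
print("train AUC:", roc_auc_score(y_train, train_probs))
fpr, tpr, _ = roc_curve(y_train, train_probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()

# Confusion matrix, accuracy, precision, recall, F1 for the training data
train_pred = dtcl.predict(X_train)
print(confusion_matrix(y_train, train_pred))
print(classification_report(y_train, train_pred))

# Repeat with X_test / y_test for the test-set figures
```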
Cart Conclusion:
Train Data:
AUC:82%
Accuracy:79%
Precision:70%
F1-score:60%
Test Data:
AUC:80%
Accuracy:77%
Precision:67%
F1-score:58%
The training and test set results are very close, and with the overall measures reasonably high, this is a good model. Agency_Code is the most important variable for predicting whether insurance is claimed.
Predicting the Training and Testing data:
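A sketch of how the Random Forest model might be fit before these predictions; the hyperparameters are placeholders (the report's tuned values, e.g. from a grid search, are not shown here):

```python
from sklearn.ensemble import RandomForestClassifier

# Placeholder hyperparameters; the tuned values may differ
rfcl = RandomForestClassifier(n_estimators=300, max_depth=6,
                              min_samples_leaf=10, random_state=1)
rfcl.fit(X_train, y_train)

ytrain_predict_rf = rfcl.predict(X_train)
ytest_predict_rf = rfcl.predict(X_test)
```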
RF Model Performance Evaluation on Training data:
RF Model Performance Evaluation on Test data:
Random Forest Conclusion
Train Data:
AUC:86%
Accuracy:80%
Precision:71%
F1-score:65%
Test Data:
AUC:82%
Accuracy:77%
Precision:66%
F1-score:59%
The training and test set results are very close, and with the overall measures reasonably high, this is a good model. Agency_Code is the most important variable for predicting whether insurance is claimed.
Predicting the Training and Testing data:
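A sketch of the ANN step; the features are scaled first (as noted earlier, outliers are treated for the ANN), and sklearn's MLPClassifier with placeholder settings is assumed:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Neural networks are sensitive to feature magnitude, so scale first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Placeholder architecture/settings; the tuned values may differ
nncl = MLPClassifier(hidden_layer_sizes=(100,), max_iter=5000, random_state=1)
nncl.fit(X_train_scaled, y_train)

ytrain_predict_nn = nncl.predict(X_train_scaled)
ytest_predict_nn = nncl.predict(X_test_scaled)
```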
NN Model Performance Evaluation on Test data:
Neural Network Conclusion:
Train Data:
AUC:82%
Accuracy:78%
Precision:68%
F1-score:59%
Test Data:
AUC:80%
Accuracy:77%
Precision:67%
F1-score:57%
The training and test set results are very close, and with the overall measures reasonably high, this is a good model.
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
Summarizing the figures reported in the three model conclusions above:

Metric       CART (Train/Test)   RF (Train/Test)    NN (Train/Test)
AUC          82% / 80%           86% / 82%          82% / 80%
Accuracy     79% / 77%           80% / 77%          78% / 77%
Precision    70% / 67%           71% / 66%          68% / 67%
F1-score     60% / 58%           65% / 59%          59% / 57%

All three models generalize well (the train and test figures are close). The Random Forest has the highest AUC on both the train (86%) and test (82%) data and matches or beats the other two models on accuracy and F1-score, so Random Forest is the best/most optimized model here, with CART a close second.
ROC Curve for the 3 models on the Test data:
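A sketch of how the three test-set ROC curves can be drawn on one chart, reusing the dtcl, rfcl, and nncl objects from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

models = {"CART": (dtcl, X_test),
          "Random Forest": (rfcl, X_test),
          "Neural Network": (nncl, X_test_scaled)}

for name, (model, features) in models.items():
    probs = model.predict_proba(features)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```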