Predictive Modelling Project - Business Report

1) The document describes a predictive modelling project to predict customer churn for a telecom company using logistic regression, KNN, and Naive Bayes models. 2) Exploratory data analysis found some variables like data usage and customer service calls were skewed, and removed correlated variables like monthly charges. 3) The KNN model performed best with a sensitivity of 81% at predicting churn, outperforming logistic regression and Naive Bayes. 4) Key recommendations include improving customer service to reduce repeat calls and obtaining more representative data for more accurate insights.

Uploaded by

gagan verma
Copyright
© All Rights Reserved
Predictive Modelling Project

Submitted by Gagan Verma


Problem Statement
Customer churn is a pressing problem for telecom companies. In this project,
we simulate one such case using data on post-paid customers with a contract.
The data covers customer usage behaviour, contract details and payment
details, and indicates which customers cancelled their service. Based on this
past data, we need to build a model that can predict whether a customer will
cancel their service in the future.

1. Data Ingestion: Read the dataset. Compute descriptive statistics, check for
null values and multicollinearity, and write an inference on the results.
2. Data Split: Split the data into train and test sets, then build Logistic
Regression, KNN and Naive Bayes models.
3. Performance Metrics: Check the performance of the models using the
confusion matrix.
4. Final Model: Compare all the models and write an inference on which model
is best/optimized.
5. Inference: Based on these predictions, what are the business insights and
recommendations?

Variable Description

Variable          Description
Churn             1 if customer cancelled service, 0 if not
Account Weeks     number of weeks customer has had active account
ContractRenewal   1 if customer recently renewed contract, 0 if not
Data Plan         1 if customer has data plan, 0 if not
Data Usage        gigabytes of monthly data usage
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
DayCalls          average number of daytime calls
Monthly Charge    average monthly bill
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes

Exploratory Data Analysis
1. Check Data Structure
The data contains 11 variables and 3333 observations; all variables are
numeric.

Now let us look at the summary of the dataset.


Comparing the median and mean values in the summary table suggests that Data
Usage and CustServCalls are skewed. We will plot the data to investigate
further.
1. 14% of customers cancelled their service, which is a sizeable share.
2. Account Weeks, DayMins, DayCalls, Monthly Charge, OverageFee and
RoamMins are approximately normally distributed.
3. 90% of customers renewed their contract recently.
4. 72% of customers do not have a data plan.
5. Data Usage and CustServCalls are skewed.
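
The mean-versus-median check used above can be sketched in a few lines of Python. This is only an illustration: the call counts below are hypothetical and are not taken from the project data, and the variable names are assumptions.

```python
from statistics import mean, median

# Hypothetical customer-service call counts (NOT the project data).
cust_serv_calls = [0, 1, 1, 1, 2, 2, 3, 9]

avg = mean(cust_serv_calls)
mid = median(cust_serv_calls)

# A mean well above the median suggests a long right tail (positive skew).
right_skewed = avg > mid
print(f"mean={avg:.3f}, median={mid}, right-skewed={right_skewed}")
```

A single large value (9 calls) pulls the mean above the median, which is the pattern the summary table suggests for Data Usage and CustServCalls.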

2. Check Missing Values

There are no missing values in the data set.


3. Plot data to see the distribution

Univariate Analysis (Box Plot) for continuous variables


Boxplot of Account Weeks Variable

The Account Weeks variable has outliers.


Boxplot of Data Usage Variable

Data Usage is skewed and has outliers.


Boxplot of CustServCalls Variable

The CustServCalls variable is skewed and has outliers.

Boxplot of DayMins Variable

The DayMins variable has outliers on both sides.


Boxplot of DayCalls Variable

DayCalls has outliers on both sides.

Boxplot of Monthly Charge Variable

The Monthly Charge variable has outliers.


Boxplot of OverageFee Variable

OverageFee has outliers on both sides.

Boxplot of RoamMins Variable

RoamMins has outliers on both sides.
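
The points a box plot draws as outliers are usually those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule follows, assuming this common convention. The report does not state which plotting tool or quartile method was used, and the data and function name here are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR whiskers of a box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical daytime-minute values (illustrative only, NOT the project data).
day_mins = [5, 150, 160, 170, 175, 180, 185, 190, 200, 350]
print(iqr_outliers(day_mins))
```

In this made-up sample the rule flags one very low and one very high value, which is what "outliers on both sides" means in the box plots above.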


Bar chart of churn variable

483 of the 3333 customers cancelled their service, which gives a churn ratio
of 14.5%.
Bar chart of data plan variable

2411 customers do not have a data plan.

Bar chart of Contract Renewal variable

3010 customers have renewed their contract.


Box Plot Churn vs CustServCalls

The CustServCalls variable plays a significant role in customer churn:
customers who churned made a large number of service calls, which may be
the reason for the churn.
Box Plot Churn vs Account Weeks

The Account Weeks variable does not appear to play a significant role in
customer churn.


Box Plot Churn vs Data Usage

The Data Usage variable does not appear to play a significant role in
customer churn.

Box Plot Churn vs DayMins

The DayMins variable does not appear to play a significant role in customer
churn.


Box Plot Churn vs DayCalls

The DayCalls variable does not appear to play a significant role in customer
churn.

Box Plot Churn vs Monthly Charge

The Monthly Charge variable does not appear to play a significant role in
customer churn.


Box Plot Churn vs OverageFee

The OverageFee variable does not appear to play a significant role in
customer churn.

Box Plot Churn vs RoamMins

The RoamMins variable does not appear to play a significant role in customer
churn.


Plot for Data Plan vs Churn

The Data Plan variable does not appear to play a significant role in customer
churn.

Plot for Contract Renewal vs Churn

The Contract Renewal variable does not appear to play a significant role in
customer churn.


4. Check for multicollinearity and its treatment.

1. Churn does not seem to be highly correlated with any of the variables.

2. Monthly Charge is highly correlated with Data Usage, Data Plan and
DayMins.

3. Data Usage and Data Plan are highly correlated.

The dataset is free from multicollinearity after removing the variables
Monthly Charge and Data Usage. All the VIF values are quite low once these
two variables are removed, so multicollinearity no longer affects the dataset.
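
As a rough illustration of what VIF measures: the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the others. With exactly two predictors, R^2 is simply the squared Pearson correlation between them. The sketch below covers only that two-predictor special case and is not the full VIF calculation used in the report. The data and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r2 = pearson_r(x, y) ** 2
    return math.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

# Hypothetical values (NOT the project data): usage is zero without a plan.
data_plan = [0, 0, 1, 1, 0, 1]
data_usage = [0.0, 0.0, 2.7, 3.1, 0.0, 1.9]
print(round(vif_two_predictors(data_plan, data_usage), 2))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the other; values above roughly 5 to 10 are commonly treated as a sign of problematic multicollinearity.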

Churn ratio
Of the 3333 records taken for analysis, 483 are churn cases.
Thus, the churn ratio is 14.5%.
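
The percentages quoted in this report follow directly from the counts shown in the bar charts. A quick check, using only the counts stated in the report (the variable names are assumptions):

```python
total = 3333          # observations in the dataset
churned = 483         # customers who cancelled their service
no_data_plan = 2411   # customers without a data plan
renewed = 3010        # customers who recently renewed their contract

churn_pct = round(100 * churned / total, 1)
no_plan_pct = round(100 * no_data_plan / total, 1)
renewed_pct = round(100 * renewed / total, 1)
print(churn_pct, no_plan_pct, renewed_pct)
```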

Split the data (70:30) into Train and Test


We have divided the dataset into train and test sets in a 70:30 ratio.
The train data has a 14% churn ratio.
The test data has a 14% churn ratio.
Observation - The dependent variable has almost equal representation in the
training and testing sets.
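
One way to get equal churn representation in both sets is a stratified split, where each class is split 70:30 separately. The report does not say whether the split was stratified or purely random, so the sketch below is an assumption. The function name and seed are hypothetical, and the labels are rebuilt from the reported counts only (483 churn cases out of 3333).

```python
import random

def stratified_split(labels, train_frac=0.7, seed=42):
    """Split row indices so each class keeps about the same share in both sets."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx

# Labels rebuilt from the reported counts: 483 churners, 2850 non-churners.
labels = [1] * 483 + [0] * 2850
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(train_rate, 3), round(test_rate, 3))
```

With these counts both sets end up with a churn ratio of about 14.5%, consistent with the 14% reported for train and test.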
Model Building

Logistic Regression:
Logistic regression gives the best result when the irrelevant and correlated
variables identified by the VIF analysis, i.e. Data Usage and Monthly Charge,
are excluded.

KNN Model

Normalize the data first, as KNN is distance-based and therefore sensitive to
the scale of the variables.


Run the KNN model with k = 3, 5, 7 and 9.
KNN works best with k = 5, with an accuracy of 90%.
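
The two steps above, normalization followed by a nearest-neighbour vote, can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: min-max scaling is one common choice of normalization, the function names are hypothetical, and the rows are made-up [CustServCalls, DayMins] pairs.

```python
import math
from collections import Counter

def min_max_scale(rows):
    """Rescale every column to the [0, 1] range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]

def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training rows closest to the query."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical [CustServCalls, DayMins] rows and churn labels (NOT project data).
rows = [[0, 120], [1, 150], [1, 180], [2, 200], [1, 160],
        [5, 300], [6, 280], [4, 310], [7, 250], [5, 330]]
churn = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
new_customer = [6, 290]

# For brevity the query is scaled together with the training rows; in practice
# the scaling would be fitted on the training data only.
scaled = min_max_scale(rows + [new_customer])
train_x, query = scaled[:-1], scaled[-1]
prediction = knn_predict(train_x, churn, query, k=5)
print(prediction)
```

Without scaling, DayMins (hundreds) would dominate the distance and CustServCalls (single digits) would barely count, which is why normalization matters for KNN.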

Naïve Bayes Model


The Naïve Bayes model assumes that the predictor variables are independent of
each other given the class. The given data contains correlated variables.
Hence, we remove the correlated variables, i.e. Data Usage and Monthly
Charge, before fitting Naïve Bayes.
Confusion Matrix

Confusion Matrix for logistic regression:

Confusion Matrix for KNN:


Confusion Matrix for Naïve Bayes:

Model Comparison

Parameters          Logistic Regression   KNN    Naïve Bayes
Accuracy            85%                   90%    86%
Sensitivity         15%                   81%    57%
Specificity         97%                   91%    88%
Balanced Accuracy   56%                   86%    73%
Total Loss          49%                   19%    42%

As the data is imbalanced, accuracy is not a reliable measure of model
performance, so we use sensitivity as the main measure. The table above
shows that the KNN model has the highest sensitivity of all the models, so we
can say that KNN is the best model for this dataset.
The KNN model also has the highest balanced accuracy and the lowest total
loss, which further indicates that it performs best on this dataset.
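
To make the point about accuracy concrete, the sketch below computes the standard confusion-matrix metrics. The confusion-matrix counts are hypothetical (a baseline that never predicts churn on a 1000-row test set with 145 churners) and are not one of the report's models. The second part only re-derives balanced accuracy from the sensitivity and specificity values in the table. Function and variable names are assumptions.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # share of actual churners caught
    specificity = tn / (tn + fp)   # share of non-churners correctly kept
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced = (sensitivity + specificity) / 2
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "balanced_accuracy": balanced}

# Hypothetical baseline that never predicts churn (NOT one of the report's models).
baseline = classification_metrics(tp=0, fn=145, fp=0, tn=855)
print(baseline)

# Balanced accuracy re-derived from the sensitivity/specificity in the table.
reported = {"Logistic Regression": (0.15, 0.97),
            "KNN": (0.81, 0.91),
            "Naive Bayes": (0.57, 0.88)}
balanced = {name: (sens + spec) / 2 for name, (sens, spec) in reported.items()}
print(balanced)
```

The baseline reaches 85.5% accuracy while catching no churners at all, which is why sensitivity and balanced accuracy are the more informative measures here.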
Inference
Customer service calls emerge as an important variable: customers who make a
large number of calls to the company tend to leave. The company needs to
improve its customer service so that customers' issues are resolved on the
call as soon as possible and customers do not need to call again and again
with the same queries.

These cannot be final recommendations, as the models were built on a small
dataset. We need to ask the business for more data, gathered with the right
sampling methodology so that it represents their overall business numbers,
for further analysis.
