Customer Churn Prediction
Compiled By -
Monil Jhaveri
Rahul Joshi
Sapna Mehta
Suraj Kadam
Problem Description:
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of
clients or customers. Telephone service companies, Internet service providers, pay-TV companies, insurance
firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one
of their key business metrics, because the cost of retaining an existing customer is far less than that of
acquiring a new one. Companies from these sectors often have customer service branches which attempt to win
back defecting clients, because recovered long-term customers can be worth much more to a company than
newly recruited clients. Companies usually distinguish between voluntary churn and involuntary churn.
Voluntary churn occurs when a customer decides to switch to another company or service provider; involuntary
churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or a
move to a distant location. In most applications, involuntary churn is excluded from the analytical models.
Analysts tend to concentrate on voluntary churn, because it typically stems from factors of the
company-customer relationship that companies control, such as how billing interactions are handled or how
after-sales support is provided.
Predictive analytics uses churn prediction models that estimate each customer's propensity to churn. Because
these models generate a small, prioritized list of likely defectors, they are effective at focusing customer
retention marketing programs on the subset of the customer base most vulnerable to churn.
Project Objective:
In this project, we simulate one such case of customer churn, working on data of post-paid customers with a
contract. The data contains information about customer usage behaviour, contract details and payment details,
and also indicates which customers cancelled their service. Based on this past data, we need to build a model
that can predict whether a customer will cancel their service in the future.
1. Data preparation: basic EDA, outlier treatment, summary statistics, relationships between independent variables, etc.
2. Apply Logistic Regression, KNN and Naive Bayes models to the dataset
3. Check various performance measures and identify the best model for this problem
Data Description:
Below are the variables from the Cellphone file that we use to predict the Churn variable, along with their
descriptions.
Observations:
We can see that the given dataset has 3333 observations and 11 variables.
All variables are loaded as numeric, which is not correct: Churn, ContractRenewal and DataPlan are
categorical, so we convert them into categorical variables (factors).
Looking at the summary, several variables appear to have outliers.
There are no missing data points in the dataset provided.
Of the 11 variables, 1 is the target and 10 are independent variables. Among them, Churn, ContractRenewal
and DataPlan each have two unique values, while CustServCalls has 10 unique values.
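The report's analysis is in R; as a minimal Python/pandas sketch of the same recoding step, the binary flags loaded as numbers can be converted to a categorical dtype (column names follow the data description above; the row values here are illustrative, not the actual data):

```python
import pandas as pd

# Hypothetical frame mirroring the Cellphone data: binary flags load as
# numeric, so we recode them as categorical before modelling.
df = pd.DataFrame({
    "Churn": [0, 1, 0, 0],
    "ContractRenewal": [1, 1, 0, 1],
    "DataPlan": [0, 1, 1, 0],
    "DayMins": [265.1, 161.6, 243.4, 299.4],
})

for col in ["Churn", "ContractRenewal", "DataPlan"]:
    df[col] = df[col].astype("category")

print(df.dtypes)  # the three flags are now category, DayMins stays float
```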
Univariate Analysis
Boxplot:
Observations:
From the histograms, we can determine the distribution of the continuous variables.
Excluding DataUsage, all remaining variables are approximately normally distributed.
Bi-Variate Analysis:
1. From the bar plot for DataPlan, it can be observed that the majority of the customers who cancelled did not
have a data plan (403 out of 483 cancellations, i.e. 83.44%).
2. A sizeable share of customers made at least one phone call to customer service (1181 out of 3333 customers,
i.e. 35.43%).
Compute correlation matrix
From the correlation plot, we can observe high correlation between several variables.
After EDA and treatment of outliers, we examine multi-collinearity among the independent variables using the
correlation matrix of the numeric variables. We remove the dependent variable as well as the factor variables,
i.e. Churn, ContractRenewal and DataPlan, and run the correlation matrix on the remaining variables, with the
output as follows.
From the matrix, we see that MonthlyCharge is highly correlated with DataUsage (78%) and DayMins (57%),
and more weakly with OverageFee (28%) and RoamMins (11%).
Hence, we will consider dropping this variable when running logistic regression on the revised model.
We have capped values at the 25th percentile on the lower side and the 75th percentile on the upper side for
all variables having outliers. Hence, every outlier has been treated before running the logistic regression model.
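The percentile capping described above can be sketched as follows (pure Python with a nearest-rank percentile; the report's R implementation may use a different quantile rule, and the input values here are illustrative):

```python
def cap_outliers(values, lower_q=0.25, upper_q=0.75):
    # Values below the lower percentile are raised to it and values above
    # the upper percentile are lowered to it, as described in the report.
    ordered = sorted(values)
    n = len(ordered)
    # simple nearest-rank percentiles for the sketch
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

print(cap_outliers([1, 2, 3, 4, 100]))  # extreme value 100 is pulled down
```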
Logistic Regression
The problem statement involves predicting whether a customer will cancel their service in the future.
Hence, we use logistic regression to predict the outcome as a categorical variable.
This is a binomial logistic regression model in which Customer Churn is the dependent/output variable and all
other variables are independent variables.
We also use the concept of training and testing data in our logistic regression analysis: 70% of the data is used
as training data and the remaining 30% as testing data, so that we can evaluate how good or bad the model is.
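The 70/30 partition can be sketched in pure Python as follows (the seed is illustrative; the report's R split will differ, but note that 30% of 3333 rows gives a test set of roughly 1000 observations, consistent with the confusion matrices in this report):

```python
import random

def train_test_split(rows, train_frac=0.7, seed=42):
    # Shuffle a copy of the rows and split 70/30, mirroring the
    # report's train/test partition.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(3333)))
print(len(train), len(test))  # 2333 1000
```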
We now fit the logistic regression model as defined above on the training data and conduct the following
steps to test the effectiveness and accuracy of the model.
I) Logistic Regression Model with all variables
1) We check the validity of the overall model through a likelihood ratio test. The null hypothesis is that the
logistic regression model is not valid; the alternative hypothesis is that the model is valid.
Based on the p-value below, which is less than alpha (0.05 at a 95% confidence level), we reject the
null hypothesis and conclude that the model is valid.
2) We use McFadden's R-squared to determine the predictive strength of the model. We typically interpret
the McFadden's R-squared value as follows.
Based on the McFadden R-squared value that we obtain (0.195), we judge the model acceptable
(close to good).
3) The most important step now is to predict the probabilities of customer churn using the training data and
then assess performance on the testing data. We generate predicted probabilities from our logistic
regression model and convert them to 0 if below the cut-off and 1 if above it. We then create a
confusion matrix to compare the actual values of customer churn with those predicted by our model.
Typically, we start with a cut-off of 0.5 and then revise it depending on business requirements. We get
the following confusion matrix.
            Predicted
              0     1
Actual  0   842    24
        1   103    31
While accuracy is quite high at 87.3%, the model is not very good, since our aim is to maximise the
correct predictions of users who cancel the service.
We use sensitivity to measure this, specificity to measure the correctly predicted customers who did
not cancel the service, and precision to measure the correctly predicted cancellers out of all customers
predicted to cancel.
From the above we see that sensitivity is very low, while our aim from the problem statement is to
maximise it at a small cost to accuracy. Hence, we revise the cut-off and analyse the confusion matrix
at a cut-off of 0.8 instead of 0.5, recalculating the same measures.
            Predicted
              0     1
Actual  0   724   142
        1    37    97
Accuracy: 82.1%
Sensitivity: 72.4%
Specificity: 83.6%
Precision: 40.6%
While we have compromised slightly on accuracy (82.1% is still good), we have increased sensitivity to
72.4%, significantly improving the model with respect to the business problem at hand.
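The four measures above follow directly from the confusion matrix. As a check, the sketch below (pure Python; churn = 1 is the positive class) recomputes them from the 0.8 cut-off matrix given in the report:

```python
def churn_metrics(tp, tn, fp, fn):
    # Standard confusion-matrix measures as used in the report,
    # with churn (1) as the positive class.
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # churners correctly flagged
        "specificity": tn / (tn + fp),   # non-churners correctly kept
        "precision": tp / (tp + fp),     # flagged customers who truly churn
    }

# Confusion matrix from the report at the 0.8 cut-off.
m = churn_metrics(tp=97, tn=724, fp=142, fn=37)
print({k: round(v, 3) for k, v in m.items()})
# accuracy 0.821, sensitivity 0.724, specificity 0.836, precision 0.406
```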
4) We now plot the ROC curve (sensitivity against 1 - specificity) to again assess the goodness of fit of
the logit model. An area under the ROC curve greater than 0.7 is considered good, with higher values
being better. We obtain an area under the ROC curve of 0.847, which shows the model is a good fit
and an accurate indicator for the business problem, i.e. the ability to predict whether customers will
cancel the service in the future.
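The area under the ROC curve has a useful probabilistic reading: it equals the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner (ties counting half). A sketch on toy labels and scores (not the report's data):

```python
def auc(labels, scores):
    # AUC computed as the probability that a random positive outscores a
    # random negative (ties count half) -- equivalent to the trapezoidal
    # area under the ROC curve.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```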
II) Logistic Regression Model with significant variables

Variables
ContractRenewal   1 if customer recently renewed contract, 0 if not
DataPlan          1 if customer has data plan, 0 if not
CustServCalls     number of calls made to customer service
DayMins           average daytime minutes per month
OverageFee        largest overage fee in the last 12 months
RoamMins          average number of roaming minutes per month
1) We check multi-collinearity on the revised model using the Variance Inflation Factor (VIF), where a
VIF value below 5 is generally taken to indicate low multi-collinearity (note that VIF is always at least 1).
We see that in our model all variables have a VIF below 2, confirming low multi-collinearity.
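For intuition, in the two-predictor case the VIF reduces to 1 / (1 - r^2), where r is the Pearson correlation between the predictors. A sketch with toy numbers (not the report's data; the general multi-predictor VIF regresses each predictor on all the others):

```python
import math

def pairwise_vif(x, y):
    # Two-predictor VIF = 1 / (1 - r^2); values near 1 mean low
    # multi-collinearity, values above ~5 flag a problem.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    r = cov / (sx * sy)
    return 1.0 / (1.0 - r ** 2)

# r = 0.8 here, so VIF = 1 / (1 - 0.64)
print(round(pairwise_vif([1, 2, 3, 4], [1, 3, 2, 4]), 3))  # 2.778
```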
2) We again check the validity of the overall model through a likelihood ratio test. The null hypothesis is
that the logistic regression model is not valid; the alternative hypothesis is that the model is valid.
Based on the p-value below, which is less than alpha (0.05 at a 95% confidence level), we reject the
null hypothesis and conclude that the model is valid.
3) We use McFadden's R-squared to determine the predictive strength of the revised model, interpreted
as before.
Based on the McFadden R-squared value that we obtain (0.194), we judge the model acceptable
(close to good).
4) Since we now include only significant variables, we verify this by looking at the summary of the
revised logistic regression model. All variables have a p-value below alpha (0.05) and are therefore
considered significant for our logistic regression model.
5) The most important step again is to predict the probabilities of customer churn using the training data
and then assess performance on the testing data. We generate predicted probabilities from the
revised model and convert them to 0 if below the cut-off and 1 if above it, then create a confusion
matrix comparing the actual values of customer churn with those predicted by our model. We again
start with a cut-off of 0.5 and revise it depending on what the business problem requires. We get the
following confusion matrix at a cut-off of 0.5.
            Predicted
              0     1
Actual  0   841    25
        1   105    29
Accuracy: 87%
Sensitivity: 21.6%
Specificity: 97.1%
Precision: 53.7%
From the above we again see that sensitivity is very low, while our aim from the problem statement is
to maximise it at a small cost to accuracy. Hence, we revise the cut-off again and analyse the
confusion matrix at a cut-off of 0.8 instead of 0.5, recalculating the same measures.
            Predicted
              0     1
Actual  0   720   146
        1    36    98
Accuracy: 81.8%
Sensitivity: 73.1%
Specificity: 83.1%
Precision: 40.2%
While we have compromised slightly on accuracy (81.8% is still good), we have increased sensitivity to
73.1%, significantly improving the model with respect to the business problem at hand. The revised
model achieves a slightly higher sensitivity than the full model.
6) We now plot the ROC curve (sensitivity against 1 - specificity) to again assess the goodness of fit of
the logit model. An area under the ROC curve greater than 0.7 is considered good, with higher values
being better. We obtain an area under the ROC curve of 0.845, which shows the revised model is also
a good fit and an accurate indicator for the business problem, i.e. the ability to predict whether
customers will cancel the service in the future.
Applying and Interpreting KNN Model:
> table.knn
knn.pred
0 1
0 1016 18
1 132 46
> sum(diag(table.knn)/sum(table.knn))
[1] 0.8762376
> confusionMatrix(table.knn)
Confusion Matrix and Statistics
knn.pred
0 1
0 1016 18
1 132 46
Accuracy : 0.8762
95% CI : (0.8564, 0.8943)
No Information Rate : 0.9472
P-Value [Acc > NIR] : 1
Kappa : 0.328
Sensitivity : 0.8850
Specificity : 0.7188
Pos Pred Value : 0.9826
Neg Pred Value : 0.2584
Prevalence : 0.9472
Detection Rate : 0.8383
Detection Prevalence : 0.8531
Balanced Accuracy : 0.8019
'Positive' Class : 0
The model performs well on the test dataset in terms of accuracy. Note, however, that the 'positive' class in
this output is 0 (non-churn): the reported sensitivity of 88.5% refers to non-churners, while churners are
correctly detected at 71.9% (the reported specificity). The area under the curve is approximately 84.53%.
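The classifier behind the table above can be sketched as a plain k-nearest-neighbours vote on Euclidean distance (a hypothetical pure-Python illustration with toy points, not the report's R `knn` call or its tuned k):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, point, k=3):
    # Sort training rows by Euclidean distance to the query point and
    # return the majority label among the k nearest.
    dists = sorted(
        (math.dist(row, point), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy two-cluster data: class 0 near (1, 1), class 1 near (8, 8).
X = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [7.9, 8.2]]
y = [0, 0, 1, 1]
print(knn_predict(X, y, [1.1, 1.0], k=3))  # 0
```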
Applying and Interpreting Naive Bayes Model:
> confusionMatrix(tab.NB)
Confusion Matrix and Statistics
predNB
0 1
0 840 26
1 98 36
Accuracy : 0.876
95% CI : (0.854, 0.8958)
No Information Rate : 0.938
P-Value [Acc > NIR] : 1
Kappa : 0.3087
Sensitivity : 0.8955
Specificity : 0.5806
Pos Pred Value : 0.9700
Neg Pred Value : 0.2687
Prevalence : 0.9380
Detection Rate : 0.8400
Detection Prevalence : 0.8660
Balanced Accuracy : 0.7381
'Positive' Class : 0
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0 1
0.8504072 0.1495928
Conditional probabilities:
ContractRenewal
Y 0 1
0 0.06502016 0.93497984
1 0.28653295 0.71346705
DataPlan
Y 0 1
0 0.7076613 0.2923387
1 0.8280802 0.1719198
CustServCalls
Y [,1] [,2]
0 1.449597 1.162743
1 2.209169 1.873692
DayMins
Y [,1] [,2]
0 175.7338 49.78610
1 204.1471 69.59277
OverageFee
Y [,1] [,2]
0 9.952092 2.521949
1 10.543324 2.575289
RoamMins
Y [,1] [,2]
0 10.16522 2.740958
1 10.70458 2.650197
We can see the a-priori probabilities of the Churn values in the train dataset: 85% and 15%.
The conditional probability tables are read as the distribution of each predictor given Churn. For the binary
variables the two columns are probabilities: for example, given Churn = 1, P(ContractRenewal = 0) = 0.29 and
P(ContractRenewal = 1) = 0.71. For the continuous variables, the columns [,1] and [,2] are the conditional
mean and standard deviation.
Naive Bayes requires these per-variable conditional distributions but, because it assumes the predictors are
independent given the class, it does not require the joint distribution.
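Using only the a-priori probabilities and the ContractRenewal and DayMins tables printed above, the posterior churn probability can be recomputed by hand. This is a partial sketch of what `naiveBayes` does internally (it uses only two of the six predictors, so the numbers will not match the full model's predictions):

```python
import math

def gaussian(x, mean, sd):
    # Gaussian density, used by naive Bayes for continuous predictors.
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def churn_posterior(contract_renewal, day_mins):
    # Numbers taken from the model output printed above.
    prior = {0: 0.8504072, 1: 0.1495928}
    p_renewal = {0: {0: 0.06502016, 1: 0.93497984},
                 1: {0: 0.28653295, 1: 0.71346705}}
    day_mins_dist = {0: (175.7338, 49.78610), 1: (204.1471, 69.59277)}
    score = {}
    for c in (0, 1):
        mean, sd = day_mins_dist[c]
        score[c] = prior[c] * p_renewal[c][contract_renewal] * gaussian(day_mins, mean, sd)
    return score[1] / (score[0] + score[1])

# Not renewing the contract raises the churn posterior, all else equal.
print(churn_posterior(0, 180.0) > churn_posterior(1, 180.0))  # True
```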
Interpretation of other Model Performance Measures for logistic regression (KS, AUC, GINI):
After running the code, the values below are obtained for KS, AUC and Gini, from which we can see that the
performance measures on the test dataset are relatively better than on the train dataset after treating the outliers.
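These measures are related: the KS statistic is the maximum gap between the cumulative distributions of scores for churners and non-churners (equivalently, the maximum of TPR - FPR across cut-offs), and Gini = 2 * AUC - 1. A sketch on toy labels and scores, assuming no tied scores (not the report's data):

```python
def ks_and_gini(labels, scores):
    # Walk the scores in descending order, tracking the true- and
    # false-positive rates; KS is the largest TPR - FPR gap, and the
    # rectangle sums give the AUC, from which Gini = 2 * AUC - 1.
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tpr = fpr = ks = auc = prev_fpr = 0.0
    for _, y in pairs:
        if y == 1:
            tpr += 1 / n_pos
        else:
            fpr += 1 / n_neg
            auc += tpr * (fpr - prev_fpr)
            prev_fpr = fpr
        ks = max(ks, tpr - fpr)
    return ks, 2 * auc - 1

print(ks_and_gini([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # (0.5, 0.5)
```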
Model Comparison:
We see that the logistic regression model is made a better fit by tuning the cut-off value.
The validity and fitness of the model are good, as shown above, and the significant variables identified
through the model's p-values help us focus on the factors most important for retaining customers.
From our model, we see that recency of contract renewal, the number of calls made to customer service,
and average daytime minutes per month are the most significant drivers of customer churn, so we can
focus on these factors to retain more customers.
We can offer better data plans and offers on daytime calls, and improve customer service, perhaps
through a chat feature on a user-friendly website/app, so that customers need to make fewer calls to
customer service and keep renewing their contracts.