
Predictive Modelling: Cellphone Customer Churn Prediction

Compiled By -
Monil Jhaveri
Rahul Joshi
Sapna Mehta
Suraj Kadam

Problem Description:
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of
clients or customers. Telephone service companies, Internet service providers, pay-TV companies, insurance
firms, and alarm-monitoring services often use customer attrition analysis and attrition rates as key business
metrics, because the cost of retaining an existing customer is far less than that of acquiring a new one.
Companies in these sectors often have customer-service branches that attempt to win back defecting clients,
because recovered long-term customers can be worth much more to a company than newly recruited ones.
Companies usually distinguish between voluntary churn and involuntary churn. Voluntary churn occurs when
the customer decides to switch to another company or service provider; involuntary churn occurs due to
circumstances such as a customer's relocation to a long-term care facility, death, or a move to a distant
location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts
tend to concentrate on voluntary churn, because it typically arises from aspects of the company-customer
relationship that the company controls, such as how billing interactions are handled or how after-sales support
is provided.
Predictive analytics uses churn prediction models that estimate each customer's propensity to churn. Because
these models generate a small, prioritized list of potential defectors, they are effective at focusing customer
retention marketing programmes on the subset of the customer base that is most vulnerable to churn.

Project Objective:
In this project, we simulate one such case of customer churn, working with data on post-paid customers with a
contract. The data contains information about customer usage behaviour, contract details and payment details,
and also indicates which customers cancelled their service. Based on this past data, we need to build a model
that can predict whether a customer will cancel their service in the future.
1. Data preparation: basic EDA, outliers, summary statistics, relationships between independent variables, etc.
2. Apply Logistic Regression, KNN and Naive Bayes models to the dataset
3. Check various performance measures and find the best model for this problem

Data Description:
Below are the variables from the Cellphone file used to predict the Churn variable, along with their
descriptions and types.

Variable          Description                                          Variable Type

Churn             1 if customer cancelled service, 0 if not            Categorical
AccountWeeks      number of weeks customer has had active account      Continuous
ContractRenewal   1 if customer recently renewed contract, 0 if not    Categorical
DataPlan          1 if customer has data plan, 0 if not                Categorical
DataUsage         gigabytes of monthly data usage                      Continuous
CustServCalls     number of calls into customer service                Continuous
DayMins           average daytime minutes per month                    Continuous
DayCalls          average number of daytime calls                      Continuous
MonthlyCharge     average monthly bill                                 Continuous
OverageFee        largest overage fee in last 12 months                Continuous
RoamMins          average number of roaming minutes                    Continuous

Exploratory Data Analysis (EDA):

Variable Names & Dimensions:

There are a total of 3,333 observations and 11 variables


Data Summary:

Check missing values:

Check for unique values

Observations:

 We can see that the given dataset has 3,333 observations and 11 variables
 All the variables loaded are numeric, which is not right; a few are categorical (Churn, ContractRenewal
and DataPlan), and we convert these to factors
 Looking at the summary, there appear to be outliers
 There are no missing data points in the dataset provided
 Of the 11 variables, 1 is the target and 10 are independent. Among them, Churn, ContractRenewal and
DataPlan have two unique values each; CustServCalls has 10 unique values
Univariate Analysis

Boxplot:

Observations:

From the above diagram, we can observe:

 18 outlier values in AccountWeeks
 11 outlier values in DataUsage
 A very large number of outlier values in CustServCalls
 25 outlier values in DayMins
 23 outlier values in DayCalls
 Many outlier values in MonthlyCharge, OverageFee and RoamMins
 Every continuous variable has outliers
 CustServCalls has the largest number of outliers (we may need to convert it to a factor later in the
report, after building the model and verifying its performance); MonthlyCharge and RoamMins also
contain large numbers of outliers
# Customer churn comparison

> countchurn = as.numeric(as.matrix(table(customerchurn$Churn))[2])
> propchurn = as.numeric(as.matrix(prop.table(table(customerchurn$Churn))[2]))
> cat("No. of customers who cancelled the service:", toString(countchurn), "(",
+     round(propchurn*100, digits = 2), "%)\n")
No. of customers who cancelled the service: 483 (14.49 %)

> countnotchurn = as.numeric(as.matrix(table(customerchurn$Churn))[1])
> propnotchurn = as.numeric(as.matrix(prop.table(table(customerchurn$Churn))[1]))
> cat("No. of customers who did not cancel the service:", toString(countnotchurn), "(",
+     round(propnotchurn*100, digits = 2), "%)\n")
No. of customers who did not cancel the service: 2850 (85.51 %)
Analyze the dependent variable - Churn

1. Barplot for Dependent variable – Churn

2. Checking distribution for continuous variables

 From the histograms, we can determine the distribution of each continuous variable.
 Excluding DataUsage, all remaining variables are approximately normally distributed.
Bi-Variate Analysis:

 Checking the frequency distribution for categorical and discrete variables

Observations on independent variables

1. From the bar plot for DataPlan, it can be observed that the majority of churned customers did not have a
data plan (403 out of 483 cancellations, i.e. 83.44%).

2. A sizeable share of customers made at least one call to customer service (1,181 out of 3,333 customers,
i.e. 35.43%).
Compute correlation matrix

From the correlation plot, we can observe high correlation between the following pairs of variables:

1. DataPlan <-> DataUsage

2. DataPlan <-> MonthlyCharge

3. MonthlyCharge <-> DayMins

The dependent variable Churn is negatively correlated with the variable ContractRenewal


Multi-collinearity

After EDA and treatment of outliers in the data, we look at multi-collinearity among the independent
variables, using the correlation matrix of the numeric variables to ascertain it. We remove the dependent
variable and the factor variables (Churn, ContractRenewal and DataPlan) and run the correlation matrix on the
remaining variables, with the output as follows.

From the above matrix, we see that MonthlyCharge correlates strongly with DataUsage (78%) and DayMins
(57%), and more weakly with OverageFee (28%) and RoamMins (11%).

Hence, we will consider dropping this variable when running logistic regression on the revised model.
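The correlation step can be sketched in R as follows (a sketch; the data frame name Cellphone is taken from
the outlier-treatment code, and corrplot is one common choice for the visual plot):

# Correlation matrix on the numeric variables only
numericVars = Cellphone[, !(names(Cellphone) %in% c("Churn", "ContractRenewal", "DataPlan"))]
corMatrix = round(cor(numericVars), 2)
print(corMatrix)
# corrplot::corrplot(corMatrix)   # visual correlation plot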

# Treating outliers (illustrated here for AccountWeeks)

> qnt1 = quantile(Cellphone$AccountWeeks, probs = c(.25, .75), na.rm = TRUE)
> WhiskerLength = 1.5 * IQR(Cellphone$AccountWeeks)
> upperCap = qnt1[2] + WhiskerLength
> Cellphone$AccountWeeks[which(Cellphone$AccountWeeks > upperCap)] = upperCap
> boxplot(Cellphone$AccountWeeks, horizontal = TRUE)

For every variable with outliers, values beyond the boxplot whiskers (computed from the .25 and .75
quantiles as Q1 - 1.5*IQR and Q3 + 1.5*IQR) are capped at the corresponding whisker. Hence, every outlier
has been treated before running the logistic regression model.
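The same logic can be wrapped in a small helper and applied to every continuous variable; a minimal sketch
(the function name and the variable list are ours, not from the original code):

# Hypothetical helper: cap values beyond the boxplot whiskers
capOutliers = function(x) {
  qnt = quantile(x, probs = c(.25, .75), na.rm = TRUE)
  whisker = 1.5 * IQR(x, na.rm = TRUE)
  x[x < qnt[1] - whisker] = qnt[1] - whisker   # cap low outliers at the lower whisker
  x[x > qnt[2] + whisker] = qnt[2] + whisker   # cap high outliers at the upper whisker
  x
}

contVars = c("AccountWeeks", "DataUsage", "DayMins", "DayCalls",
             "MonthlyCharge", "OverageFee", "RoamMins")
Cellphone[contVars] = lapply(Cellphone[contVars], capOutliers)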

Logistic Regression

The problem statement involves predicting whether a customer will cancel their service in the future. Hence,
we use logistic regression to predict the output as a categorical variable.

This is a binomial logistic regression model in which we first use Churn as the dependent/output variable,
with all other variables as independent variables.

We also use the concept of training and testing data: we treat 70% of the data as training data, with the
remaining 30% acting as testing data so we can assess how good or bad the model is. A sketch of this split and
the initial fit follows.
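A minimal sketch of the split and the initial fit (the seed value and object names are our assumptions;
caTools::sample.split keeps the churn proportion equal across the two sets):

library(caTools)
set.seed(123)                                   # assumed seed, for reproducibility
split = sample.split(Cellphone$Churn, SplitRatio = 0.70)
trainData = subset(Cellphone, split == TRUE)
testData  = subset(Cellphone, split == FALSE)

# Binomial logistic regression with all independent variables
logitFull = glm(Churn ~ ., data = trainData, family = binomial)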

We now generate the logistic regression model as defined above using the training data and conduct the
following steps to test its effectiveness and accuracy.
I) Logistic Regression Model with all variables

1) We check the validity of the overall model through a likelihood ratio test. The null hypothesis is that
the logistic regression model is not valid; the alternative hypothesis is that it is.
Since the resulting p-value is less than alpha (0.05 at a 95% confidence level), we reject the null
hypothesis and conclude that the model is valid. A sketch of this test is shown below.
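This test can be sketched in base R by comparing the fitted model against the intercept-only (null) model
(lmtest::lrtest(logitFull) reports the same test):

# Likelihood ratio test: chi-square on the drop in deviance vs the null model
chiSq  = logitFull$null.deviance - logitFull$deviance
dfDiff = logitFull$df.null - logitFull$df.residual
pValue = pchisq(chiSq, df = dfDiff, lower.tail = FALSE)
pValue                         # reject the null hypothesis when this is below 0.05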

2) We use McFadden's R-square to determine the predictive strength of the model generated. We
typically interpret the McFadden R-square value as follows:

McFadden R-square value   Model predictive capability

>0.1 and <0.2             Acceptable
>0.2 and <0.3             Good
>0.3 and <0.4             Very Good
>0.4                      Excellent

Based on the McFadden R-square value that we obtain (0.195), we judge the model to be acceptable
(close to good).
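For a binary logistic model, McFadden's R-square can be computed directly from the deviances (a sketch;
pscl::pR2(logitFull) reports the same quantity under the name "McFadden"):

# McFadden's R-square = 1 - logLik(model)/logLik(null model)
mcFadden = 1 - (logitFull$deviance / logitFull$null.deviance)
mcFadden                       # ~0.195 for the full model in this report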

3) Check the significance of each variable in the logistic regression model generated.

We check the p-value for each variable, where the null hypothesis is that the independent variable is
not significant in predicting the dependent variable. Hence, if the p-value is less than alpha (0.05), we
treat the variable as significant, and vice versa. Please see the output of the model summary below.
From this output, the variables that look significant in the logit model are ContractRenewal,
CustServCalls and RoamMins. In business terms, whether the customer has recently renewed their
contract, the number of customer service calls made and the average number of roaming minutes have a
significant influence on whether the customer will cancel the service, with the first two factors being
the most significant. While we do not immediately rule out all variables with a p-value greater than
0.05, we remove them one by one (starting with the highest p-value) and keep revising the model to see
whether any new variables become significant. We do this subsequently when we revise the logistic
regression model with only the relevant variables.

4) The most important step now is to predict the probabilities of customer churn using the training data
and then assess the performance on the testing data. For this, we generate predicted values from the
logistic regression model and convert them to 0 if below the cut-off and 1 if above it. We then create a
confusion matrix comparing the actual values of customer churn against those predicted by the model.
Typically, we start with a cut-off of 0.5 and then revise it depending on business requirements. We get
the following confusion matrix.

            Predicted
              0     1
Actual  0   842    24
        1   103    31

Accuracy: (842+31)/(842+24+103+31) = 87.3%

While accuracy is quite high at 87.3%, the model is not yet good for our purpose, since the aim is to
maximise correct predictions of users who cancel the service. We use sensitivity to measure exactly
that, specificity to measure the correctly predicted customers who did not cancel the service, and
precision to measure the correctly predicted cancellations out of all predicted cancellations.

Sensitivity: 31/(103+31) = 23.1%
Specificity: 842/(842+24) = 97.2%
Precision: 31/(31+24) = 56.4%
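The prediction step and the measures above can be sketched as follows (a sketch; the cut-off is 0.5 and the
object names follow the earlier split/fit sketch):

# Predicted churn probabilities on the test data, converted at the cut-off
predProb  = predict(logitFull, newdata = testData, type = "response")
predClass = ifelse(predProb > 0.5, 1, 0)

# Confusion matrix: rows = actual, columns = predicted
cm = table(Actual = testData$Churn, Predicted = predClass)

TN = cm["0", "0"]; FP = cm["0", "1"]
FN = cm["1", "0"]; TP = cm["1", "1"]

(TP + TN) / sum(cm)            # accuracy (87.3% at cut-off 0.5 in the report)
TP / (TP + FN)                 # sensitivity (23.1%)
TN / (TN + FP)                 # specificity (97.2%)
TP / (TP + FP)                 # precision (56.4%)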

From the above, we see that sensitivity is very low, and our aim from the problem statement is to
maximise it while sacrificing accuracy only slightly. Hence, we adjust the cut-off, analyse the
confusion matrix at a cut-off of 0.8 instead of 0.5, and recalculate the same values.

            Predicted
              0     1
Actual  0   724   142
        1    37    97

Accuracy: 82.1%
Sensitivity: 72.4%
Specificity: 83.6%
Precision: 40.6%
While we have compromised slightly on accuracy (82.1% is still good), we have increased sensitivity
to 72.4%, significantly improving the model's usefulness for the business problem at hand.

5) We now plot the ROC curve (sensitivity against 1 - specificity) to again assess the goodness of fit
of the logit model. An area under the ROC curve above 0.7 is considered good, with larger values being
better. We obtain an area under the ROC curve of 0.847, which shows the model to be a good fit and an
accurate indicator for the business problem, i.e. the ability to predict which customers will cancel
the service in the future.
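The ROC curve and its area can be sketched with the ROCR package (one common choice; predProb is the
probability vector from the earlier prediction sketch):

library(ROCR)
rocPred = prediction(predProb, testData$Churn)
rocPerf = performance(rocPred, "tpr", "fpr")
plot(rocPerf)                                  # sensitivity vs 1 - specificity
performance(rocPred, "auc")@y.values[[1]]      # ~0.847 for the full model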

II) Logistic Regression model with relevant variables only


We first remove MonthlyCharge, as it had the highest multi-collinearity with the other independent
variables. We then examine the summary of the logit model and remove, one at a time, the variables
whose p-values exceed alpha; we end up removing DataUsage, DayCalls and AccountWeeks as insignificant
variables, as sketched below.
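A sketch of this one-at-a-time removal (the drop order follows the variables named above):

# Start without MonthlyCharge, then prune the insignificant variables
logitRevised = glm(Churn ~ . - MonthlyCharge, data = trainData, family = binomial)
summary(logitRevised)                          # inspect p-values after each step
logitRevised = update(logitRevised, . ~ . - DataUsage)
logitRevised = update(logitRevised, . ~ . - DayCalls)
logitRevised = update(logitRevised, . ~ . - AccountWeeks)
summary(logitRevised)                          # all remaining variables significant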
Hence our revised model takes into consideration the following independent variables

Variable          Description
ContractRenewal   1 if customer recently renewed contract, 0 if not
DataPlan          1 if customer has data plan, 0 if not
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes

1) We check multi-collinearity in the revised model using the Variance Inflation Factor (VIF), where a
VIF below 5 is commonly taken to indicate low multi-collinearity. We see that in our model all
variables have a VIF below 2, confirming low multi-collinearity.
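A sketch of the VIF check (car::vif is the usual choice):

library(car)
vif(logitRevised)              # all values below 2 => low multi-collinearity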
2) Checking the validity of the overall model through the likelihood ratio test: as before, the null
hypothesis is that the logistic regression model is not valid, with the alternative being that it is.
Since the resulting p-value is less than alpha (0.05 at a 95% confidence level), we reject the null
hypothesis and conclude that the revised model is valid.

3) We again use McFadden's R-square to determine the predictive strength of the revised model,
interpreted on the same scale as in the table given earlier. Based on the value we obtain (0.194), we
judge the revised model to be acceptable (close to good).

4) Since we have now kept only significant variables, we verify this by looking at the summary of the
revised logistic regression model. All variables now have p-values below alpha (0.05) and are
therefore all significant for our logistic regression model.
5) The most important step, again, is to predict the probabilities of customer churn using the training
data and then assess the performance on the testing data. We generate predicted values from the revised
model, convert them to 0 if below the cut-off and 1 if above it, and create a confusion matrix
comparing actual customer churn against the model's predictions. We again start with a cut-off of 0.5
and then revise it depending on the output versus what the business statement requires. We get the
following confusion matrix at a cut-off of 0.5.

            Predicted
              0     1
Actual  0   841    25
        1   105    29

Accuracy: 87%
Sensitivity: 21.6%
Specificity: 97.1%
Precision: 53.7%

From the above, we again see that sensitivity is very low. As before, we aim to maximise it while
sacrificing accuracy only slightly, so we analyse the confusion matrix at a cut-off of 0.8 instead of
0.5 and recalculate the same values.

            Predicted
              0     1
Actual  0   720   146
        1    36    98

Accuracy: 81.8%
Sensitivity: 73.1%
Specificity: 83.1%
Precision: 40.2%

While we have compromised slightly on accuracy (81.8% is still good), we have increased sensitivity to
73.1%, significantly improving the model for the business problem at hand. The revised model achieves a
slightly higher sensitivity than the full model.

6) We again plot the ROC curve (sensitivity against 1 - specificity) to assess the goodness of fit of
the revised logit model. We obtain an area under the ROC curve of 0.845, which again shows the model to
be a good fit and an accurate indicator for the business problem, i.e. the ability to predict which
customers will cancel the service in the future.
Applying and Interpreting KNN Model:
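The console output below can be produced by a fit along these lines (a sketch: the value of k and the
scaling choice are our assumptions, not taken from the report):

library(class)                 # knn()
library(caret)                 # confusionMatrix()

# KNN is distance-based, so scale the numeric predictors
numCols = names(trainData)[sapply(trainData, is.numeric)]
trainX  = scale(trainData[, numCols])
testX   = scale(testData[, numCols],
                center = attr(trainX, "scaled:center"),
                scale  = attr(trainX, "scaled:scale"))

knn.pred  = knn(train = trainX, test = testX, cl = trainData$Churn, k = 5)  # assumed k
table.knn = table(testData$Churn, knn.pred)
confusionMatrix(table.knn)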

> table.knn
knn.pred
0 1
0 1016 18
1 132 46
> sum(diag(table.knn)/sum(table.knn))
[1] 0.8762376
> confusionMatrix(table.knn)
Confusion Matrix and Statistics

knn.pred
0 1
0 1016 18
1 132 46

Accuracy : 0.8762
95% CI : (0.8564, 0.8943)
No Information Rate : 0.9472
P-Value [Acc > NIR] : 1

Kappa : 0.328

Mcnemar's Test P-Value : <2e-16

Sensitivity : 0.8850
Specificity : 0.7188
Pos Pred Value : 0.9826
Neg Pred Value : 0.2584
Prevalence : 0.9472
Detection Rate : 0.8383
Detection Prevalence : 0.8531
Balanced Accuracy : 0.8019

'Positive' Class : 0

From the above result, we can see that the maximum accuracy achieved is 87.62%

Graphical representation of KNN:

The model also performs well on the test dataset in terms of sensitivity and accuracy. The area under the
curve is approximately 84.53%.
Applying and Interpreting Naive Bayes Model:
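The output below can be produced with the e1071 package along these lines (a sketch using the formula
interface on the significant variables; the report's own call used the x/y interface, as its printed output
shows):

library(e1071)                 # naiveBayes()
library(caret)

NB.model = naiveBayes(Churn ~ ContractRenewal + DataPlan + CustServCalls +
                        DayMins + OverageFee + RoamMins,
                      data = trainData)
predNB = predict(NB.model, newdata = testData)
tab.NB = table(testData$Churn, predNB)
confusionMatrix(tab.NB)
print(NB.model)                # a-priori and conditional probabilities, shown below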

> confusionMatrix(tab.NB)
Confusion Matrix and Statistics

predNB
0 1
0 840 26
1 98 36

Accuracy : 0.876
95% CI : (0.854, 0.8958)
No Information Rate : 0.938
P-Value [Acc > NIR] : 1

Kappa : 0.3087

Mcnemar's Test P-Value : 1.818e-10

Sensitivity : 0.8955
Specificity : 0.5806
Pos Pred Value : 0.9700
Neg Pred Value : 0.2687
Prevalence : 0.9380
Detection Rate : 0.8400
Detection Prevalence : 0.8660
Balanced Accuracy : 0.7381

'Positive' Class : 0

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
0 1
0.8504072 0.1495928

Conditional probabilities:
ContractRenewal
Y 0 1
0 0.06502016 0.93497984
1 0.28653295 0.71346705

DataPlan
Y 0 1
0 0.7076613 0.2923387
1 0.8280802 0.1719198

CustServCalls
Y [,1] [,2]
0 1.449597 1.162743
1 2.209169 1.873692
DayMins
Y [,1] [,2]
0 175.7338 49.78610
1 204.1471 69.59277

OverageFee
Y [,1] [,2]
0 9.952092 2.521949
1 10.543324 2.575289

RoamMins
Y [,1] [,2]
0 10.16522 2.740958
1 10.70458 2.650197

 We can see the a-priori probabilities of the Churn values in the training dataset: 85% and 15%.
 The conditional probabilities are read as the probability of a predictor value given Churn. For a
categorical predictor such as ContractRenewal, the table gives P(ContractRenewal = x | Churn = y); for
example, P(ContractRenewal = 1 | Churn = 1) is approximately 0.71. For continuous predictors such as
DayMins, the two columns are the conditional mean and standard deviation.
 Naive Bayes requires specifying these per-variable conditional distributions, but it does not require
specifying the joint distribution of the predictors.

Interpretation of other Model Performance Measures for logistic regression (KS, AUC, GINI):

After running the performance-measure code, the following values are obtained for KS, AUC and GINI, from
which we can see that performance is somewhat better on the test dataset than on the training dataset after
treating the outliers.

Performance Measure   Train Score   Test Score

KS                    0.5108        0.5901
AUC                   0.8094        0.8453
GINI                  0.5047        0.509
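These measures can be sketched with ROCR (predProb as in the earlier sketches; note that Gini is often
defined as 2*AUC - 1, which does not match the GINI values in the table above, so the report likely used a
different formulation, e.g. ineq::ineq on the predictions):

library(ROCR)
predTest = prediction(predProb, testData$Churn)

auc  = performance(predTest, "auc")@y.values[[1]]
perf = performance(predTest, "tpr", "fpr")
ks   = max(perf@y.values[[1]] - perf@x.values[[1]])   # KS = max(TPR - FPR)
gini = 2 * auc - 1                                    # one common Gini definition

c(KS = ks, AUC = auc, GINI = gini)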
If the company works out better offer plans around the significant variables (ContractRenewal, DataPlan,
CustServCalls, DayMins, OverageFee and RoamMins), there will be a high possibility of retaining customers.

Model Comparison:

Performance Measure   LR (Train)   LR (Test)   KNN (Test)   NB (Test)

Accuracy              87%          81.8%       90%          87%
Sensitivity           21.6%        73.1%       90.60%       89%
Specificity           97.1%        83.1%       79%          58%

Confusion matrices (rows = actual 0/1, columns = predicted 0/1):

LR (Train)       LR (Test)        KNN (Test)       NB (Test)
 824   24         720  146         1041   13        840   26
 103   31          36   98          108   50         98   36
Comparing the Logistic Regression, KNN and NB models, we observe that the sensitivity of the KNN model is
the highest, at about 90%; however, when we examine the confusion matrices, the Logistic Regression model
seems more accurate, with a higher proportion of true negatives than the other models. The business goal is
to predict customers who may cancel the service.
Business Interpretation

We see that the logistic regression model is made a better fit by adjusting the cut-off value. The
validity and fitness of the model are good, as ascertained above, and the significant variables are
identified using the p-values generated by the model, helping us focus on the factors most important
for retaining customers.
From our model, we see that recency of contract renewal, the number of calls made to customer service
and average daytime minutes during the month are the most significant variables for customer churn, so
we can focus on these factors to retain more customers.
We can offer better data plans and offers on calls made during the day, and improve customer service,
perhaps through a chat feature on a user-friendly website/app, so that customers need to make fewer
calls to customer service and keep renewing their contracts on a continuous basis.
