Telecom Customer Churn
Salma Mohiuddin
24/11/2019
Introduction
Customer churn is a burning problem for telecom companies. In this project, we simulate
one such case of customer churn, working with data on postpaid customers who are on a
contract. The data captures the customers' usage behaviour, contract details and payment
details, and also indicates which customers cancelled their service. Based on this past
data, we need to build a model that can predict whether or not a customer will cancel
their service in the future.
library(psych)
##
## Attaching package: 'psych'
library(DataExplorer)
library(car)
##
## Attaching package: 'car'
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
library(purrr)
##
## Attaching package: 'purrr'
library(caret)
##
## Attaching package: 'caret'
library(contrast)
##
## Attaching package: 'survival'
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## Attaching package: 'SparseM'
##
## Attaching package: 'rms'
library(miscset)
##
## Attaching package: 'miscset'
library(ROCR)
##
## Attaching package: 'gplots'
library(class)
library(ggplot2) # needed for the ggplot() calls below
library(corrplot) # needed for the corrplot() calls below
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
MyOrigData = read.csv("CellphoneCSV.csv",header=TRUE)
attach(MyOrigData)
head(MyOrigData)
tail(MyOrigData)
dim(MyOrigData)
## [1] 3333 11
describe(MyOrigData)
## MyOrigData
##
## 11 Variables 3333 Observations
## ---------------------------------------------------------------------------
## Churn
## n missing distinct Info Sum Mean Gmd
## 3333 0 2 0.372 483 0.1449 0.2479
##
## ---------------------------------------------------------------------------
## AccountWeeks
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 212 1 101.1 45.01 35 50
## .25 .50 .75 .90 .95
## 74 101 127 152 167
##
## lowest : 1 2 3 4 5, highest: 221 224 225 232 243
## ---------------------------------------------------------------------------
## ContractRenewal
## n missing distinct Info Sum Mean Gmd
## 3333 0 2 0.263 3010 0.9031 0.1751
##
## ---------------------------------------------------------------------------
## DataPlan
## n missing distinct Info Sum Mean Gmd
## 3333 0 2 0.6 922 0.2766 0.4003
##
## ---------------------------------------------------------------------------
## DataUsage
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 174 0.839 0.8165 1.202 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 1.78 3.05 3.46
##
## lowest : 0.00 0.11 0.12 0.13 0.14, highest: 4.59 4.64 4.73 4.75 5.40
## ---------------------------------------------------------------------------
## CustServCalls
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 10 0.932 1.563 1.392 0 0
## .25 .50 .75 .90 .95
## 1 1 2 3 4
##
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 697 1181 759 429 166 66 22 9 2 2
## Proportion 0.209 0.354 0.228 0.129 0.050 0.020 0.007 0.003 0.001 0.001
## ---------------------------------------------------------------------------
## DayMins
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 1667 1 179.8 61.46 89.92 110.32
## .25 .50 .75 .90 .95
## 143.70 179.40 216.40 249.58 270.74
##
## lowest : 0.0 2.6 7.8 7.9 12.5, highest: 335.5 337.4 345.3 346.8 350.8
## ---------------------------------------------------------------------------
## DayCalls
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 119 1 100.4 22.59 67.0 74.2
## .25 .50 .75 .90 .95
## 87.0 101.0 114.0 126.0 133.0
##
## lowest : 0 30 35 36 40, highest: 157 158 160 163 165
## ---------------------------------------------------------------------------
## MonthlyCharge
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 627 1 56.31 18.35 33.26 38.00
## .25 .50 .75 .90 .95
## 45.00 53.50 66.20 80.50 87.80
##
## lowest : 14.0 15.7 16.0 17.0 19.0, highest: 108.3 108.6 108.7 110.0 111.3
## ---------------------------------------------------------------------------
## OverageFee
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 1024 1 10.05 2.86 5.94 6.84
## .25 .50 .75 .90 .95
## 8.33 10.07 11.77 13.29 14.22
##
## lowest : 0.00 1.56 2.11 2.13 2.20, highest: 17.55 17.58 17.71 18.09 18.19
## ---------------------------------------------------------------------------
## RoamMins
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 162 1 10.24 3.114 5.7 6.7
## .25 .50 .75 .90 .95
## 8.5 10.3 12.1 13.7 14.7
##
## lowest : 0.0 1.1 1.3 2.0 2.1, highest: 18.2 18.3 18.4 18.9 20.0
## ---------------------------------------------------------------------------
Univariate Analysis
. Plotting histograms for all the variables in the dataset.
. The independent variables "AccountWeeks", "DayCalls", "DayMins", "MonthlyCharge",
"OverageFee" and "RoamMins" are approximately normally distributed.
. "CustServCalls" is right-skewed: most customers make only a few calls, while a small
number make many more calls to customer service, which points to their level of
dissatisfaction.
. "DataUsage" is roughly normal around the middle, with a spike at zero indicating that a
majority of customers use little or no data and may be getting their data service from
outside this provider.
plot_histogram(MyOrigData, title = "Histogram of all the variables")
. Also, from the bar chart below, customers who have not renewed their contract are much
more likely (about 74%) to churn than those who have, so we should try to get renewals
done through a follow-up in order to retain these customers.
ggplot(MyOrigData) +
  geom_bar(aes(x = ContractRenewal, fill = factor(Churn)), position = "dodge")
. And customers who do not have a data plan are more likely (about 20%) to churn than
those who have one, so some focus should also go to offers that retain these existing
customers at relatively low cost.
ggplot(MyOrigData) +
  geom_bar(aes(x = DataPlan, fill = factor(Churn)), position = "dodge")
. Similarly, customers who have made more calls to customer service are the most
dissatisfied with the service, so the likelihood of churning increases with the number of
calls.
. On average, customers who made 4 calls were about 50% likely to churn, and customers
making 9 calls were 100% likely to churn, so it is important to address the concerns
driving this dissatisfaction in order to retain these customers.
ggplot(MyOrigData) +
geom_bar(aes(x = CustServCalls, fill = Churn), position = "dodge")
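. To back up these percentages, a short dplyr summary (a sketch, not code from the
original report) can compute the observed churn rate for each number of customer-service
calls; Churn is still stored as 0/1 integers at this point, so mean(Churn) gives the rate.
MyOrigData %>%
  group_by(CustServCalls) %>%
  summarise(Customers = n(), ChurnRate = round(mean(Churn), 2))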
Multivariate Analysis
. The correlation matrix shows the correlation coefficient between each pair of variables;
each cell in the table corresponds to one pair. From the matrix below:
. DataPlan and DataUsage, both independent variables, are highly positively correlated
with each other.
. MonthlyCharge is highly positively correlated with DataPlan and DataUsage, moderately
positively correlated with DayMins, and weakly positively correlated with OverageFee and
RoamMins.
. Churn is weakly positively correlated with CustServCalls and DayMins, and weakly
negatively correlated with ContractRenewal.
corrplot(cor(MyOrigData))
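. As a numeric complement to the plot (a sketch, not from the original report), caret's
findCorrelation() flags variables whose pairwise correlation exceeds a chosen cutoff; all
columns are still numeric at this point, so cor() runs on the full data frame.
HighCor = findCorrelation(cor(MyOrigData), cutoff = 0.75)
names(MyOrigData)[HighCor]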
Check for any missing values
. None of the variables have missing values, so no action is required; the dataset is
complete.
MyOrigData %>% map(~ sum(is.na(.)))
## $Churn
## [1] 0
##
## $AccountWeeks
## [1] 0
##
## $ContractRenewal
## [1] 0
##
## $DataPlan
## [1] 0
##
## $DataUsage
## [1] 0
##
## $CustServCalls
## [1] 0
##
## $DayMins
## [1] 0
##
## $DayCalls
## [1] 0
##
## $MonthlyCharge
## [1] 0
##
## $OverageFee
## [1] 0
##
## $RoamMins
## [1] 0
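. An equivalent one-line check (a sketch, not from the original report) gives the same
counts per column:
colSums(is.na(MyOrigData))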
Identifying Outliers
. Another useful visualization is the box-and-whisker plot. It gives a higher-level view
of the data and helps us identify outliers, if any. Let's look at box-and-whisker plots
for all the independent variables, where we can see some outliers.
. Most values are concentrated around the median; values falling well below the 1st
quartile or well above the 3rd quartile appear as outliers. In my view, these do not need
to be treated.
. The outliers on MonthlyCharge are likely customers whose usage has been higher,
resulting in some larger charges.
. A few customers have held their accounts for far more weeks than the rest, so
AccountWeeks shows some very extreme values flagged as outliers.
. We see outliers for DayMins and DayCalls as well. Moving these customers to a plan
without phone service, to save them some money on their bill, might help retain them.
boxplot(MonthlyCharge, DayMins, DayCalls,
        AccountWeeks, OverageFee, RoamMins, DataUsage,
        main = "Multiple boxplots for comparison",
        at = c(1, 2, 4, 5, 6, 7, 8),
        names = c("MonthlyCharge", "DayMins", "DayCalls",
                  "AccountWeeks", "OverageFee", "RoamMins", "DataUsage"),
        col = "red",
        border = "brown")
Variable Transformation
. We need to change the response variable "Churn" from integer to factor so we can apply
a classification model (logistic regression) to predict the customers who are likely to
churn in the future.
. The rest of the variables are retained as-is.
MyOrigData$Churn <- as.factor(MyOrigData$Churn)
Summary of Churn
. As the table below shows, 483 customers have churned in the past and 2850 have not,
i.e. about 14% of the observed customers churned. This is a sizeable imbalance in the
classes we are trying to predict.
table(Churn)
## Churn
## 0 1
## 2850 483
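. As a quick cross-check (a sketch, not from the original report), the same table can be
expressed as proportions; churners are roughly 14.5% of the 3333 customers.
round(prop.table(table(Churn)), 3)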
Correlation analysis using correlation matrix
CorMatrix = cor(MyOrigData[sapply(MyOrigData, is.numeric)])
print(CorMatrix, digits = 4)
# Reconstructed call (the code is missing in the original); it matches the Call
# reported in the output below.
summary(lm(Churn ~ ., data = MyOrigData[sapply(MyOrigData, is.numeric)]))
##
## Call:
## lm(formula = Churn ~ ., data = MyOrigData[sapply(MyOrigData,
## is.numeric)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66572 -0.16629 -0.08236 0.02060 1.08844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.433e-01 5.363e-02 -2.672 0.007580 **
## AccountWeeks 8.888e-05 1.396e-04 0.637 0.524402
## ContractRenewal -2.993e-01 1.882e-02 -15.904 < 2e-16 ***
## DataPlan -4.175e-02 4.381e-02 -0.953 0.340650
## DataUsage -2.835e-02 1.933e-01 -0.147 0.883401
## CustServCalls 5.829e-02 4.222e-03 13.804 < 2e-16 ***
## DayMins 1.021e-03 3.272e-03 0.312 0.754936
## DayCalls 3.409e-04 2.769e-04 1.231 0.218433
## MonthlyCharge 1.428e-03 1.924e-02 0.074 0.940838
## OverageFee 1.046e-02 3.280e-02 0.319 0.749780
## RoamMins 8.765e-03 2.307e-03 3.800 0.000147 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3203 on 3322 degrees of freedom
## Multiple R-squared: 0.1747, Adjusted R-squared: 0.1722
## F-statistic: 70.31 on 10 and 3322 DF, p-value: < 2.2e-16
ChkReg = lm(Churn ~ AccountWeeks + ContractRenewal + DataPlan + CustServCalls +
              DayCalls + MonthlyCharge + OverageFee + RoamMins,
            data = MyOrigData[sapply(MyOrigData, is.numeric)])
vif(ChkReg)
# Reconstructed call (the code is missing in the original): Bartlett's test of
# sphericity on the correlation matrix, which produced the output below.
cortest.bartlett(CorMatrix, n = nrow(MyOrigData))
## $chisq
## [1] 1009.556
##
## $p.value
## [1] 3.436763e-182
##
## $df
## [1] 45
# Reconstructed call (the code is missing in the original): eigen decomposition
# of the correlation matrix, used below to inspect multicollinearity.
ev = eigen(CorMatrix)
ev
## eigen() decomposition
## $values
## [1] 2.7439248478 1.2293230352 1.0817999166 1.0153187841 0.9991982241
## [6] 0.9856449649 0.9663599709 0.9315024377 0.0467728635 0.0001549552
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.01078436 -0.02615847 0.446911049 -0.36719778 -0.131136347
## [2,] 0.03287982 0.21062824 -0.427542299 -0.45672320 -0.173903976
## [3,] -0.54844367 0.30753623 -0.057661817 -0.05543676 -0.065570012
## [4,] -0.56378272 0.30151119 0.031868803 0.01096438 0.005714547
## [5,] 0.02491787 0.06585684 -0.261556961 -0.29508478 0.740921389
## [6,] -0.18723676 -0.75464372 -0.004786375 -0.25963262 0.195129328
## [7,] 0.00755354 -0.01308997 0.492682432 -0.44865945 -0.236970614
## [8,] -0.57019226 -0.27972913 -0.043344407 -0.04190411 0.018059062
## [9,] -0.10602860 -0.33487574 -0.247826777 0.36565607 -0.366139112
## [10,] -0.08980178 0.07126138 0.488538442 0.40266179 0.413097534
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.638599950 0.32007265 0.37030204 0.0076779048 2.243482e-04
## [2,] -0.146281760 -0.38518250 0.60261893 -0.0031562925 -3.387951e-04
## [3,] 0.007680696 0.09431575 -0.11330169 0.7563983438 1.812073e-03
## [4,] -0.002025028 0.02651264 -0.02870603 -0.5339087915 -5.508919e-01
## [5,] 0.423889697 -0.21500864 -0.25431578 -0.0016357041 1.566310e-04
## [6,] -0.267211784 0.09978660 0.12975195 0.1792480522 -3.995283e-01
## [7,] -0.128171959 -0.54354793 -0.43325748 -0.0008887734 -3.139859e-05
## [8,] -0.009598871 -0.04328868 0.04043845 -0.2946537862 7.086956e-01
## [9,] 0.544384795 -0.45628408 -0.04070791 0.0811571383 -1.861244e-01
## [10,] -0.082274234 -0.42177015 0.46211361 0.1311234952 3.094454e-04
EigenVector = ev$vectors
EigenValues = ev$values
plot(EigenValues,col="Red",xlab = "Factors",ylab="Eigen Values")
lines(EigenValues,col="Blue")
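. One common way to summarize these eigenvalues (a sketch, not part of the original
analysis) is the condition index, sqrt(max/min); values above about 30 are usually read
as serious multicollinearity, and the near-zero eigenvalue above makes this one very
large.
sqrt(max(EigenValues) / min(EigenValues))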
Logistic regression
. Logistic regression is a classification algorithm used to predict a binary outcome.
There are various metrics for evaluating a logistic regression model, such as the
confusion matrix and the AUC-ROC curve, which are covered further down.
. Three different models are fitted below, each dropping a few variables, to identify the
one whose variables are all highly significant, so that the less significant variables
can be set aside to arrive at the right model.
. Churn is the response variable and the others are independent variables.
. First model (Model): the intercept, ContractRenewal, CustServCalls and RoamMins appear
significant.
. Second model (Model1): besides the intercept, ContractRenewal and CustServCalls, we now
see DataPlan and MonthlyCharge appearing significant.
. Third model (Model2): the intercept, ContractRenewal, CustServCalls, DataPlan and
MonthlyCharge all appear significant. Removing the DataUsage variable has therefore given
much better-behaved coefficients.
Model = glm(Churn ~ ., data = MyOrigData, family = "binomial")
Model1 = glm(Churn ~ CustServCalls + ContractRenewal + DataPlan + DataUsage +
               MonthlyCharge, data = MyOrigData, family = "binomial")
Model2 = glm(Churn ~ CustServCalls + ContractRenewal + DataPlan + MonthlyCharge,
             data = MyOrigData, family = "binomial")
summary(Model)
##
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = MyOrigData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0058 -0.5112 -0.3477 -0.2093 2.9981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9510252 0.5486763 -10.846 < 2e-16 ***
## AccountWeeks 0.0006525 0.0013873 0.470 0.638112
## ContractRenewal -1.9855172 0.1436107 -13.826 < 2e-16 ***
## DataPlan -1.1841611 0.5363668 -2.208 0.027262 *
## DataUsage 0.3636565 1.9231751 0.189 0.850021
## CustServCalls 0.5081349 0.0389682 13.040 < 2e-16 ***
## DayMins 0.0174407 0.0324841 0.537 0.591337
## DayCalls 0.0036523 0.0027497 1.328 0.184097
## MonthlyCharge -0.0275526 0.1909074 -0.144 0.885244
## OverageFee 0.1868114 0.3256902 0.574 0.566248
## RoamMins 0.0789226 0.0220522 3.579 0.000345 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.3 on 3332 degrees of freedom
## Residual deviance: 2188.4 on 3322 degrees of freedom
## AIC: 2210.4
##
## Number of Fisher Scoring iterations: 5
summary(Model1)
##
## Call:
## glm(formula = Churn ~ CustServCalls + ContractRenewal + DataPlan +
## DataUsage + MonthlyCharge, family = "binomial", data = MyOrigData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9947 -0.5123 -0.3545 -0.2141 2.9915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.62363 0.33459 -13.819 < 2e-16 ***
## CustServCalls 0.50301 0.03872 12.990 < 2e-16 ***
## ContractRenewal -1.99467 0.14284 -13.964 < 2e-16 ***
## DataPlan -1.85246 0.49382 -3.751 0.000176 ***
## DataUsage -0.41453 0.17464 -2.374 0.017617 *
## MonthlyCharge 0.07511 0.00580 12.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.3 on 3332 degrees of freedom
## Residual deviance: 2204.2 on 3327 degrees of freedom
## AIC: 2216.2
##
## Number of Fisher Scoring iterations: 5
summary(Model2)
##
## Call:
## glm(formula = Churn ~ CustServCalls + ContractRenewal + DataPlan +
## MonthlyCharge, family = "binomial", data = MyOrigData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9540 -0.5169 -0.3546 -0.2154 3.0407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.464629 0.326144 -13.69 <2e-16 ***
## CustServCalls 0.504947 0.038701 13.05 <2e-16 ***
## ContractRenewal -1.979032 0.142307 -13.91 <2e-16 ***
## DataPlan -2.931443 0.224182 -13.08 <2e-16 ***
## MonthlyCharge 0.071188 0.005531 12.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.3 on 3332 degrees of freedom
## Residual deviance: 2209.8 on 3328 degrees of freedom
## AIC: 2219.8
##
## Number of Fisher Scoring iterations: 5
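. To compare the three fits side by side (a sketch, not from the original report), AIC()
accepts several models at once; lower is better, and the values match the AICs in the
summaries above. Note that by AIC alone the full model scores lowest, while Model2 is
preferred in the text for having only significant coefficients.
AIC(Model, Model1, Model2)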
Removing variables
. Dropping the DataUsage and DayMins variables, which are redundant and expensive to
collect.
. DataPlan and DayCalls provide much the same information, so we retain those instead, as
they are easier to gather.
myTrainData = select(MyOrigData,-DataUsage)
myTrainData = select(myTrainData,-DayMins)
table(Churn)
## Churn
## 0 1
## 2850 483
corrplot(cor(myTrainData[sapply(myTrainData, is.numeric)]))
Selecting a random seed to reproduce results
set.seed(5)
# Reconstructed split (the partitioning code is missing in the original): a
# stratified 75/25 train/test split, which matches the 2501/832 row counts below.
Index = createDataPartition(myTrainData$Churn, p = 0.75, list = FALSE)
train = myTrainData[Index, ]
test = myTrainData[-Index, ]
dim(train)
## [1] 2501 9
dim(test)
## [1] 832 9
Fitting the model
ModelFit = glm(Churn~., data=train, family=binomial)
summary(ModelFit)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9296 -0.5154 -0.3607 -0.2183 2.9254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.357608 0.595565 -8.996 <2e-16 ***
## AccountWeeks 0.001857 0.001595 1.164 0.2444
## ContractRenewal -1.960875 0.163852 -11.967 <2e-16 ***
## DataPlan -2.789980 0.268358 -10.396 <2e-16 ***
## CustServCalls 0.499149 0.044815 11.138 <2e-16 ***
## DayCalls 0.002578 0.003127 0.824 0.4097
## MonthlyCharge 0.064526 0.006834 9.442 <2e-16 ***
## OverageFee 0.022548 0.027352 0.824 0.4097
## RoamMins 0.053895 0.023234 2.320 0.0204 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2071.8 on 2500 degrees of freedom
## Residual deviance: 1667.6 on 2492 degrees of freedom
## AIC: 1685.6
##
## Number of Fisher Scoring iterations: 5
Make Predictions
. Optimizing the model by removing independent variables. After removing independent
variables from the model, we need to make sure the residual deviance does not increase,
while the AIC value should decrease.
. We then predict values for the testing dataset and assess the accuracy of the model.
The type is set to "response", which means we want predicted probabilities for the
testing dataset.
ChurnPred = predict(ModelFit, test, type="response")
head(ChurnPred)
## 2 8 10 16 17 19
## 0.05501892 0.11314444 0.30971705 0.77194855 0.13201872 0.02970964
. Looking at the response encoding
contrasts(train$Churn)
## 1
## 0 0
## 1 1
Confusion matrix
. Since we are dealing with unbalanced data, a check on raw accuracy alone is not
appropriate, so we will look at other measures of model performance.
. The model appears to perform fairly well overall, but note that if we simply predicted
the majority class (0, no churn) for every customer in the test set, accuracy would
already be about 86%.
. The response classes are clearly unbalanced.
. The sensitivity, a measure of the true positive rate (TP/(TP+FN)), is at 18%.
. The specificity, or true negative rate (TN/(TN+FP)), is 98%.
. This tells us that our model has correctly identified only 18% of the people who
actually churned.
. The positive predictive value is at 60%.
. The negative predictive value is at 88%.
# Reconstructed step (the code is missing in the original): convert the predicted
# probabilities to class labels at the 0.5 cutoff discussed below.
GlmPred = as.factor(ifelse(ChurnPred > 0.5, 1, 0))
confusionMatrix(GlmPred, as.factor(test$Churn), positive = "1")
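. For reference, a minimal sketch (not part of the original report) of how the rates
quoted above fall out of the 2x2 table; CM, TP, TN, FP and FN are hypothetical helper
names introduced here.
CM = table(pred = GlmPred, actual = test$Churn)  # rows = predicted, cols = actual
TP = CM["1", "1"]; TN = CM["0", "0"]; FP = CM["1", "0"]; FN = CM["0", "1"]
round(c(sensitivity = TP / (TP + FN), specificity = TN / (TN + FP),
        PPV = TP / (TP + FP), NPV = TN / (TN + FN)), 2)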
ROC
. Now let us examine the threshold using the ROC curve; the threshold used here is 0.5.
Random guessing gives the 45-degree baseline, so we can see that our model outperforms
random guessing, which is good. The AUC measure is 0.81, greater than the 0.5 of the
baseline model, which is also good.
ResTr = predict(ModelFit,train,type = "response")
Pred = prediction(ResTr,train$Churn)
ROCRperf = performance(Pred,"tpr","fpr")
plot(ROCRperf,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))
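. The plot above prints cutoffs along the curve; as an illustration (a sketch, not the
author's method), the ROCR performance object can also be scanned for the cutoff that
maximizes Youden's J = TPR - FPR.
Cuts = data.frame(cutoff = ROCRperf@alpha.values[[1]],
                  tpr = ROCRperf@y.values[[1]],
                  fpr = ROCRperf@x.values[[1]])
Cuts[which.max(Cuts$tpr - Cuts$fpr), ]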
AUC
. The AUC can take any value between 0 and 1, with 1 being the best. It is an easy way to
condense the ROC curve into a single number for evaluating a model. As can be seen, our
model has an AUC of 0.81, which is pretty good. If we made random guesses, our ROC curve
would be the 45-degree line, corresponding to an AUC of 0.5.
AUCPred = performance(Pred, measure = "auc")
AUCPred = AUCPred@y.values[[1]]
AUCPred
## [1] 0.8128835
. Let us see if we can further improve the model. We can take a look at the summary of
the fit and identify which features are significant (p-value < 0.05). As can be seen
below, ContractRenewal, DataPlan, MonthlyCharge and CustServCalls appear to be highly
significant.
summary(ModelFit)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9296 -0.5154 -0.3607 -0.2183 2.9254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.357608 0.595565 -8.996 <2e-16 ***
## AccountWeeks 0.001857 0.001595 1.164 0.2444
## ContractRenewal -1.960875 0.163852 -11.967 <2e-16 ***
## DataPlan -2.789980 0.268358 -10.396 <2e-16 ***
## CustServCalls 0.499149 0.044815 11.138 <2e-16 ***
## DayCalls 0.002578 0.003127 0.824 0.4097
## MonthlyCharge 0.064526 0.006834 9.442 <2e-16 ***
## OverageFee 0.022548 0.027352 0.824 0.4097
## RoamMins 0.053895 0.023234 2.320 0.0204 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2071.8 on 2500 degrees of freedom
## Residual deviance: 1667.6 on 2492 degrees of freedom
## AIC: 1685.6
##
## Number of Fisher Scoring iterations: 5
sort(varImp(ModelFit), decreasing = TRUE)
## Overall
## ContractRenewal 11.9673220
## CustServCalls 11.1379741
## DataPlan 10.3964674
## MonthlyCharge 9.4415713
## RoamMins 2.3196894
## AccountWeeks 1.1640158
## DayCalls 0.8244680
## OverageFee 0.8243733
# Reconstructed step (the code is missing in the original): refit on the
# significant predictors and predict on the test set; the exact variable set
# used here is an assumption.
ModelFit2 = glm(Churn ~ ContractRenewal + DataPlan + CustServCalls +
                  MonthlyCharge + RoamMins, data = train, family = binomial)
ChurnPred2 = predict(ModelFit2, test, type = "response")
head(ChurnPred2)
##          2          8         10         16         17         19
## 0.05555871 0.11386086 0.29679246 0.74103784 0.12968470 0.03108179
contrasts(test$Churn)
## 1
## 0 0
## 1 1
# Reconstructed AUC computation for the refitted model (code missing in the original).
Pred2 = prediction(ChurnPred2, test$Churn)
AUCPred2 = performance(Pred2, measure = "auc")
AUCPred2@y.values[[1]]
## [1] 0.8287687
KNN Modeling
. The caret package provides the train() method for training our data with various
algorithms; we just pass different parameter values for different algorithms. Before
train(), we first use the trainControl() method, which controls the computational side
of train().
. We set three parameters of trainControl(). The method parameter holds the resampling
method.
. The number parameter holds the number of resampling iterations. The repeats parameter
holds the number of complete sets of folds to compute for our repeated cross-validation.
. We set number to 10 and repeats to 3. trainControl() returns a list, which we pass on
to train().
. Before training our KNN classifier, we call set.seed() for reproducibility.
. To train a KNN classifier, train() is called with the method parameter set to "knn".
. We pass our target variable Churn. The formula Churn ~ . means all other attributes are
used as predictors, with Churn as the target. The trControl parameter is given the result
of our trainControl() call.
. The preProcess parameter preprocesses the training data; for a distance-based method
like KNN this step is essential. We pass two values, "center" and "scale", which centre
and scale the data.
. After preprocessing, the training variables have mean approximately 0 and standard
deviation 1. The tuneLength parameter holds an integer and controls how many candidate
values of k are tuned over.
trctrl = trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(5555)
KNNFit = train(Churn~., data = train, method = "knn",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
. The output shows the Accuracy and Kappa metrics for different values of k, from which
the best k is selected automatically. Here, our training run chooses k = 9 as the final
value. We can see how Accuracy varies with k by plotting the results.
KNNFit
## k-Nearest Neighbors
##
## 2501 samples
## 8 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2251, 2251, 2252, 2250, 2250, 2251, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8811255 0.3740296
## 7 0.8809922 0.3606353
## 9 0.8825842 0.3487018
## 11 0.8801816 0.3194724
## 13 0.8752503 0.2815914
## 15 0.8739170 0.2664852
## 17 0.8733794 0.2480511
## 19 0.8717820 0.2285708
## 21 0.8705863 0.2130558
## 23 0.8703186 0.2109972
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(KNNFit)
. Now that our model is trained with k = 9, we are ready to predict classes for our test
set using the predict() method provided by caret. We pass two arguments: the first is our
trained model, and the second, newdata, holds our testing data frame. predict() returns a
vector of predicted classes, which we save in the TestPred variable.
TestPred = predict(KNNFit, newdata = test)
TestPred
## [1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
## [176] 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [211] 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [316] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [351] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [386] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [421] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [491] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [526] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [561] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [596] 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [631] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [666] 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [701] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [736] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [771] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
## [806] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1
Confusion Matrix
. The balanced accuracy has improved and is now at roughly 64%.
. The sensitivity, a measure of the true positive rate (TP/(TP+FN)), is now at 29%, a
slight increase.
. The specificity, or true negative rate (TN/(TN+FP)), has remained around 99%.
. The positive predictive value has gone up by about 20 percentage points to 80%.
. The negative predictive value has remained at 89%.
confusionMatrix(TestPred, test$Churn, positive = "1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 703 85
## 1 9 35
##
## Accuracy : 0.887
## 95% CI : (0.8635, 0.9077)
## No Information Rate : 0.8558
## P-Value [Acc > NIR] : 0.004854
##
## Kappa : 0.3787
##
## Mcnemar's Test P-Value : 1.029e-14
##
## Sensitivity : 0.29167
## Specificity : 0.98736
## Pos Pred Value : 0.79545
## Neg Pred Value : 0.89213
## Prevalence : 0.14423
## Detection Rate : 0.04207
## Detection Prevalence : 0.05288
## Balanced Accuracy : 0.63951
##
## 'Positive' Class : 1
##
# Reconstructed step (the code is missing in the original): extract the balanced
# accuracy, i.e. the mean of sensitivity and specificity.
KNNcm = confusionMatrix(TestPred, test$Churn, positive = "1")
KNNcm$byClass[["Balanced Accuracy"]]
## [1] 0.6395131
Naive Bayes
. A Naive Bayes classifier (from the e1071 package) is also fitted on the training data
and used to predict the test set.
# Reconstructed step (the fitting code is missing in the original):
NBModel = naiveBayes(Churn ~ ., data = train)
NBChrPred = predict(NBModel, newdata = test)
length(NBChrPred)
## [1] 832
table(pred=NBChrPred,test[,1])
##
## pred 0 1
## 0 656 74
## 1 56 46
str(NBChrPred)
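. From the table above, the overall accuracy of the Naive Bayes predictions works out to
(656 + 46) / 832, about 84%, with 46 of the 120 actual churners correctly identified.
(656 + 46) / 832
## [1] 0.84375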