
Telecom Customer Churn

Salma Mohiuddin

24/11/2019

Introduction
Customer churn is a burning problem for telecom companies. In this project, we simulate
one such case of customer churn, working on a dataset of postpaid customers with a
contract. The data has information about customer usage behavior, contract details and
payment details. The data also indicates which customers canceled their service. Based on
this past data, we need to build a model that can predict whether a customer will cancel
their service in the future or not.

Loading the libraries


library(ggplot2)
library(corrplot)

## corrplot 0.84 loaded

library(psych)

##
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':


##
## %+%, alpha

library(DataExplorer)
library(car)

## Loading required package: carData

##
## Attaching package: 'car'

## The following object is masked from 'package:psych':


##
## logit

library(dplyr)

##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

library(purrr)

##
## Attaching package: 'purrr'

## The following object is masked from 'package:car':


##
## some

library(caret)

## Loading required package: lattice

##
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':


##
## lift

library(contrast)

## Loading required package: rms

## Loading required package: Hmisc

## Loading required package: survival

##
## Attaching package: 'survival'

## The following object is masked from 'package:caret':


##
## cluster

## Loading required package: Formula

##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize

## The following object is masked from 'package:psych':


##
## describe

## The following objects are masked from 'package:base':


##
## format.pval, units

## Loading required package: SparseM

##
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':


##
## backsolve

##
## Attaching package: 'rms'

## The following objects are masked from 'package:car':


##
## Predict, vif

library(miscset)

##
## Attaching package: 'miscset'

## The following object is masked from 'package:dplyr':


##
## collapse

library(ROCR)

## Loading required package: gplots

##
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':


##
## lowess

library(class)
library(e1071)

##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute

Setting up the workspace and Loading the dataset


Loading and attaching the data set - the data worksheet from the Excel spreadsheet was
copied over to a CSV file for ease of operations. Since the dataset contains 3333
observations, we print just the top and bottom records using the head() and tail() functions.

setwd("E:/WorkSpace/Predictive Analytics/Project")

MyOrigData = read.csv("CellphoneCSV.csv",header=TRUE)
attach(MyOrigData)
head(MyOrigData)

## Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls


## 1 0 128 1 1 2.7 1
## 2 0 107 1 1 3.7 1
## 3 0 137 1 0 0.0 0
## 4 0 84 0 0 0.0 2
## 5 0 75 0 0 0.0 3
## 6 0 118 0 0 0.0 0
## DayMins DayCalls MonthlyCharge OverageFee RoamMins
## 1 265.1 110 89 9.87 10.0
## 2 161.6 123 82 9.78 13.7
## 3 243.4 114 52 6.06 12.2
## 4 299.4 71 57 3.10 6.6
## 5 166.7 113 41 7.42 10.1
## 6 223.4 98 57 11.03 6.3

tail(MyOrigData)

## Churn AccountWeeks ContractRenewal DataPlan DataUsage CustServCalls


## 3328 0 79 1 0 0.00 2
## 3329 0 192 1 1 2.67 2
## 3330 0 68 1 0 0.34 3
## 3331 0 28 1 0 0.00 2
## 3332 0 184 0 0 0.00 2
## 3333 0 74 1 1 3.70 0
## DayMins DayCalls MonthlyCharge OverageFee RoamMins
## 3328 134.7 98 40.0 9.49 11.8
## 3329 156.2 77 71.7 10.78 9.9
## 3330 231.1 57 56.4 7.67 9.6
## 3331 180.8 109 56.0 14.44 14.1
## 3332 213.8 105 50.0 7.98 5.0
## 3333 234.4 113 100.0 13.30 13.7
Exploratory Data Analysis
. The dataset contains 3333 observations and 11 variables, of which one is the response
variable and the others are independent/predictor variables
dim(MyOrigData)

## [1] 3333 11

Names of the variables in the dataset


The output that we are going to work with and predict is “Churn”, the response variable. As
can be seen, all the columns have been aptly named, so there is no need for any change.
. “Churn” is the response variable and contains binary values - 1 indicating the customer
has churned and 0 indicating the customer has not churned. This will need to be changed
to a factor
. “AccountWeeks” indicates the number of weeks the customer has had an active account
. “ContractRenewal” has binary values too, with 1 indicating that the customer recently
renewed the contract and 0 that the contract with the same service provider was not
renewed
. “DataPlan” contains binary values too, with 1 if the customer has a data plan from the
service provider and 0 if no data plan has been taken
. “DataUsage” is the monthly data usage, recorded in gigabytes
. “CustServCalls” indicates the number of calls the customer made to customer service
. “DayMins” records the average daytime minutes per month
. “DayCalls” is the average number of daytime calls made in the month
. “MonthlyCharge” is the average monthly bill
. “OverageFee” is the largest overage fee in the last 12 months
. “RoamMins” is the average number of roaming minutes in the month
names(MyOrigData)

## [1] "Churn" "AccountWeeks" "ContractRenewal"


## [4] "DataPlan" "DataUsage" "CustServCalls"
## [7] "DayMins" "DayCalls" "MonthlyCharge"
## [10] "OverageFee" "RoamMins"

The dataset has 6 integer variables and 5 numeric variables


str(MyOrigData)
## 'data.frame': 3333 obs. of 11 variables:
## $ Churn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AccountWeeks : int 128 107 137 84 75 118 121 147 117 141 ...
## $ ContractRenewal: int 1 1 1 0 0 0 1 0 1 0 ...
## $ DataPlan : int 1 1 0 0 0 0 1 0 0 1 ...
## $ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
## $ CustServCalls : int 1 1 0 2 3 0 3 0 1 0 ...
## $ DayMins : num 265 162 243 299 167 ...
## $ DayCalls : int 110 123 114 71 113 98 88 79 97 84 ...
## $ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
## $ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
## $ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...

describe(MyOrigData)

## MyOrigData
##
## 11 Variables 3333 Observations
## ---------------------------------------------------------------------------
## Churn
## n missing distinct Info Sum Mean Gmd
## 3333 0 2 0.372 483 0.1449 0.2479
##
## ---------------------------------------------------------------------------
## AccountWeeks
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 212 1 101.1 45.01 35 50
## .25 .50 .75 .90 .95
## 74 101 127 152 167
##
## lowest : 1 2 3 4 5, highest: 221 224 225 232 243
## ---------------------------------------------------------------------------
## ContractRenewal
## n missing distinct Info Sum Mean Gmd
## 3333 0 2 0.263 3010 0.9031 0.1751
##
## ---------------------------------------------------------------------------
## DataPlan
## n missing distinct Info Sum Mean Gmd
## 3333 0 2 0.6 922 0.2766 0.4003
##
## ---------------------------------------------------------------------------
## DataUsage
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 174 0.839 0.8165 1.202 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 1.78 3.05 3.46
##
## lowest : 0.00 0.11 0.12 0.13 0.14, highest: 4.59 4.64 4.73 4.75 5.40
## ---------------------------------------------------------------------------
## CustServCalls
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 10 0.932 1.563 1.392 0 0
## .25 .50 .75 .90 .95
## 1 1 2 3 4
##
## lowest : 0 1 2 3 4, highest: 5 6 7 8 9
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 697 1181 759 429 166 66 22 9 2 2
## Proportion 0.209 0.354 0.228 0.129 0.050 0.020 0.007 0.003 0.001 0.001
## ---------------------------------------------------------------------------
## DayMins
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 1667 1 179.8 61.46 89.92 110.32
## .25 .50 .75 .90 .95
## 143.70 179.40 216.40 249.58 270.74
##
## lowest : 0.0 2.6 7.8 7.9 12.5, highest: 335.5 337.4 345.3 346.8 350.8
## ---------------------------------------------------------------------------
## DayCalls
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 119 1 100.4 22.59 67.0 74.2
## .25 .50 .75 .90 .95
## 87.0 101.0 114.0 126.0 133.0
##
## lowest : 0 30 35 36 40, highest: 157 158 160 163 165
## ---------------------------------------------------------------------------
## MonthlyCharge
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 627 1 56.31 18.35 33.26 38.00
## .25 .50 .75 .90 .95
## 45.00 53.50 66.20 80.50 87.80
##
## lowest : 14.0 15.7 16.0 17.0 19.0, highest: 108.3 108.6 108.7 110.0 111.3
## ---------------------------------------------------------------------------
## OverageFee
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 1024 1 10.05 2.86 5.94 6.84
## .25 .50 .75 .90 .95
## 8.33 10.07 11.77 13.29 14.22
##
## lowest : 0.00 1.56 2.11 2.13 2.20, highest: 17.55 17.58 17.71 18.09 18.19
## ---------------------------------------------------------------------------
## RoamMins
## n missing distinct Info Mean Gmd .05 .10
## 3333 0 162 1 10.24 3.114 5.7 6.7
## .25 .50 .75 .90 .95
## 8.5 10.3 12.1 13.7 14.7
##
## lowest : 0.0 1.1 1.3 2.0 2.1, highest: 18.2 18.3 18.4 18.9 20.0
## ---------------------------------------------------------------------------

Summary of all the variables


. From the summary below, we can see that “Churn” contains binary values, as do
“ContractRenewal” and “DataPlan”, with 0 (No) and 1 (Yes)
. “AccountWeeks” has a min of 1.0 and max of 243 with a mean of 101.1, the number of
weeks the customer account has been active
. “DataUsage” has a min of 0, max of 5.4000 and mean of 0.8165 gigabytes of data used
during the month. Customers who didn’t opt for a data plan will have low or zero data
usage
. “CustServCalls” has a min of 0 and max of 9 with a mean of 1.563 calls made to customer
service
. “DayMins” has 0 as min and 350.8 as max with a mean of 179.8 average daytime minutes
used by the customer per month
. “DayCalls” has 0 as min and 165 as max with a mean of 100.4 and is the average number
of daytime calls made by the customer during the month
. “MonthlyCharge” has 14 as min and 111.30 as max with a mean of 56.31 and is the average
monthly bill subject to the plan opted for from the service provider
. “OverageFee” has 0 as min and 18.19 as max with a mean of 10.05 and indicates the largest
overage fee in the last 12 months
. “RoamMins” has 0 as min and 20.00 as max with a mean of 10.24 and is the average
number of roaming minutes for the month
summary(MyOrigData)
## Churn AccountWeeks ContractRenewal DataPlan
## Min. :0.0000 Min. : 1.0 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 74.0 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.0000 Median :101.0 Median :1.0000 Median :0.0000
## Mean :0.1449 Mean :101.1 Mean :0.9031 Mean :0.2766
## 3rd Qu.:0.0000 3rd Qu.:127.0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :243.0 Max. :1.0000 Max. :1.0000
## DataUsage CustServCalls DayMins DayCalls
## Min. :0.0000 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:143.7 1st Qu.: 87.0
## Median :0.0000 Median :1.000 Median :179.4 Median :101.0
## Mean :0.8165 Mean :1.563 Mean :179.8 Mean :100.4
## 3rd Qu.:1.7800 3rd Qu.:2.000 3rd Qu.:216.4 3rd Qu.:114.0
## Max. :5.4000 Max. :9.000 Max. :350.8 Max. :165.0
## MonthlyCharge OverageFee RoamMins
## Min. : 14.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 45.00 1st Qu.: 8.33 1st Qu.: 8.50
## Median : 53.50 Median :10.07 Median :10.30
## Mean : 56.31 Mean :10.05 Mean :10.24
## 3rd Qu.: 66.20 3rd Qu.:11.77 3rd Qu.:12.10
## Max. :111.30 Max. :18.19 Max. :20.00

Univariate variables
. Plotting histograms for all the variables in the dataset.
. The independent variables “AccountWeeks”, “DayCalls”, “DayMins”, “MonthlyCharge”,
“OverageFee” and “RoamMins” are approximately normally distributed
. “CustServCalls” is skewed to the right: many customers make only a few calls while a few
customers make a large number of calls to customer service, which shows the
dissatisfaction level.
. “DataUsage” has a roughly normal distribution around the middle with a large spike at
zero, highlighting either no or low data usage by the majority of the customers, who may
be getting their data usage outside this service provider
plot_histogram(MyOrigData, title = "Histogram of all the variables")
. Also, from the data graph exploration below, we can see that customers who have not
renewed their contract are much more likely (about 74%) to churn compared to those who
renew their contracts, so we need to try to get the renewals done through a follow-up to
retain these customers
ggplot(MyOrigData) +
geom_bar(aes(x = ContractRenewal, fill = Churn), position = "dodge")
. And customers who do not have a data plan are more likely (about 20%) to churn
compared to those who have a data plan, so focus also needs to be given to offers that
retain existing customers at low cost
ggplot(MyOrigData) +
geom_bar(aes(x = DataPlan, fill = Churn), position = "dodge")
. Similarly, we can see that customers who have made more calls to customer service have
been the most dissatisfied with the service, the likelihood of churning increasing with the
number of calls
. On average, customers who made 4 calls were about 50% likely to churn and customers
making 9 calls were 100% likely to churn, so it is important to look at the concerns causing
dissatisfaction; this will help in retaining customers (the observed churn rates behind
these three plots are computed in the sketch after the plots)
ggplot(MyOrigData) +
geom_bar(aes(x = CustServCalls, fill = Churn), position = "dodge")
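. To put numbers behind the three plots above, the observed churn rate per group can be
computed directly. A minimal sketch using dplyr (already loaded); Churn is still an integer
at this point, so mean(Churn) gives the churn proportion (output omitted)
# Churn rate by group - a quick check of the percentages quoted above
MyOrigData %>% group_by(ContractRenewal) %>% summarise(ChurnRate = mean(Churn))
MyOrigData %>% group_by(DataPlan) %>% summarise(ChurnRate = mean(Churn))
MyOrigData %>% group_by(CustServCalls) %>% summarise(ChurnRate = mean(Churn))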
Multivariate variables
. The correlation matrix table shows the correlation coefficients between the variables;
each cell in the table shows the correlation between two variables. From the correlation
matrix below:
. DataPlan and DataUsage show a high positive correlation with each other, both being
independent variables
. MonthlyCharge shows high positive correlation with DataPlan and DataUsage, moderate
positive correlation with DayMins, and low positive correlation with OverageFee and
RoamMins
. Churn shows a low positive correlation with CustServCalls and DayMins and a low
negative correlation with ContractRenewal
corrplot(cor(MyOrigData))
Check for any missing Values
. None of the variables have missing values, therefore no action is required and the dataset
is complete
MyOrigData %>% map(~ sum(is.na(.)))

## $Churn
## [1] 0
##
## $AccountWeeks
## [1] 0
##
## $ContractRenewal
## [1] 0
##
## $DataPlan
## [1] 0
##
## $DataUsage
## [1] 0
##
## $CustServCalls
## [1] 0
##
## $DayMins
## [1] 0
##
## $DayCalls
## [1] 0
##
## $MonthlyCharge
## [1] 0
##
## $OverageFee
## [1] 0
##
## $RoamMins
## [1] 0

Identifying Outliers
. Another useful visualization is the box and whisker plot. This gives us a slightly more
high-level visual of our data and helps us identify outliers, if any. Let’s take a look at box
and whisker plots for all the independent variables; we can see outliers
. Most of the values are concentrated around the median; values falling well below the 1st
quartile and values well above the 3rd quartile appear as outliers. In my view, these don’t
need to be treated
. The outliers on MonthlyCharge are likely to be those customers whose usage has been
higher, producing some higher values
. A few customers have had active accounts for far more weeks than most, so AccountWeeks
shows some very extreme values highlighted as outliers
. We see outliers for DayMins and DayCalls as well. Moving these customers to a plan that
better matches their usage, to save them some money on their bill, might help retain them.
boxplot(MonthlyCharge, DayMins, DayCalls,
AccountWeeks,OverageFee,RoamMins,DataUsage,
main = "Multiple boxplots for comparision",
at = c(1,2,4,5,6,7,8),
names = c("MonthlyCharge", "DayMins", "DayCalls",
"AccountWeeks","OverageFee","RoamMins","DataUsage" ),
col = "red",
border = "brown")
Variable Transformation
. We need to change the response variable “Churn” from integer to factor to be able to
apply the classification model (logistic regression) to predict the customers who are likely
to churn in the future.
. The rest of the variables are retained as is
MyOrigData$Churn <- as.factor(MyOrigData$Churn)

Summary of Churn
. As can be seen from the table below, 483 customers have churned in the past and 2850
have not. This means about 14.5% of the customers in the given observations have churned,
which shows a big imbalance in the data we are planning to predict
table(Churn)

## Churn
## 0 1
## 2850 483
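. As a quick check of this imbalance, the class proportions can be computed in one line
(Churn here still refers to the attached original integer column)
prop.table(table(Churn))   # ~0.855 non-churners vs ~0.145 churners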
Correlation analysis using correlation matrix
CorMatrix = cor(MyOrigData[sapply(MyOrigData, is.numeric)])
print(CorMatrix, digits=4)

## AccountWeeks ContractRenewal DataPlan DataUsage


## AccountWeeks 1.000000 -0.024735 0.002918 0.014391
## ContractRenewal -0.024735 1.000000 -0.006006 -0.019223
## DataPlan 0.002918 -0.006006 1.000000 0.945982
## DataUsage 0.014391 -0.019223 0.945982 1.000000
## CustServCalls -0.003796 0.024522 -0.017824 -0.021723
## DayMins 0.006216 -0.049396 -0.001684 0.003176
## DayCalls 0.038470 -0.003755 -0.011086 -0.007962
## MonthlyCharge 0.012581 -0.047291 0.737490 0.781660
## OverageFee -0.006749 -0.019105 0.021526 0.019637
## RoamMins 0.009514 -0.045871 -0.001318 0.162746
## CustServCalls DayMins DayCalls MonthlyCharge OverageFee
## AccountWeeks -0.003796 0.006216 0.038470 0.012581 -0.006749
## ContractRenewal 0.024522 -0.049396 -0.003755 -0.047291 -0.019105
## DataPlan -0.017824 -0.001684 -0.011086 0.737490 0.021526
## DataUsage -0.021723 0.003176 -0.007962 0.781660 0.019637
## CustServCalls 1.000000 -0.013423 -0.018942 -0.028017 -0.012964
## DayMins -0.013423 1.000000 0.006750 0.567968 0.007038
## DayCalls -0.018942 0.006750 1.000000 -0.007963 -0.021449
## MonthlyCharge -0.028017 0.567968 -0.007963 1.000000 0.281766
## OverageFee -0.012964 0.007038 -0.021449 0.281766 1.000000
## RoamMins -0.009640 -0.010155 0.021565 0.117433 -0.011023
## RoamMins
## AccountWeeks 0.009514
## ContractRenewal -0.045871
## DataPlan -0.001318
## DataUsage 0.162746
## CustServCalls -0.009640
## DayMins -0.010155
## DayCalls 0.021565
## MonthlyCharge 0.117433
## OverageFee -0.011023
## RoamMins 1.000000

Checking for multi-collinearity using correlation matrix


. The summary shows ContractRenewal, CustServCalls and RoamMins to be highly
significant
ChkReg = lm(Churn~., MyOrigData[sapply(MyOrigData, is.numeric)])
summary(ChkReg)

##
## Call:
## lm(formula = Churn ~ ., data = MyOrigData[sapply(MyOrigData,
## is.numeric)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66572 -0.16629 -0.08236 0.02060 1.08844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.433e-01 5.363e-02 -2.672 0.007580 **
## AccountWeeks 8.888e-05 1.396e-04 0.637 0.524402
## ContractRenewal -2.993e-01 1.882e-02 -15.904 < 2e-16 ***
## DataPlan -4.175e-02 4.381e-02 -0.953 0.340650
## DataUsage -2.835e-02 1.933e-01 -0.147 0.883401
## CustServCalls 5.829e-02 4.222e-03 13.804 < 2e-16 ***
## DayMins 1.021e-03 3.272e-03 0.312 0.754936
## DayCalls 3.409e-04 2.769e-04 1.231 0.218433
## MonthlyCharge 1.428e-03 1.924e-02 0.074 0.940838
## OverageFee 1.046e-02 3.280e-02 0.319 0.749780
## RoamMins 8.765e-03 2.307e-03 3.800 0.000147 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3203 on 3322 degrees of freedom
## Multiple R-squared: 0.1747, Adjusted R-squared: 0.1722
## F-statistic: 70.31 on 10 and 3322 DF, p-value: < 2.2e-16

Checking for multi-collinearity using VIF


vif(ChkReg)

## AccountWeeks ContractRenewal DataPlan DataUsage


## 1.003791 1.007216 12.473470 1964.800207
## CustServCalls DayMins DayCalls MonthlyCharge
## 1.001945 1031.490608 1.002935 3243.300555
## OverageFee RoamMins
## 224.639750 1.346583

Checking for multicollinearity after removing a few variables


. Removing DataUsage and DayMins one after the other to see if the variance inflation
changes. DataUsage and DataPlan are related, since data usage depends on the data plan
taken; therefore DataUsage, which is costly to collect, can be removed and DataPlan
retained instead. Similarly with DayMins
. As can be seen, removing DataUsage brings most VIF values down, though MonthlyCharge
and DayMins remain elevated; after also removing DayMins, all VIF values drop to
acceptable levels and MonthlyCharge is no longer a cause of concern
ChkReg = lm(Churn ~ AccountWeeks + ContractRenewal + DataPlan + CustServCalls +
              DayMins + DayCalls + MonthlyCharge + OverageFee + RoamMins,
            MyOrigData[sapply(MyOrigData, is.numeric)])
vif(ChkReg)

## AccountWeeks ContractRenewal DataPlan CustServCalls


## 1.003392 1.006436 12.360704 1.001801
## DayMins DayCalls MonthlyCharge OverageFee
## 7.836054 1.002927 21.151777 2.468149
## RoamMins
## 1.343233

ChkReg = lm(Churn ~ AccountWeeks + ContractRenewal + DataPlan + CustServCalls +
              DayCalls + MonthlyCharge + OverageFee + RoamMins,
            MyOrigData[sapply(MyOrigData, is.numeric)])
vif(ChkReg)

## AccountWeeks ContractRenewal DataPlan CustServCalls


## 1.002484 1.006433 2.449611 1.001755
## DayCalls MonthlyCharge OverageFee RoamMins
## 1.002877 2.707015 1.194202 1.041569

Kaiser-Meyer-Olkin (KMO) test


. The KMO test is a statistical measure that indicates the proportion of variance in the
variables that might be common variance. High values (close to 1.0) generally indicate that
a factor analysis may be useful for the data. If the value is less than 0.50, the results of the
factor analysis probably won’t be very useful.
. As can be seen, the overall MSA is below 0.5, which means a factor analysis will not be
useful: the sampling adequacy is insufficient for dimension reduction
KMO(CorMatrix)

## Kaiser-Meyer-Olkin factor adequacy


## Call: KMO(r = CorMatrix)
## Overall MSA = 0.29
## MSA for each item =
## AccountWeeks ContractRenewal DataPlan DataUsage
## 0.37 0.63 0.85 0.34
## CustServCalls DayMins DayCalls MonthlyCharge
## 0.66 0.10 0.57 0.34
## OverageFee RoamMins
## 0.03 0.15

Bartlett’s Test of Sphericity


. Another test performed on a correlation matrix; it tests whether the variables are
unrelated (i.e. the correlation matrix is an identity matrix) and therefore unsuitable for
structure detection. Small values (less than 0.05) of the significance level indicate that a
factor analysis may be useful for the data. Given this is unbalanced data, some amount of
variation is to be expected
cortest.bartlett(CorMatrix)

## Warning in cortest.bartlett(CorMatrix): n not specified, 100 used

## $chisq
## [1] 1009.556
##
## $p.value
## [1] 3.436763e-182
##
## $df
## [1] 45
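. The warning above shows that the sample size was not supplied, so the test defaulted to
n = 100. A small sketch passing the actual number of observations instead:
cortest.bartlett(CorMatrix, n = nrow(MyOrigData))   # uses the real n = 3333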

Identify Eigenvalues, Relative and Cumulative Variance


. Scree plot using eigenvalues - we can see a sharp bend at 2 in the elbow curve, but we
could also take 3 or 5, since after this point there is very little change in the curve, and a
similar pattern can be seen at 7. Therefore, K can be picked as 5 or 7 at this point
ev = eigen(CorMatrix)
ev

## eigen() decomposition
## $values
## [1] 2.7439248478 1.2293230352 1.0817999166 1.0153187841 0.9991982241
## [6] 0.9856449649 0.9663599709 0.9315024377 0.0467728635 0.0001549552
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.01078436 -0.02615847 0.446911049 -0.36719778 -0.131136347
## [2,] 0.03287982 0.21062824 -0.427542299 -0.45672320 -0.173903976
## [3,] -0.54844367 0.30753623 -0.057661817 -0.05543676 -0.065570012
## [4,] -0.56378272 0.30151119 0.031868803 0.01096438 0.005714547
## [5,] 0.02491787 0.06585684 -0.261556961 -0.29508478 0.740921389
## [6,] -0.18723676 -0.75464372 -0.004786375 -0.25963262 0.195129328
## [7,] 0.00755354 -0.01308997 0.492682432 -0.44865945 -0.236970614
## [8,] -0.57019226 -0.27972913 -0.043344407 -0.04190411 0.018059062
## [9,] -0.10602860 -0.33487574 -0.247826777 0.36565607 -0.366139112
## [10,] -0.08980178 0.07126138 0.488538442 0.40266179 0.413097534
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.638599950 0.32007265 0.37030204 0.0076779048 2.243482e-04
## [2,] -0.146281760 -0.38518250 0.60261893 -0.0031562925 -3.387951e-04
## [3,] 0.007680696 0.09431575 -0.11330169 0.7563983438 1.812073e-03
## [4,] -0.002025028 0.02651264 -0.02870603 -0.5339087915 -5.508919e-01
## [5,] 0.423889697 -0.21500864 -0.25431578 -0.0016357041 1.566310e-04
## [6,] -0.267211784 0.09978660 0.12975195 0.1792480522 -3.995283e-01
## [7,] -0.128171959 -0.54354793 -0.43325748 -0.0008887734 -3.139859e-05
## [8,] -0.009598871 -0.04328868 0.04043845 -0.2946537862 7.086956e-01
## [9,] 0.544384795 -0.45628408 -0.04070791 0.0811571383 -1.861244e-01
## [10,] -0.082274234 -0.42177015 0.46211361 0.1311234952 3.094454e-04

EigenVector = ev$vectors
EigenValues = ev$values
plot(EigenValues,col="Red",xlab = "Factors",ylab="Eigen Values")
lines(EigenValues,col="Blue")

Logistic regression
. Logistic regression is a classification algorithm used to predict a binary outcome. There
are various metrics to evaluate a logistic regression model, such as the confusion matrix
and the AUC-ROC curve, which will be covered further down
. Three different models are fitted below, each taking a few variables off, to identify the
one with highly significant variables so that less significant variables can be left aside to
arrive at the right model
. Churn is the response variable and the others are independent variables
. First model (Model) - the intercept, ContractRenewal, CustServCalls and RoamMins
appear significant
. Second model (Model1) - apart from the intercept, ContractRenewal and CustServCalls,
we see DataPlan and MonthlyCharge appearing significant this time
. Third model (Model2) - the intercept, ContractRenewal, CustServCalls, DataPlan and
MonthlyCharge all appear significant. Therefore, removal of the DataUsage variable has
given much better coefficients
Model = glm(Churn ~ ., data = MyOrigData, family = "binomial")
Model1 = glm(Churn ~ CustServCalls + ContractRenewal + DataPlan + DataUsage +
               MonthlyCharge, data = MyOrigData, family = "binomial")
Model2 = glm(Churn ~ CustServCalls + ContractRenewal + DataPlan + MonthlyCharge,
             data = MyOrigData, family = "binomial")

summary(Model)

##
## Call:
## glm(formula = Churn ~ ., family = "binomial", data = MyOrigData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0058 -0.5112 -0.3477 -0.2093 2.9981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9510252 0.5486763 -10.846 < 2e-16 ***
## AccountWeeks 0.0006525 0.0013873 0.470 0.638112
## ContractRenewal -1.9855172 0.1436107 -13.826 < 2e-16 ***
## DataPlan -1.1841611 0.5363668 -2.208 0.027262 *
## DataUsage 0.3636565 1.9231751 0.189 0.850021
## CustServCalls 0.5081349 0.0389682 13.040 < 2e-16 ***
## DayMins 0.0174407 0.0324841 0.537 0.591337
## DayCalls 0.0036523 0.0027497 1.328 0.184097
## MonthlyCharge -0.0275526 0.1909074 -0.144 0.885244
## OverageFee 0.1868114 0.3256902 0.574 0.566248
## RoamMins 0.0789226 0.0220522 3.579 0.000345 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.3 on 3332 degrees of freedom
## Residual deviance: 2188.4 on 3322 degrees of freedom
## AIC: 2210.4
##
## Number of Fisher Scoring iterations: 5

summary(Model1)

##
## Call:
## glm(formula = Churn ~ CustServCalls + ContractRenewal + DataPlan +
## DataUsage + MonthlyCharge, family = "binomial", data = MyOrigData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9947 -0.5123 -0.3545 -0.2141 2.9915
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.62363 0.33459 -13.819 < 2e-16 ***
## CustServCalls 0.50301 0.03872 12.990 < 2e-16 ***
## ContractRenewal -1.99467 0.14284 -13.964 < 2e-16 ***
## DataPlan -1.85246 0.49382 -3.751 0.000176 ***
## DataUsage -0.41453 0.17464 -2.374 0.017617 *
## MonthlyCharge 0.07511 0.00580 12.949 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.3 on 3332 degrees of freedom
## Residual deviance: 2204.2 on 3327 degrees of freedom
## AIC: 2216.2
##
## Number of Fisher Scoring iterations: 5

summary(Model2)

##
## Call:
## glm(formula = Churn ~ CustServCalls + ContractRenewal + DataPlan +
## MonthlyCharge, family = "binomial", data = MyOrigData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9540 -0.5169 -0.3546 -0.2154 3.0407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.464629 0.326144 -13.69 <2e-16 ***
## CustServCalls 0.504947 0.038701 13.05 <2e-16 ***
## ContractRenewal -1.979032 0.142307 -13.91 <2e-16 ***
## DataPlan -2.931443 0.224182 -13.08 <2e-16 ***
## MonthlyCharge 0.071188 0.005531 12.87 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2758.3 on 3332 degrees of freedom
## Residual deviance: 2209.8 on 3328 degrees of freedom
## AIC: 2219.8
##
## Number of Fisher Scoring iterations: 5

Checking for multicollinearity between the independent variables


vif(glm(Churn ~ AccountWeeks + CustServCalls + ContractRenewal + DataPlan +
          MonthlyCharge + DayCalls + OverageFee + RoamMins,
        data = MyOrigData, family = "binomial"))

## AccountWeeks CustServCalls ContractRenewal DataPlan


## 1.002899 1.081718 1.054632 2.659204
## MonthlyCharge DayCalls OverageFee RoamMins
## 2.880191 1.003920 1.172855 1.031487

Removing variables
. Dropping the DataUsage and DayMins variables, which are redundant and costly to collect
. Instead we have DataPlan and DayCalls, which provide much the same information and
are easier to gather, so those are retained
myTrainData = select(MyOrigData,-DataUsage)
myTrainData = select(myTrainData,-DayMins)

table(Churn)

## Churn
## 0 1
## 2850 483

corrplot(cor(myTrainData[sapply(myTrainData, is.numeric)]))
Selecting random seed to reproduce results
set.seed(5)

Sampling 75% of the rows


Sample = createDataPartition(y = myTrainData$Churn, p=0.75, list=FALSE)

Train & Test split as 75% & 25% respectively


. The train dataset contains 2501 observations
. The test dataset contains 832 observations
train = myTrainData[Sample,]
test = myTrainData[-Sample,]
dim(train)

## [1] 2501 9

dim(test)

## [1] 832 9
Fitting the model
ModelFit = glm(Churn~., data=train, family=binomial)
summary(ModelFit)

##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9296 -0.5154 -0.3607 -0.2183 2.9254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.357608 0.595565 -8.996 <2e-16 ***
## AccountWeeks 0.001857 0.001595 1.164 0.2444
## ContractRenewal -1.960875 0.163852 -11.967 <2e-16 ***
## DataPlan -2.789980 0.268358 -10.396 <2e-16 ***
## CustServCalls 0.499149 0.044815 11.138 <2e-16 ***
## DayCalls 0.002578 0.003127 0.824 0.4097
## MonthlyCharge 0.064526 0.006834 9.442 <2e-16 ***
## OverageFee 0.022548 0.027352 0.824 0.4097
## RoamMins 0.053895 0.023234 2.320 0.0204 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2071.8 on 2500 degrees of freedom
## Residual deviance: 1667.6 on 2492 degrees of freedom
## AIC: 1685.6
##
## Number of Fisher Scoring iterations: 5

Make Predictions
. Optimizing the model by removing independent variables. After removing independent
variables from the model, we need to make sure the residual deviance does not increase
and the AIC value decreases.
. Then we predict values for the testing dataset and assess the accuracy of the model. The
type is set to "response", which means we want predicted probabilities for the testing
dataset.
ChurnPred = predict(ModelFit, test, type="response")
head(ChurnPred)

## 2 8 10 16 17 19
## 0.05501892 0.11314444 0.30971705 0.77194855 0.13201872 0.02970964
. Looking at the response encoding
contrasts(train$Churn)

## 1
## 0 0
## 1 1

. Converting the probabilities to 1 and 0, as the predictions need to be a factor to print a
confusion matrix


GlmPred = rep(0, length(ChurnPred))
GlmPred[ChurnPred > 0.5] = 1
GlmPred <- as.factor(GlmPred)

Confusion matrix
. Since we are dealing with unbalanced data, a check on accuracy alone is not appropriate;
therefore we will look at other measures to assess the model
. It looks like the model is performing fairly well. If we were to predict that all results in the
test set were the majority class (0), the accuracy would be 85.6% (the No Information Rate)
. The response classes are clearly unbalanced.
. The sensitivity, which is a measure of the true positive rate (TP/(TP+FN)), is at about 22%.
. The specificity, or true negative rate (TN/(TN+FP)), is 98%.
. This tells us that our model has correctly identified only about 22% of the people that
actually churned.
. Positive predicted value is at 60%
. Negative predicted value is at 88%
confusionMatrix(GlmPred, as.factor(test$Churn), positive = "1")

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 695 94
## 1 17 26
##
## Accuracy : 0.8666
## 95% CI : (0.8416, 0.889)
## No Information Rate : 0.8558
## P-Value [Acc > NIR] : 0.2018
##
## Kappa : 0.2629
##
## Mcnemar's Test P-Value : 5.449e-13
##
## Sensitivity : 0.21667
## Specificity : 0.97612
## Pos Pred Value : 0.60465
## Neg Pred Value : 0.88086
## Prevalence : 0.14423
## Detection Rate : 0.03125
## Detection Prevalence : 0.05168
## Balanced Accuracy : 0.59640
##
## 'Positive' Class : 1
##
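. These measures can also be derived by hand from the cell counts in the confusion matrix
above; a small sketch:
TP = 26; FN = 94; TN = 695; FP = 17   # cells from the matrix printed above
TP / (TP + FN)   # sensitivity, ~0.217
TN / (TN + FP)   # specificity, ~0.976
TP / (TP + FP)   # positive predicted value, ~0.605
TN / (TN + FN)   # negative predicted value, ~0.881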

ROC
. Now let us find the threshold using the ROC curve; the threshold value used here is 0.5.
The baseline performance from random guessing gives a 45-degree line, so we can see that
our model is outperforming random guessing, which is good. The AUC measure is 0.81,
which is greater than 0.5 (the baseline model), which is also good.
ResTr = predict(ModelFit,train,type = "response")
Pred = prediction(ResTr,train$Churn)
ROCRperf = performance(Pred,"tpr","fpr")
plot(ROCRperf,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))
AUC
. The AUC can take on any value between 0 and 1, with 1 being the best. This is an easy way
to reduce the ROC to a single number for evaluating a model. As can be seen, our model
has an AUC of 0.81, which is pretty good. If we were to just make random guesses, our ROC
would be a 45-degree line, corresponding to an AUC of 0.5.
AUCPred = performance(Pred, measure = "auc")
AUCPred = AUCPred@y.values[[1]]
AUCPred

## [1] 0.8128835

. Let us see if we can further improve the model. We can take a look at the summary of the
fit and identify which features are significant (p-value < 0.05). As can be seen below,
ContractRenewal, DataPlan, MonthlyCharge and CustServCalls appear to be highly
significant
summary(ModelFit)

##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9296 -0.5154 -0.3607 -0.2183 2.9254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.357608 0.595565 -8.996 <2e-16 ***
## AccountWeeks 0.001857 0.001595 1.164 0.2444
## ContractRenewal -1.960875 0.163852 -11.967 <2e-16 ***
## DataPlan -2.789980 0.268358 -10.396 <2e-16 ***
## CustServCalls 0.499149 0.044815 11.138 <2e-16 ***
## DayCalls 0.002578 0.003127 0.824 0.4097
## MonthlyCharge 0.064526 0.006834 9.442 <2e-16 ***
## OverageFee 0.022548 0.027352 0.824 0.4097
## RoamMins 0.053895 0.023234 2.320 0.0204 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2071.8 on 2500 degrees of freedom
## Residual deviance: 1667.6 on 2492 degrees of freedom
## AIC: 1685.6
##
## Number of Fisher Scoring iterations: 5
sort(varImp(ModelFit), decreasing = TRUE)

## Overall
## ContractRenewal 11.9673220
## CustServCalls 11.1379741
## DataPlan 10.3964674
## MonthlyCharge 9.4415713
## RoamMins 2.3196894
## AccountWeeks 1.1640158
## DayCalls 0.8244680
## OverageFee 0.8243733

Modelfit2 = glm(Churn ~ ContractRenewal + DataPlan + MonthlyCharge +
                  CustServCalls + RoamMins + DayCalls,
                data = train, family = binomial)
ChurnPred2 = predict(Modelfit2, test, type = "response")
head(ChurnPred2)

## 2 8 10 16 17 19
## 0.05555871 0.11386086 0.29679246 0.74103784 0.12968470 0.03108179

contrasts(test$Churn)

## 1
## 0 0
## 1 1

GlmPred2 = rep(0, length(ChurnPred2))
GlmPred2[ChurnPred2 > 0.5] = 1
GlmPred2 <- as.factor(GlmPred2)
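. Before comparing confusion matrices, the residual deviance/AIC criterion mentioned
earlier can be checked directly on the two fitted models above; a small sketch:
AIC(ModelFit, Modelfit2)                     # the reduced model's AIC should not be much higher
anova(Modelfit2, ModelFit, test = "Chisq")   # likelihood-ratio test of the dropped terms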

. The balanced accuracy value hasn’t changed and remains at 59%


. The sensitivity, which is a measure of the true positive rate (TP/(TP+FN)), is now at about 21%
. The specificity, or true negative rate (TN/(TN+FP)), has remained at 98%.
. Positive predicted value has remained at about 60%
. Negative predicted value has remained at 88%
. Likely there is a good amount of multicollinearity in the original model with all of the
features.
. From a predictive analysis standpoint, we can see that excluding certain
features/variables that are not significant does not influence the overall performance of the
model
confusionMatrix(GlmPred2, as.factor(test$Churn), positive = "1")

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 695 95
## 1 17 25
##
## Accuracy : 0.8654
## 95% CI : (0.8403, 0.8879)
## No Information Rate : 0.8558
## P-Value [Acc > NIR] : 0.2312
##
## Kappa : 0.2528
##
## Mcnemar's Test P-Value : 3.443e-13
##
## Sensitivity : 0.20833
## Specificity : 0.97612
## Pos Pred Value : 0.59524
## Neg Pred Value : 0.87975
## Prevalence : 0.14423
## Detection Rate : 0.03005
## Detection Prevalence : 0.05048
## Balanced Accuracy : 0.59223
##
## 'Positive' Class : 1
##

ROC and AUC


. The AUC is nearly unchanged and is at 83%
Pred2 = prediction(ChurnPred2,test$Churn)
ROCRperf2 = performance(Pred2,"tpr","fpr")
plot(ROCRperf2,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))
AUCPred2 = performance(Pred2, measure = "auc")
AUCPred2 = AUCPred2@y.values[[1]]
AUCPred2

## [1] 0.8287687

KNN Modeling
. The caret package provides the train() method for training our data with various
algorithms; we just need to pass different parameter values for different algorithms.
Before train(), we first use the trainControl() method, which controls the computational
side of the train() method.
. We are setting three parameters of the trainControl() method. The method parameter
holds the details about the resampling method.
. The number parameter holds the number of resampling iterations. The repeats parameter
contains the number of complete sets of folds to compute for our repeated cross-validation.
. We are setting number to 10 and repeats to 3. trainControl() returns a list, which we pass
on to our train() method.
. Before training our knn classifier, we call set.seed() for reproducibility.
. For training a knn classifier, train() should be called with the method parameter set to
"knn".
. We are passing our target variable Churn. The formula Churn ~ . denotes using all other
attributes in our classifier with Churn as the target variable. The trControl parameter is
passed the result of our trainControl() call.
. The preProcess parameter is for preprocessing our training data, and for knn
preprocessing is essential. We pass 2 values in our preProcess parameter, "center" and
"scale", which center and scale the data.
. After preprocessing, the training data has a mean of approximately 0 and a standard
deviation of 1. The tuneLength parameter holds an integer value; this is for tuning our
algorithm.
trctrl = trainControl(method = "repeatedcv", number = 10, repeats =3)
set.seed(5555)
KNNFit = train(Churn~., data = train, method = "knn",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)

. The output shows the Accuracy and Kappa metrics for different k values. From the results,
it automatically selects the best k value. Here, our training model chooses k = 9 as its final
value. We can see the variation in Accuracy with respect to k by plotting these in a graph.
KNNFit

## k-Nearest Neighbors
##
## 2501 samples
## 8 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2251, 2251, 2252, 2250, 2250, 2251, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8811255 0.3740296
## 7 0.8809922 0.3606353
## 9 0.8825842 0.3487018
## 11 0.8801816 0.3194724
## 13 0.8752503 0.2815914
## 15 0.8739170 0.2664852
## 17 0.8733794 0.2480511
## 19 0.8717820 0.2285708
## 21 0.8705863 0.2130558
## 23 0.8703186 0.2109972
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

plot(KNNFit)

. Now our model is trained with k = 9. We are ready to predict classes for our test set using
the predict() method provided by the caret package. We pass 2 arguments: the first is our
trained model, and the second, newdata, holds our testing data frame. predict() returns a
factor of predicted classes, which we save in the TestPred variable.
TestPred = predict(KNNFit, newdata = test)
TestPred

## [1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
## [176] 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [211] 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [316] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [351] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [386] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [421] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [491] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [526] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [561] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [596] 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [631] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [666] 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [701] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [736] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [771] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0
## [806] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Levels: 0 1

Confusion Matrix
. The balanced accuracy value has gone up and is now at about 64%
. The sensitivity, which is a measure of the true positive rate (TP/(TP+FN)), is now at 29%,
a slight increase
. The specificity, or true negative rate (TN/(TN+FP)), has remained at around 99%.
. Positive predicted value has gone up by about 20 points and is at 80%
. Negative predicted value has remained at 89%
confusionMatrix(TestPred, test$Churn, positive = "1" )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 703 85
## 1 9 35
##
## Accuracy : 0.887
## 95% CI : (0.8635, 0.9077)
## No Information Rate : 0.8558
## P-Value [Acc > NIR] : 0.004854
##
## Kappa : 0.3787
##
## Mcnemar's Test P-Value : 1.029e-14
##
## Sensitivity : 0.29167
## Specificity : 0.98736
## Pos Pred Value : 0.79545
## Neg Pred Value : 0.89213
## Prevalence : 0.14423
## Detection Rate : 0.04207
## Detection Prevalence : 0.05288
## Balanced Accuracy : 0.63951
##
## 'Positive' Class : 1
##

ROC and AUC


TestPred = prediction(as.vector(TestPred, mode = "numeric"),
                      as.vector(test$Churn, mode = "numeric"))
ROCRperf3 = performance(TestPred,"tpr","fpr")
plot(ROCRperf3,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))
. This gives an AUC of 63%
AUCPred3 = performance(TestPred, measure = "auc")
AUCPred3 = AUCPred3@y.values[[1]]
AUCPred3

## [1] 0.6395131
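. Note that the ROC above is built from hard class labels, so it has a single operating point
and its AUC coincides with the balanced accuracy. caret can also return class probabilities,
which trace out a fuller curve; a sketch under that assumption (TestProb and PredProb are
illustrative names):
TestProb = predict(KNNFit, newdata = test, type = "prob")   # per-class probabilities
PredProb = prediction(TestProb[, 2],                        # column 2 = P(Churn = "1")
                      as.vector(test$Churn, mode = "numeric"))
plot(performance(PredProb, "tpr", "fpr"), colorize = TRUE)
performance(PredProb, measure = "auc")@y.values[[1]]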

Naive Bayes Model


. The balanced accuracy value has gone up again and is now at about 65%
. The sensitivity, which is a measure of the true positive rate (TP/(TP+FN)), is at 38% with
this model
. The specificity, or true negative rate (TN/(TN+FP)), is at 92%.
. Positive predicted value is at 45%
. Negative predicted value is at 90%
str(train)

## 'data.frame': 2501 obs. of 9 variables:


## $ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ AccountWeeks : int 128 137 84 75 118 121 117 65 74 168 ...
## $ ContractRenewal: int 1 1 0 0 0 1 1 1 1 1 ...
## $ DataPlan : int 1 0 0 0 0 1 0 0 0 0 ...
## $ CustServCalls : int 1 0 2 3 0 3 1 4 0 1 ...
## $ DayCalls : int 110 114 71 113 98 88 97 137 127 96 ...
## $ MonthlyCharge : num 89 52 57 41 57 87.3 63.9 44.9 49.4 31 ...
## $ OverageFee : num 9.87 6.06 3.1 7.42 11.03 ...
## $ RoamMins : num 10 12.2 6.6 10.1 6.3 7.5 8.7 12.7 9.1 11.2 ...

NbModel = naiveBayes(train[,2:9], train[,1])


NBChrPred = predict(NbModel,test[,-1])
str(NBChrPred)

## Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...

length(NBChrPred)

## [1] 832

table(pred=NBChrPred,test[,1])

##
## pred 0 1
## 0 656 74
## 1 56 46

confusionMatrix(NBChrPred,test[,1], positive = "1")

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 656 74
## 1 56 46
##
## Accuracy : 0.8438
## 95% CI : (0.8173, 0.8678)
## No Information Rate : 0.8558
## P-Value [Acc > NIR] : 0.8498
##
## Kappa : 0.3249
##
## Mcnemar's Test P-Value : 0.1360
##
## Sensitivity : 0.38333
## Specificity : 0.92135
## Pos Pred Value : 0.45098
## Neg Pred Value : 0.89863
## Prevalence : 0.14423
## Detection Rate : 0.05529
## Detection Prevalence : 0.12260
## Balanced Accuracy : 0.65234
##
## 'Positive' Class : 1
##

ROC and AUC


. The AUC is nearly unchanged and is at 81%
str(test$Churn)

## Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...

str(NBChrPred)

## Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
