Predictive Modelling Project - Business Report
1. Data Ingestion: Read the dataset, compute descriptive statistics, check for
null values and multicollinearity, and write an inference on the results.
2. Data Split: Split the data into train and test sets and build Logistic
Regression, KNN and Naive Bayes models.
3. Performance Metrics: Check the performance of each model using a confusion
matrix.
4. Final Model: Compare all the models and write an inference on which model is
best/optimized.
5. Inference: Based on these predictions, state the business insights and
recommendations.
Variable Description
Variable          Description
Churn             1 if customer cancelled service, 0 if not
Account Weeks     number of weeks customer has had active account
ContractRenewal   1 if customer recently renewed contract, 0 if not
Data Plan         1 if customer has data plan, 0 if not
Data Usage        gigabytes of monthly data usage
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
DayCalls          average number of daytime calls
Monthly Charge    average monthly bill
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes
Exploratory Data Analysis
1. Check Data Structure
The results below show that there are 11 variables and 3333 observations in the
data; all variables are numeric.
Of the 3333 customers, 483 cancelled their service, which gives a churn ratio
of 14.5%.
Bar chart of data plan variable
1. Churn does not seem to be highly correlated with any of the variables.
2. Monthly Charge is highly correlated with Data Usage, Data Plan and
Day Mins.
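The correlation check behind these two observations can be sketched as follows. This is a minimal illustration, not the report's actual analysis: the column values are made-up toy numbers, and the 0.7 cut-off for "highly correlated" is an assumption chosen for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Toy, made-up values (NOT the report's data): the charge is built to
# rise with both daytime minutes and data usage.
columns = {
    "DayMins": [120.0, 180.0, 240.0, 150.0, 300.0, 210.0],
    "DataUsage": [0.0, 2.5, 0.0, 1.0, 3.0, 0.5],
    "MonthlyCharge": [18.5, 51.0, 36.5, 33.0, 74.0, 37.0],
}

THRESHOLD = 0.7  # assumed cut-off for "highly correlated"

names = list(columns)
high_pairs = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(columns[first], columns[second])
        if abs(r) >= THRESHOLD:
            high_pairs.append((first, second))

print(high_pairs)
```

On the real dataset the same loop would run over all predictor columns, and any pair above the chosen threshold becomes a candidate for the multicollinearity check.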
The dataset can be freed from multicollinearity by removing the variables
Monthly Charge and Data Usage.
After removing these two variables, all the VIF values are quite low, so
multicollinearity no longer affects the dataset.
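As a rough illustration of the VIF check, the sketch below computes the variance inflation factor of each column as 1 / (1 - R^2), where R^2 comes from regressing that column on the remaining ones. It assumes NumPy is available and uses synthetic data with hypothetical column names; the coefficients that generate the synthetic charge are invented for the example and are not estimates from the report.

```python
import numpy as np


def vif(X):
    """VIF of each column of X (rows = observations, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - (residual @ residual) / ((target - target.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


# Synthetic stand-in data (assumption): the charge is nearly a linear
# combination of minutes and data usage, which is what inflates its VIF.
rng = np.random.default_rng(0)
day_mins = rng.normal(180, 50, 500)
data_usage = rng.uniform(0, 4, 500)
cust_serv_calls = rng.integers(0, 8, 500).astype(float)
monthly_charge = 0.17 * day_mins + 10 * data_usage + rng.normal(0, 1, 500)

full = np.column_stack([day_mins, data_usage, cust_serv_calls, monthly_charge])
reduced = np.column_stack([day_mins, cust_serv_calls])  # two variables dropped

vif_all = vif(full)
vif_reduced = vif(reduced)
print("VIF, all columns:", vif_all.round(1))
print("VIF, after dropping charge and usage:", vif_reduced.round(2))
```

In this toy setup the charge column has a very large VIF while the reduced set stays close to 1, which mirrors the pattern described above.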
Churn ratio
Out of the 3333 records taken for analysis, there are 483 churn cases.
Thus, the churn ratio is 14.5%.
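The arithmetic behind the churn ratio is shown below as a small sketch. The variable names are illustrative; the share of non-churners is included only as a reference point, since it equals the accuracy a model would reach by always predicting "no churn".

```python
total_records = 3333
churn_cases = 483

churn_ratio = churn_cases / total_records
no_churn_share = 1 - churn_ratio  # accuracy of always predicting "no churn"

print(f"Churn ratio: {churn_ratio:.1%}")
print(f"Share of non-churners: {no_churn_share:.1%}")
```

Because the classes are this imbalanced, accuracy alone says little about how well a model identifies churners.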
Logistic Regression:
Logistic regression gives the best result when the irrelevant and correlated
variables identified by the VIF analysis, i.e. Data Usage and Monthly Charge,
are excluded.
KNN Model
Model Comparison
Parameters          Logistic Regression   KNN   Naïve Bayes
Accuracy            85%                   90%   86%
Sensitivity         15%                   81%   57%
Specificity         97%                   91%   88%
Balanced Accuracy   56%                   86%   73%
Total Loss          49%                   19%   42%
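For reference, the sketch below shows how accuracy, sensitivity, specificity and balanced accuracy follow from the four cells of a confusion matrix, with churn treated as the positive class. The confusion-matrix counts are hypothetical and serve only to illustrate the formulas. The second part recomputes balanced accuracy from the sensitivity and specificity reported in the table, as a consistency check on that row.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive common metrics from confusion-matrix counts (churn = positive)."""
    sensitivity = tp / (tp + fn)  # share of actual churners detected
    specificity = tn / (tn + fp)  # share of non-churners correctly identified
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts, for illustration only (not taken from the report).
example = confusion_metrics(tp=40, fn=10, fp=5, tn=45)

# Reported (sensitivity %, specificity %, balanced accuracy %) per model.
reported = {
    "Logistic Regression": (15, 97, 56),
    "KNN": (81, 91, 86),
    "Naive Bayes": (57, 88, 73),
}

# Balanced accuracy is the mean of sensitivity and specificity.
recomputed = {
    name: (sens + spec) / 2 for name, (sens, spec, _) in reported.items()
}
for name, value in recomputed.items():
    print(f"{name}: balanced accuracy of about {value:.1f}%")
```

The recomputed values agree with the table to within rounding of the reported percentages.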