Capstone Project - Final Submission
Group Members:
1. Anoop K
2. Ketan Koul
3. Manish Muralidharan
Objective
CredX is a leading credit card provider that receives thousands of credit card applications every year. Over the
past few years, however, the company has experienced an increase in credit loss. The CEO believes that the best
strategy to mitigate credit risk is to acquire the right customers.
Using past data on the bank's applicants, a model is to be built to determine the factors affecting credit risk,
create strategies to mitigate acquisition risk, and assess the financial benefit of the project.
Problem Solving Methodology
The problem is approached in R, following the CRISP-DM framework.
The analysis uses two data sets:
• Demographic Data: This is obtained from the information provided by the applicants at the time of
credit card application. It contains customer-level information on age, gender, income, marital status, etc.
This contains a total of 11 features.
• Credit Bureau Data: This is taken from the credit bureau and contains variables such as 'number of
times 30 DPD or worse in last 3/6/12 months', 'outstanding balance', 'number of trades', etc. This data set
contains a total of 18 features.
Both files contain 71295 rows of data and share a common variable called Application ID.
Both files also contain a variable called Performance Tag, which indicates whether the applicant went
90 days past due or worse (i.e. defaulted) in the 12 months after getting a credit card.
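The join on Application ID can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the project; the three sample records and the helper name merge_on_application_id are hypothetical, and only the column names come from the data description above.

```python
# Hypothetical miniature records; the real files hold 71295 rows each.
demographics = [
    {"Application ID": 101, "Age": 34, "Income": 27.0, "Performance Tag": 0},
    {"Application ID": 102, "Age": 45, "Income": 40.0, "Performance Tag": 1},
    {"Application ID": 103, "Age": 29, "Income": 12.0, "Performance Tag": None},
]
credit_bureau = [
    {"Application ID": 101, "Total No of Trades": 4, "Performance Tag": 0},
    {"Application ID": 102, "Total No of Trades": 9, "Performance Tag": 1},
    {"Application ID": 103, "Total No of Trades": 2, "Performance Tag": None},
]


def merge_on_application_id(left, right, key="Application ID"):
    """Inner-join two lists of records on the shared key column."""
    right_by_id = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_id.get(row[key])
        if match is not None:
            # Columns present in both files (e.g. Performance Tag) keep the left value.
            merged.append({**match, **row})
    return merged


merged = merge_on_application_id(demographics, credit_bureau)
print(len(merged), sorted(merged[0]))
```

An inner join is assumed here because both files are described as having the same number of rows and the same key.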
Data Cleaning & Preparation
Demographics Data
Sl No | Variable | Description | Discrepancy | Possible Data Correction
4 | Marital Status | Marital status of customer (at the time of application) | Distribution: Married: 60730, Single: 10559, NA: 6 | NA values replaced by Married
6 | Income | Income of customers | Values range from -0.50 to 60. There are 107 rows where the income is 0 or less. | Since income cannot be negative or 0, the rows containing negative values or 0 can be ignored.
12 | Performance Tag | Status of customer performance (1 represents "Default") | Values are 0, 1 or NA. NA values represent rejected applicants. | Wherever the value is NA, such rows will be copied to another dataset and used for scorecard validation.
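The three corrections listed for the demographics data could look roughly like this. It is a Python sketch for illustration only (the project itself used R); the function name clean_demographics and the sample rows are hypothetical, and whether the income filter was also applied to the rejected applicants is not stated in the source, so it is applied only to the modelling rows here.

```python
from collections import Counter


def clean_demographics(rows):
    """Return (modelling_rows, rejected_rows) after the corrections in the table."""
    # Most frequent non-missing marital status ("Married" in the source data).
    mode_status = Counter(
        r["Marital Status"] for r in rows if r["Marital Status"] is not None
    ).most_common(1)[0][0]

    modelling, rejected = [], []
    for r in rows:
        r = dict(r)  # work on a copy so the input is left untouched
        if r["Marital Status"] is None:
            r["Marital Status"] = mode_status
        if r["Performance Tag"] is None:
            rejected.append(r)   # set aside for scorecard validation
        elif r["Income"] > 0:
            modelling.append(r)  # rows with zero or negative income are dropped
    return modelling, rejected


# Hypothetical sample rows
rows = [
    {"Marital Status": "Married", "Income": 27.0, "Performance Tag": 0},
    {"Marital Status": "Married", "Income": 40.0, "Performance Tag": 1},
    {"Marital Status": None, "Income": 12.0, "Performance Tag": 0},
    {"Marital Status": "Married", "Income": -0.5, "Performance Tag": 0},
    {"Marital Status": "Single", "Income": 33.0, "Performance Tag": None},
]
modelling, rejected = clean_demographics(rows)
print(len(modelling), len(rejected))
```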
Credit Bureau Data
Sl No | Variable | Description | Discrepancy | Possible Data Correction
2 | No of times 90 DPD or worse in last 6 months | Number of times the customer has not paid dues for 90 days or more in the last 6 months | No missing/incorrect values | Not applicable
3 | No of times 60 DPD or worse in last 6 months | Number of times the customer has not paid dues for 60 days or more in the last 6 months | No missing/incorrect values | Not applicable
4 | No of times 30 DPD or worse in last 6 months | Number of times the customer has not paid dues for 30 days or more in the last 6 months | No missing/incorrect values | Not applicable
5 | No of times 90 DPD or worse in last 12 months | Number of times the customer has not paid dues for 90 days or more in the last 12 months | No missing/incorrect values | Not applicable
6 | No of times 60 DPD or worse in last 12 months | Number of times the customer has not paid dues for 60 days or more in the last 12 months | No missing/incorrect values | Not applicable
7 | No of times 30 DPD or worse in last 12 months | Number of times the customer has not paid dues for 30 days or more in the last 12 months | No missing/incorrect values | Not applicable
9 | No of trades opened in last 6 months | Number of trades the customer has made in the last 6 months | 1 row with NA value | NA value replaced by 2, which is both the mean and the median value
10 | No of trades opened in last 12 months | Number of trades the customer has made in the last 12 months | No missing/incorrect values | Not applicable
17 | Total No of Trades | Total number of trades the customer has made | No missing/incorrect values | Not applicable
18 | Presence of open auto loan | Whether the customer has an auto loan (1 represents "Yes") | No missing/incorrect values | Not applicable
19 | Performance Tag | Status of customer performance (1 represents "Default") | Values are 0, 1 or NA. NA values represent rejected applicants. | Wherever the value is NA, such rows will be copied to another dataset and used for scorecard validation.
Outlier Treatment using Boxplots
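A boxplot flags as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below (Python, for illustration; the project used R) caps such points at the fences. Capping, rather than removing, the outliers is an assumption, as are the function name cap_outliers and the sample values; R's boxplot also computes its hinges slightly differently from the quantile method used here.

```python
import statistics


def cap_outliers(values, k=1.5):
    """Cap values lying beyond the boxplot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, lower, upper


# Hypothetical sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped, lower, upper = cap_outliers(sample)
print(capped[-1], lower, upper)
```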
The ideal way to understand the relationship between the variables and the quality of a customer is to
plot it.
Exploratory Data Analysis
Using the RIV package, we create the WoE table for all the variables and determine how important each
variable is in predicting whether a customer will default. For the analysis, we replace the actual
values of all the variables with the corresponding WoE values and store the data in a separate file (e.g. woe_data)
for further analysis.
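For reference, WoE and IV for a single binned variable can be computed as in the sketch below. It is written in Python for illustration and is not the RIV package's implementation; the sign convention ln(% good / % bad), the function name woe_iv and the bin counts are assumptions.

```python
import math


def woe_iv(bins):
    """bins maps a bin label to (n_good, n_bad); returns ({label: woe}, iv)."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe, iv = {}, 0.0
    for label, (good, bad) in bins.items():
        pct_good = good / total_good
        pct_bad = bad / total_bad
        # A bin with zero goods or zero bads would need smoothing before this step.
        woe[label] = math.log(pct_good / pct_bad)
        iv += (pct_good - pct_bad) * woe[label]
    return woe, iv


# Hypothetical counts of non-defaulters (good) and defaulters (bad) per bin
bins = {"low": (400, 10), "mid": (300, 20), "high": (300, 70)}
woe, iv = woe_iv(bins)
print({label: round(value, 3) for label, value in woe.items()}, round(iv, 3))
```

Replacing each raw value with the WoE of its bin gives the kind of data set stored in woe_data.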
The table below shows the IV values and their predictive power.
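Variables in the Final Model
1. Income
2. Residence.Stability
3. Office.Stability
4. Profession.xSE
Confusion Matrix (rows: Prediction, columns: Reference)
Prediction | No | Yes
No | 10808 | 393
Yes | 9157 | 488
Accuracy: 0.5419
Sensitivity: 0.5413
Specificity: 0.5539
(These sensitivity and specificity values take "No" as the positive class.)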
Model Evaluation
Confusion Matrix and Statistics
Model Type: Logistic Regression Model on Merged Dataset
Variables in the Final Model
1. Avg.CC.Utl.12M
2. X30DPD.6M
3. X90DPD.12M
4. X60DPD.12M
5. X30DPD.12M
6. PL.Trades.6M
7. PL.Trades.12M
8. Inq.12M.Excl.Home.Auto.Loan
9. Home.Loan
Confusion Matrix (rows: Prediction, columns: Reference)
Prediction | No | Yes
No | 12494 | 328
Yes | 7471 | 553
Cut-Off Value: 0.512
Accuracy: 62.59%
Sensitivity: 62.77%
Specificity: 62.58%
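The reported statistics follow directly from the confusion matrix. The short Python check below (illustrative; the function name confusion_metrics is hypothetical) takes "Yes" (default) as the positive class and reproduces the figures for the logistic regression model on the merged dataset.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Accuracy, sensitivity and specificity with "Yes" (default) as the positive class."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),  # share of actual defaulters predicted "Yes"
        "specificity": tn / (tn + fp),  # share of actual non-defaulters predicted "No"
    }


# Cells of the matrix: Prediction No/Reference No = 12494, Prediction No/Reference Yes = 328,
# Prediction Yes/Reference No = 7471, Prediction Yes/Reference Yes = 553.
metrics = confusion_metrics(tn=12494, fn=328, fp=7471, tp=553)
print({name: round(value * 100, 2) for name, value in metrics.items()})
```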
Model Evaluation
Confusion Matrix and Statistics
Model Type: Logistic Regression Model on Merged Dataset with WOE Data
Variables in the Final Model
1. Avg.CC.Utl.12M
2. X30DPD.6M
3. X90DPD.12M
4. X60DPD.12M
5. X30DPD.12M
6. PL.Trades.6M
7. PL.Trades.12M
8. Inq.12M.Excl.Home.Auto.Loan
9. Home.Loan
Confusion Matrix (rows: Prediction, columns: Reference)
Prediction | No | Yes
No | 12447 | 322
Yes | 7518 | 559
Cut-Off Value: 0.522
Accuracy: 62.39%
Sensitivity: 63.45%
Specificity: 62.34%
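The reported cut-offs give nearly equal sensitivity and specificity, which is consistent with picking the threshold where the two curves cross. Whether the cut-off was actually chosen this way is an assumption; the Python sketch below only illustrates that approach, with hypothetical function names and toy probabilities.

```python
def sens_spec(probs, labels, cutoff):
    """Sensitivity and specificity when p >= cutoff is predicted as default (1)."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(probs, labels):
    """Scan cut-offs 0.01 to 0.99 and return the first with the smallest gap."""
    def gap(cutoff):
        sens, spec = sens_spec(probs, labels, cutoff)
        return abs(sens - spec)

    return min((i / 100 for i in range(1, 100)), key=gap)


# Hypothetical predicted default probabilities and true labels (1 = default)
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff = balanced_cutoff(probs, labels)
print(cutoff, sens_spec(probs, labels, cutoff))
```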
Model Evaluation
Confusion Matrix and Statistics
Model Type: Decision Tree Model on Merged Dataset
Variables in the Final Model
1. X30DPD.6M
2. X90DPD.12M
Confusion Matrix (rows: Prediction, columns: Reference)
Prediction | No | Yes
No | 13880 | 394
Yes | 6085 | 487
Confusion Matrix (rows: Prediction, columns: Reference)
Prediction | No | Yes
No | 12454 | 342
Yes | 7511 | 539
Based on the evaluation of the various models we built, we have chosen the Logistic Regression Model
on WOE Binned Data as our Final Model for building CredX's Underwriting Scorecard.
To further analyse the performance of this model, we ran it on the Rejected
Applicants' data.
On this data, about 52.7% (751 of 1425) of the applicants who were rejected by CredX were also rejected by the new
Underwriting Model.
CredX Credit Card Scorecard
CredX's new Credit Card Underwriting Scorecard is based on the Logistic Regression model built on WOE Binned Data.
The Scorecard was created and validated as follows:
• The final cut-off score was found to be 331.
• The Scorecard was run on the Rejected Applicants, whose scores ranged from 288 to 483.
• 674 of the 1425 Rejected Applicants (47.30%) cleared the cut-off under the new Scorecard.
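A common way to turn a logistic regression probability into a scorecard score is to scale the log-odds linearly. The Python sketch below illustrates that idea. Only the cut-off of 331 comes from this document; the base score, base odds and points-to-double-odds values are assumptions chosen for illustration and may differ from those actually used.

```python
import math

BASE_SCORE = 400    # assumed score at the base odds
BASE_ODDS = 10      # assumed good:bad odds at the base score
PDO = 20            # assumed points needed to double the odds
CUTOFF_SCORE = 331  # final cut-off reported above

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score(p_default):
    """Scorecard score for a predicted probability of default."""
    odds_good = (1 - p_default) / p_default
    return OFFSET + FACTOR * math.log(odds_good)


def approve(p_default):
    """An applicant clears the scorecard when the score reaches the cut-off."""
    return score(p_default) >= CUTOFF_SCORE


print(round(score(1 / 11)), approve(1 / 11), approve(0.9))
```

Under these assumed parameters, each doubling of the good:bad odds adds 20 points to the score.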
Financial Benefit of Project
• Total loss due to approval of defaulters after the model is applied to the dataset = 1092 × 1,00,000 = 10,92,00,000
• Loss without any model was 29,37,00,000, so the model reduces the loss by 18,45,00,000.
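The arithmetic behind these figures can be checked with a few lines of Python. The variable names are illustrative; the loss of 1,00,000 per approved defaulter and both totals are taken from the bullets above, written here without Indian digit grouping.

```python
LOSS_PER_DEFAULTER = 100_000           # 1,00,000 per approved defaulter
approved_defaulters_with_model = 1092

loss_with_model = approved_defaulters_with_model * LOSS_PER_DEFAULTER
loss_without_model = 293_700_000       # 29,37,00,000 in Indian digit grouping

savings = loss_without_model - loss_with_model
reduction_pct = savings / loss_without_model * 100

print(loss_with_model, savings, round(reduction_pct, 1))
```

The saving of 18,45,00,000 corresponds to a reduction of roughly 63% in credit loss.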
Thank You