
Capstone Project

FINAL SUBMISSION

Group Name:
1. Anoop K
2. Ketan Koul
3. Manish Muralidharan
Objective

CredX is a leading credit card provider that receives thousands of credit card applications every year. In the past few years, however, the company has experienced an increase in credit loss. The CEO believes that the best strategy to mitigate credit risk is to acquire the right customers.

Using past data on the bank's applicants, a model is to be built to determine the factors affecting credit risk, create strategies to mitigate the acquisition risk, and assess the financial benefit of the project.
Problem Solving Methodology

The problem is approached using the R language, following the CRISP-DM framework. The steps involved in the process are:

1. Business & Data Understanding
2. Data Cleaning & Preparation
3. Exploratory Data Analysis (EDA)
4. Data Transformation & Model Building
5. Model Evaluation
6. Building of Application Scorecard
7. Analysis of Financial Benefits of the Final Model
Overview: Data Understanding
There are two data sets in this project: Demographic data and Credit Bureau data.

• Demographic Data: This is obtained from the information provided by the applicants at the time of credit card application. It contains customer-level information on age, gender, income, marital status, etc., for a total of 11 features.
• Credit Bureau Data: This is taken from the credit bureau and contains variables such as 'number of times 30 DPD or worse in last 3/6/12 months', 'outstanding balance' and 'number of trades'. This data set contains a total of 18 features.

Both files contain 71295 rows of data and share a common variable, Application ID.

Both files also contain a variable called Performance Tag, which indicates whether the applicant went 90 days past due or worse (i.e. defaulted) in the 12 months after getting a credit card.
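As an illustration, the two files might be loaded and sanity-checked in R as below. The file and column names are assumptions; the deck does not state them.

```r
# Minimal sketch: load both files and check the shared key (names assumed)
demographic   <- read.csv("Demographic_data.csv",   stringsAsFactors = FALSE)
credit_bureau <- read.csv("Credit_Bureau_data.csv", stringsAsFactors = FALSE)

nrow(demographic)    # expected: 71295
nrow(credit_bureau)  # expected: 71295

# Application ID is the common key; duplicate IDs will surface here
sum(duplicated(demographic$Application.ID))
sum(duplicated(credit_bureau$Application.ID))
```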
Data Cleaning & Preparation
Demographics Data
| Sl No | Variable | Description | Discrepancy | Possible Data Correction |
|-------|----------|-------------|-------------|--------------------------|
| 1 | Application ID | Unique ID of the customer | There are three duplicated rows in the column. | Though the data differs across each duplicated row, the count is insignificant compared to the entire dataset, so these three rows can be deleted. |
| 2 | Age | Age of the customer | As seen from the data summary, values range from -3 to 65, and 88 records have an age below 18. A negative age is impossible, and an applicant must be at least 18 to avail any loan product. | All rows with age below 18 are replaced with 45, which is both the mean and the median. |
| 3 | Gender | Gender of the customer | Distribution: Males 54456, Females 16837, NA 2. | NA values replaced by Male. |
| 4 | Marital Status | Marital status of the customer (at the time of application) | Distribution: Married 60730, Single 10559, NA 6. | NA values replaced by Married. |
| 5 | No of dependents | Number of children of the customer | Values range from 1 to 5; 3 rows have NA. | NA values replaced by the median value of 3. |
| 6 | Income | Income of the customer | Values range from -0.50 to 60; 107 rows have an income of 0 or less. | Since income cannot be 0 or negative, these rows are dropped. |
Data Cleaning & Preparation
Demographics Data
| Sl No | Variable | Description | Discrepancy | Possible Data Correction |
|-------|----------|-------------|-------------|--------------------------|
| 7 | Education | Education of the customer | Values include Bachelor, Masters, Phd, Professional and Others; 118 rows have NA. | Rows with NA replaced by Others. |
| 8 | Profession | Profession of the customer | Values are Salaried, Self Employed and Self Employed Professional; 14 rows have NA. | NA values replaced by Salaried. |
| 9 | Type of residence | Type of residence of the customer | Values are Rented, Owned, Company Provided, Living with Parents and Others; 8 rows have NA. | NA values replaced by Rented. |
| 10 | No of months in current residence | Number of months the customer has lived in the current residence | Values range from 6 to 126 with no discrepancies. | None required. |
| 11 | No of months in current company | Number of months the customer has worked at the current company | Values range from 3 to 133 with no discrepancies. | None required. |
| 12 | Performance Tag | Status of customer performance (1 represents default) | Values are 0, 1 and NA; NA values represent rejected applicants. | Rows with NA are copied to a separate dataset and used for scorecard validation. |
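A sketch of these demographic corrections in R, assuming the columns import with dot-separated names and that the factor labels match those listed above:

```r
# Drop the three duplicated Application ID rows
demographic <- demographic[!duplicated(demographic$Application.ID), ]

# Age below 18 (including negatives) is invalid; 45 is both mean and median
demographic$Age[demographic$Age < 18] <- 45

# Impute the handful of NAs with the dominant category or the median
demographic$Gender[is.na(demographic$Gender)] <- "M"      # label assumed
demographic$Marital.Status[is.na(demographic$Marital.Status)] <- "Married"
demographic$No.of.dependents[is.na(demographic$No.of.dependents)] <- 3
demographic$Education[is.na(demographic$Education)] <- "Others"
demographic$Profession[is.na(demographic$Profession)] <- "SALARIED"
demographic$Type.of.residence[is.na(demographic$Type.of.residence)] <- "Rented"

# Income of 0 or less is invalid; drop those rows
demographic <- demographic[demographic$Income > 0, ]
```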
Data Cleaning & Preparation
Credit Bureau Data
| Sl No | Variable | Description | Discrepancy | Possible Data Correction |
|-------|----------|-------------|-------------|--------------------------|
| 1 | Application ID | Customer application ID | There are three duplicated rows in the column. | Though the data differs across each duplicated row, the count is insignificant compared to the entire dataset, so these three rows can be deleted. |
| 2 | No of times 90 DPD or worse in last 6 months | Number of times the customer has gone 90 days past due or worse in the last 6 months | No missing/incorrect values. | None required. |
| 3 | No of times 60 DPD or worse in last 6 months | Number of times the customer has gone 60 days past due or worse in the last 6 months | No missing/incorrect values. | None required. |
| 4 | No of times 30 DPD or worse in last 6 months | Number of times the customer has gone 30 days past due or worse in the last 6 months | No missing/incorrect values. | None required. |
| 5 | No of times 90 DPD or worse in last 12 months | Number of times the customer has gone 90 days past due or worse in the last 12 months | No missing/incorrect values. | None required. |
| 6 | No of times 60 DPD or worse in last 12 months | Number of times the customer has gone 60 days past due or worse in the last 12 months | No missing/incorrect values. | None required. |
| 7 | No of times 30 DPD or worse in last 12 months | Number of times the customer has gone 30 days past due or worse in the last 12 months | No missing/incorrect values. | None required. |
| 8 | Avg CC Utilization in last 12 months | Average credit card utilization of the customer | 1058 rows have NA. | NA values replaced by the mean value of 30. |
| 9 | No of trades opened in last 6 months | Number of trades the customer opened in the last 6 months | 1 row has NA. | NA replaced by 2, which is both the mean and the median. |
| 10 | No of trades opened in last 12 months | Number of trades the customer opened in the last 12 months | No missing/incorrect values. | None required. |
Data Cleaning & Preparation
Credit Bureau Data
| Sl No | Variable | Description | Discrepancy | Possible Data Correction |
|-------|----------|-------------|-------------|--------------------------|
| 11 | No of PL trades opened in last 6 months | Number of PL (personal loan) trades of the customer in the last 6 months | No missing/incorrect values. | None required. |
| 12 | No of PL trades opened in last 12 months | Number of PL trades of the customer in the last 12 months | No missing/incorrect values. | None required. |
| 13 | No of Inquiries in last 6 months (excluding home & auto loans) | Number of inquiries made by the customer in the last 6 months | No missing/incorrect values. | None required. |
| 14 | No of Inquiries in last 12 months (excluding home & auto loans) | Number of inquiries made by the customer in the last 12 months | No missing/incorrect values. | None required. |
| 15 | Presence of open home loan | Whether the customer has a home loan (1 represents Yes) | There are 272 rows with NA. | Rows with NA replaced by 0, which is both the mean and the median. |
| 16 | Outstanding Balance | Outstanding balance of the customer | There are 272 rows with NA. | Rows with NA are removed. |
| 17 | Total No of Trades | Total number of trades of the customer | No missing/incorrect values. | None required. |
| 18 | Presence of open auto loan | Whether the customer has an auto loan (1 represents Yes) | No missing/incorrect values. | None required. |
| 19 | Performance Tag | Status of customer performance (1 represents default) | Values are 0, 1 and NA; NA values represent rejected applicants. | Rows with NA are copied to a separate dataset and used for scorecard validation. |
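The corresponding credit bureau corrections might look like this (column names hypothetical):

```r
# Drop the three duplicated Application ID rows
credit_bureau <- credit_bureau[!duplicated(credit_bureau$Application.ID), ]

# Avg CC Utilization: 1058 NAs imputed with the mean value of 30
credit_bureau$Avg.CC.Utl.12M[is.na(credit_bureau$Avg.CC.Utl.12M)] <- 30

# Trades opened in last 6 months: one NA imputed with 2 (mean and median)
credit_bureau$Trades.6M[is.na(credit_bureau$Trades.6M)] <- 2

# Home loan flag: 272 NAs imputed with 0 (mean and median)
credit_bureau$Home.Loan[is.na(credit_bureau$Home.Loan)] <- 0

# Outstanding Balance: rows with NA are removed
credit_bureau <- credit_bureau[!is.na(credit_bureau$Outstanding.Balance), ]
```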
Data Cleaning & Preparation
Outlier Treatment using Boxplots

• There are a few outliers in Age with negative values and 0. These can be treated as invalid values and replaced with the median value.
• Though there are a few outliers in No of months in current company, these can be valid cases.
Data Cleaning & Preparation
Outlier Treatment using Boxplots

• There are a few records where average credit card utilization is greater than 100 percent; this is a valid case.
• No outliers in the other attributes.
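The checks behind these observations can be reproduced with base-R boxplots; a sketch, with column names hypothetical:

```r
# Visual outlier checks on the treated variables
boxplot(demographic$Age, main = "Age")
boxplot(demographic$Months.in.current.company,
        main = "No of months in current company")
boxplot(credit_bureau$Avg.CC.Utl.12M,
        main = "Avg CC Utilization in last 12 months")
```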
Exploratory Data Analysis
To gain a better understanding of the data we are working with, we look at the data summary and perform univariate and bivariate analysis on different combinations of the variables.

We also explore the correlation between the variables.

The ideal way to understand the relationship between a variable and the quality of a customer is to plot that relationship.
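One way to produce the correlation view shown a few slides ahead is the corrplot package; a sketch, assuming the merged analysis data is held in a data frame named bankfile (the deck later names this file Bankfile):

```r
library(corrplot)

# Correlation matrix over numeric columns, ignoring NAs pairwise
num_cols <- sapply(bankfile, is.numeric)
corr_mat <- cor(bankfile[, num_cols], use = "pairwise.complete.obs")
corrplot(corr_mat, method = "circle", type = "lower", tl.cex = 0.6)
```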
Exploratory Data Analysis

• Irrespective of the number of dependents, the distribution of defaulters is fairly even.
• For Age, the 40-50 range has the highest number of defaults.
Exploratory Data Analysis

• Marital status, Education and Gender seem to show a trend in the number of defaults.
Exploratory Data Analysis

• Residence, profession and income seem to show a trend in the number of defaults.
Exploratory Data Analysis
As seen from the plot, as the number of occurrences of non-payment past 30, 60 or 90 days increases, the chance of default also increases. The same is observed over the 12-month window, as shown on the next slide.
Exploratory Data Analysis
Plot for last 12 months
Exploratory Data Analysis

Correlation Plot for Merged Dataset

• Income and home loan have a negative correlation with all the other attributes.
• Outstanding balance has a positive correlation with the number of trades (PL and others).
• The counts of 90, 60 and 30 DPD in the last 6 and 12 months have a strong positive correlation with each other.
• The numbers of trades and inquiries in the last 6 and 12 months have a strong positive correlation with each other.
Exploratory Data Analysis
Income has a negative correlation with the number of inquiries. Defaults are high for more than 3 inquiries in the higher-income segment and more than 4 in the lower-income segment, so these cases need to be monitored.

Income also has a negative correlation with DPD, as shown in the correlation matrix: DPD cases decrease as income increases for non-default cases.

Outstanding balance shows no variation as income increases; there is no trend.
Exploratory Data Analysis
As evident from the plots, there is a direct positive correlation between Avg CC Utilization and Total No of Trade Lines, and between Avg CC Utilization and the number of 90+ DPD occurrences in the past 12 months.

Income has a negative correlation with Avg CC Utilization: as income increases, CC utilization decreases for non-default cases. Cases with Avg CC Utilization above 35 in the lower-income segment and above 25 in the higher-income segment need to be monitored for default.
Exploratory Data Analysis
WOE Analysis
In credit risk analytics, Weight of Evidence (WoE) and Information Value (IV) analysis is often used to identify the important variables. The weight of evidence captures the predictive power of an independent variable in relation to the dependent variable, and IV measures the strength of that relationship. Apart from assessing variable importance, WoE can also be used to impute missing values in the data.

Using the RIV package, we create the WoE table for all the variables and determine how important each variable is in predicting whether a customer will default. For analysis purposes, we replace the actual values of all the variables with the corresponding WoE values and store the data in a separate file (e.g. woe_data) for further analysis.

The table below shows the IV ranges and their predictive power.

| Information Value | Predictive Power |
|-------------------|------------------|
| < 0.02 | Useless for prediction |
| 0.02 to 0.1 | Weak predictor |
| 0.1 to 0.3 | Medium predictor |
| 0.3 to 0.5 | Strong predictor |
| > 0.5 | Suspicious or too good to be true |
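The deck uses the RIV package for this step; purely to illustrate the formula behind the table above (not that package's API), a hand-rolled WoE/IV computation might look like the following, with variable and target names taken from later slides:

```r
# WoE/IV for one binned variable; target is 1 = default, 0 = non-default
woe_iv <- function(bins, target) {
  tab  <- table(bins, target)
  good <- tab[, "0"] / sum(tab[, "0"])   # share of non-defaulters per bin
  bad  <- tab[, "1"] / sum(tab[, "1"])   # share of defaulters per bin
  woe  <- log(good / bad)                # WoE = ln(%good / %bad)
  iv   <- sum((good - bad) * woe)        # IV  = sum((%good - %bad) * WoE)
  list(woe = woe, iv = iv)
}

# Example: decile-bin Avg CC Utilization and compute its IV
brks <- unique(quantile(bankfile$Avg.CC.Utl.12M,
                        probs = seq(0, 1, 0.1), na.rm = TRUE))
bins <- cut(bankfile$Avg.CC.Utl.12M, breaks = brks, include.lowest = TRUE)
woe_iv(bins, bankfile$Performance.Tag)$iv
```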
Exploratory Data Analysis
WOE Analysis

Most Significant Variables (from the IV plot):

| Variable | IV |
|----------|-----|
| Avg.CC.Utl.12M | 0.3116 |
| Trades.12M | 0.2989 |
| PL.Trades.12M | 0.2971 |
| Inq.12M.Excl.Home.Auto.Loan | 0.2961 |
| Outstanding.Balance | 0.2456 |
| X30DPD.6M | 0.2417 |
| Total.No.of.Trades | 0.2374 |
| PL.Trades.6M | 0.2199 |
| X90DPD.12M | 0.2139 |
| X60DPD.6M | 0.2058 |
| Inq.6M.Excl.Home.Auto.Loan | 0.2048 |
Model Building
For this project, we use Logistic Regression, Decision Tree & Random Forest methods to build our predictive models.

Firstly, to understand how an applicant's demographic details affect their creditworthiness, we build a Logistic Regression model on the Demographics dataset alone. From this, we can observe how variables such as age, gender, income, profession, etc. determine an applicant's probability of default.

We then merge both files on the basis of Application ID. While merging, only rows that exist in both files are considered, leaving a total of 69488 valid rows in the final dataset; this file is named Bankfile. The final model is built on this dataset. For comparison, we also replace the actual values of all the variables with the corresponding WoE values, store the result in a separate file, and build a model on the WoE-replaced data to assess its predictive power.

For the purpose of building a predictive model, we split the dataset into training and test data in the ratio 70:30. We start by building a model on the entire set of variables, then use the stepAIC function on this first model to remove the insignificant variables. The model obtained after stepAIC is further iterated on based on p-values and VIF values while keeping a check on the AIC. This model is then evaluated on the test dataset and fine-tuned to get the best predictive model.

From all the models built using these different methods, we choose the one giving the best combination of Accuracy, Sensitivity and Specificity.

The final underwriting model will contain only the most significant variables that are not correlated with other variables in the dataset.
Model Building
Process Involved in Model Building

1.) Data Cleansing and Outlier Treatment
2.) Scaling of Data and Creation of Dummy Variables
3.) Data Split: the data is split into train and test datasets in the ratio 70:30.
4.) Data Sampling: since the default class is rare and the data highly imbalanced, we use sampling techniques to balance the mix of data. For this project, we use the ROSE package for sampling.
5.) The stepAIC function is used on the first model built to remove the insignificant variables.
6.) Correlation: the model obtained after stepAIC is further iterated on based on p-values and VIF values while keeping a check on the AIC. This removes further insignificant and correlated variables.
7.) The model is run on the test dataset and its predictive power is evaluated.
8.) The optimal cutoff value for the predicted probability is found to fine-tune the Accuracy, Sensitivity and Specificity of the model. (A sketch of steps 3 to 8 in R follows this list.)
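A condensed sketch of steps 3 to 8, assuming the merged data frame bankfile and a 0/1 Performance.Tag column; the helper packages (caret, ROSE, MASS, car) are the ones named or implied above:

```r
library(caret)   # data split and confusion matrix
library(ROSE)    # sampling for the imbalanced target
library(MASS)    # stepAIC
library(car)     # vif

set.seed(100)
idx   <- createDataPartition(bankfile$Performance.Tag, p = 0.7, list = FALSE)
train <- bankfile[idx, ]
test  <- bankfile[-idx, ]

# Step 4: rebalance the rare-default training data with ROSE
train_bal <- ROSE(Performance.Tag ~ ., data = train, seed = 100)$data

# Steps 5-6: full model, stepAIC, then manual pruning on p-values / VIF
model_full <- glm(Performance.Tag ~ ., data = train_bal, family = "binomial")
model_step <- stepAIC(model_full, direction = "both")
vif(model_step)   # drop high-VIF / high-p variables iteratively

# Steps 7-8: predicted probabilities on test data; cutoff tuned afterwards
pred <- predict(model_step, newdata = test, type = "response")
```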
Model Evaluation
Confusion Matrix and Statistics
Model Type: Logistic Regression Model on Demographics Data Alone

Variables in the Final Model:
1. Income
2. Residence.Stability
3. Office.Stability
4. Profession.xSE

|                 | Reference: No | Reference: Yes |
|-----------------|---------------|----------------|
| Prediction: No  | 10808         | 393            |
| Prediction: Yes | 9157          | 488            |

Accuracy: 0.5419
Sensitivity: 0.5413
Specificity: 0.5539
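The confusion matrices on these slides can be reproduced with caret's confusionMatrix at the chosen probability cutoff; a sketch (note caret treats the first factor level, here "No", as the positive class by default):

```r
library(caret)

cutoff     <- 0.512   # e.g. the cutoff reported for the merged-data model
pred_class <- factor(ifelse(pred >= cutoff, "Yes", "No"),
                     levels = c("No", "Yes"))
actual     <- factor(ifelse(test$Performance.Tag == 1, "Yes", "No"),
                     levels = c("No", "Yes"))
confusionMatrix(pred_class, actual)
```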
Model Evaluation
Confusion Matrix and Statistics
Model Type: Logistic Regression Model on Merged Dataset

Variables in the Final Model:
1. Avg.CC.Utl.12M
2. X30DPD.6M
3. X90DPD.12M
4. X60DPD.12M
5. X30DPD.12M
6. PL.Trades.6M
7. PL.Trades.12M
8. Inq.12M.Excl.Home.Auto.Loan
9. Home.Loan

|                 | Reference: No | Reference: Yes |
|-----------------|---------------|----------------|
| Prediction: No  | 12494         | 328            |
| Prediction: Yes | 7471          | 553            |

Cut-Off Value: 0.512
Accuracy: 62.59%
Sensitivity: 62.77%
Specificity: 62.58%
Model Evaluation
Confusion Matrix and Statistics
Model Type: Logistic Regression Model on Merged Dataset with WOE Data

Variables in the Final Model:
1. Avg.CC.Utl.12M
2. X30DPD.6M
3. X90DPD.12M
4. X60DPD.12M
5. X30DPD.12M
6. PL.Trades.6M
7. PL.Trades.12M
8. Inq.12M.Excl.Home.Auto.Loan
9. Home.Loan

|                 | Reference: No | Reference: Yes |
|-----------------|---------------|----------------|
| Prediction: No  | 12447         | 322            |
| Prediction: Yes | 7518          | 559            |

Cut-Off Value: 0.522
Accuracy: 62.29%
Sensitivity: 63.45%
Specificity: 62.34%
Model Evaluation
Confusion Matrix and Statistics
Model Type: Decision Tree Model on Merged Dataset

Variables in the Final Model:
1. X30DPD.6M
2. X90DPD.12M

|                 | Reference: No | Reference: Yes |
|-----------------|---------------|----------------|
| Prediction: No  | 13880         | 394            |
| Prediction: Yes | 6085          | 487            |

Cut-Off Value: 0.0025
Accuracy: 68.92%
Sensitivity: 55.28%
Specificity: 69.52%
Model Evaluation
Confusion Matrix and Statistics
Model Type: Random Forest Model on Merged Dataset

|                 | Reference: No | Reference: Yes |
|-----------------|---------------|----------------|
| Prediction: No  | 12454         | 342            |
| Prediction: Yes | 7511          | 539            |

Cut-Off Value: 0.43
Accuracy: 62.33%
Sensitivity: 61.18%
Specificity: 62.38%
Model Evaluation
Comparison of Various Models and Choice of Final Model

| Model Type | Dataset Type | Accuracy | Sensitivity | Specificity |
|------------|--------------|----------|-------------|-------------|
| Logistic Regression | Demographics Data | 54.19% | 54.13% | 55.39% |
| Logistic Regression | Merged Data | 62.59% | 62.77% | 62.58% |
| Logistic Regression | Merged Data with WOE Binning | 62.29% | 63.45% | 62.34% |
| Decision Tree | Merged Data | 68.92% | 55.28% | 69.52% |
| Random Forest | Merged Data | 62.33% | 61.18% | 62.38% |
Model Evaluation
Comparison of Various Models and Choice of Final Model

Based on the evaluation of the various models we built, we have chosen the Logistic Regression model on WOE-binned data as our final model for building CredX's underwriting scorecard.

To further analyse the performance of this model, we ran the final model on the rejected applicants' data.

Based on this data, 52% of the applicants who were rejected by CredX were also rejected by the new underwriting model.
CredX Credit Card Scorecard

CredX's new credit card underwriting scorecard was built on the Logistic Regression model trained on WOE-binned data.

The following process was followed for the scorecard creation (a worked calibration follows the list):

1.) The good-to-bad odds are 10 to 1 at a score of 400.
2.) The odds double every 20 points (PDO = 20).
3.) The scorecard range was calculated and found to be between 306 and 367.
4.) The cutoff score was calculated on the basis of the probability cutoff for the final model (0.522).
5.) The cutoff score for the CredX scorecard came to 331.
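As a check, the standard points-to-double-the-odds calibration reproduces the reported cutoff from these two anchor conditions:

```r
# PDO scorecard calibration: score 400 at 10:1 good:bad odds, doubling every 20
pdo        <- 20
pdo_factor <- pdo / log(2)                 # ~28.85 points per doubling of odds
offset     <- 400 - pdo_factor * log(10)   # ~333.56

score <- function(p_default) {
  odds <- (1 - p_default) / p_default      # good:bad odds from P(default)
  offset + pdo_factor * log(odds)
}

score(0.522)   # ~331, matching the reported cutoff score
```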
CredX Credit Card Scorecard

Scorecard Evaluation on Full Dataset

• The scorecard was run on the rejected applicants, and their scores ranged between 288 and 483.
• 674 of the 1425 rejected applicants cleared the cutoff as per the new scorecard; this is 47.30% of the total rejected applicants.
• The final cutoff score was found to be 331.
Financial Benefit of Project

• Total loss due to approval of defaulters after the model is applied to the dataset = 1092 x 1,00,000 = 10,92,00,000.
• The loss without any model was 29,37,00,000, so the model reduces the loss by 18,45,00,000/-.
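The arithmetic behind these figures, as a quick R check:

```r
loss_per_default    <- 100000     # average credit loss per approved defaulter
defaults_with_model <- 1092
loss_with_model     <- defaults_with_model * loss_per_default  # 10,92,00,000
loss_without_model  <- 293700000                               # 29,37,00,000
loss_without_model - loss_with_model                           # 18,45,00,000
```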
Thank You
