Credit EDA Assignment PDF

This document summarizes an analysis of loan application data to help a consumer finance company identify patterns to better determine which applicants are likely to repay loans. The analysis includes identifying missing data and outliers, assessing data imbalance, and univariate, bivariate and segmented univariate analysis. Key findings are that over 92% of applicants did not default, most defaulters are male with medium income and education levels, and that repeat clients who have not previously defaulted are less likely to default. The analysis aims to help the company approve more qualified applicants and avoid losses from approving high-risk applicants unlikely to repay loans.

Uploaded by

Alisha Anand

Credit EDA Assignment

BY PANKAJ KUMAR
Problem Statement I

• Lending companies find it hard to give loans to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data, so that applicants capable of repaying the loan are not rejected.
• When the company receives a loan application, it has to decide whether to approve the loan based on the applicant's profile. Two types of risk are associated with this decision:
  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
  • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.
Problem Statement II

• Present the overall approach of the analysis in a presentation. Mention the problem statement and the analysis approach briefly.
• Identify the missing data and use an appropriate method to deal with it (remove columns or replace missing values with an appropriate value).
  • Hint: in EDA it is not necessary to replace missing values, but if you do have to replace them, clearly state the approach you would take.
• Identify whether there are outliers in the dataset, and explain why you consider each one an outlier. Again, remember that for this exercise it is not necessary to remove any data points.
• Identify whether there is data imbalance in the data, and find the imbalance ratio.
• Explain the results of the univariate, segmented univariate, bivariate analysis, etc. in business terms.
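The deck does not say which outlier rule was applied, so the following is only a minimal sketch of one common choice, the 1.5 x IQR rule. The column name `AMT_INCOME_TOTAL` and the values are assumptions made for illustration. In line with the assignment, it flags points rather than removing them.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up income values; the column name is an assumption, not taken from the deck.
income = pd.Series(
    [90_000, 112_500, 135_000, 157_500, 180_000, 202_500, 9_000_000],
    name="AMT_INCOME_TOTAL",
)

mask = iqr_outlier_mask(income)
outliers = income[mask]  # flagged for inspection only, nothing is dropped
print(outliers)
```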
Assumptions and steps taken

• From both data sets, i.e. application_data.csv and previous_application_data.csv, columns with more than 40% missing values are dropped.
• In application_data.csv, the steps below are taken for categorical variables:
  • NAME_TYPE_SUITE is imputed with the mode() value where it was NaN.
  • CODE_GENDER is imputed with mode() where it was "XNA".
  • ORGANIZATION_TYPE is imputed with mode() where it was "XNA".
  • OCCUPATION_TYPE has many missing values. It is left as it is, and no records are dropped.
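The steps listed above could look roughly like this in pandas. This is a sketch on a made-up frame, not the author's code: the `MOSTLY_EMPTY` column is hypothetical and exists only to trigger the 40% rule.

```python
import numpy as np
import pandas as pd

# Toy stand-in for application_data.csv; every value here is made up.
app = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "XNA", "F"],
    "NAME_TYPE_SUITE": ["Unaccompanied", np.nan, "Family", "Unaccompanied", "Unaccompanied"],
    "OCCUPATION_TYPE": ["Laborers", np.nan, "Core staff", "Laborers", "Managers"],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, 1.0, 2.0],  # hypothetical column, 60% missing
})

# 1. Drop columns whose share of missing values exceeds 40%.
missing_share = app.isna().mean()
app = app.loc[:, missing_share <= 0.40].copy()

# 2. Impute NaN with the most frequent value (mode).
app["NAME_TYPE_SUITE"] = app["NAME_TYPE_SUITE"].fillna(app["NAME_TYPE_SUITE"].mode()[0])

# 3. Treat the "XNA" placeholder as missing, then fill it with the mode
#    of the remaining real values.
gender = app["CODE_GENDER"].mask(app["CODE_GENDER"] == "XNA")
app["CODE_GENDER"] = gender.fillna(gender.mode()[0])

# OCCUPATION_TYPE is deliberately left untouched, as in the deck.
print(app)
```

Masking "XNA" before taking the mode keeps the placeholder itself from being picked as the most frequent value.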
Assumptions and steps taken continued
• In previous_application_data.csv, the steps below are taken for categorical variables:
  • NAME_CONTRACT_TYPE is imputed with the mode where it was "XNA".
  • NAME_CLIENT_TYPE is imputed with the mode where it was "XNA".
  • NAME_CASH_LOAN_PURPOSE: the majority of records have the value "XNA" or "XNP". No steps taken for these.
  • NAME_PAYMENT_TYPE: the majority of records have the value "XNA". No steps taken for these.
  • CODE_REJECT_REASON: the majority of records have the value "XNA" or "XNP". No steps taken for these.
  • NAME_GOODS_CATEGORY: the majority of records have the value "XNA". No steps taken for these.
  • NAME_PORTFOLIO: the majority of records have the value "XNA". No steps taken for these.
  • NAME_PRODUCT_TYPE: the majority of records have the value "XNA". No steps taken for these.
  • NAME_SELLER_INDUSTRY: the majority of records have the value "XNA". No steps taken for these.
  • NAME_YIELD_GROUP: the majority of records have the value "XNA". No steps taken for these.
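The rule implied by the list above is that a column is imputed only when the placeholder is a minority value, and left alone when placeholders make up the majority. A hedged sketch of that decision follows; the 50% cut-off is an assumption read from the word "majority", and the data is made up.

```python
import pandas as pd

PLACEHOLDERS = ["XNA", "XNP"]  # placeholder codes as named in the deck

# Toy stand-in for previous_application_data.csv; values are made up.
prev = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Consumer loans", "XNA", "Cash loans"],
    "NAME_CASH_LOAN_PURPOSE": ["XNA", "XNP", "XNA", "Repairs"],
    "NAME_PAYMENT_TYPE": ["XNA", "XNA", "XNA", "Cash through the bank"],
})

# Share of placeholder values per column.
placeholder_share = prev.isin(PLACEHOLDERS).mean()

# Columns where placeholders are the majority are left as they are (assumed cut-off: 50%).
majority_placeholder = placeholder_share[placeholder_share > 0.5].index.tolist()

# Remaining columns: replace placeholders with the mode of the real values.
for col in prev.columns.difference(majority_placeholder):
    is_placeholder = prev[col].isin(PLACEHOLDERS)
    mode_value = prev.loc[~is_placeholder, col].mode()[0]
    prev[col] = prev[col].where(~is_placeholder, mode_value)

print(placeholder_share)
print(prev)
```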
Data imbalance in the data

• It's clear that there is an imbalance between applicants who defaulted and those who didn't. More than 92% of applicants didn't default, as opposed to about 8% who did.
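The imbalance ratio asked for in the problem statement can be computed directly from the target column. The sketch below assumes the column is called `TARGET` with 1 meaning default, and uses a toy series built to mirror the approximate 92/8 split reported above rather than the real data.

```python
import pandas as pd

# Toy series mirroring the reported split: 92 non-defaulters (0), 8 defaulters (1).
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

counts = target.value_counts()
share = target.value_counts(normalize=True)

# Ratio of non-defaulters to defaulters.
imbalance_ratio = counts[0] / counts[1]

print(share)
print(f"Imbalance ratio: {imbalance_ratio:.1f} : 1")
```

With a 92/8 split this comes to 11.5 non-defaulters for every defaulter.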
Figure: Income Range Distribution for Male and Female
Figure: Income Type Distribution for Male and Female
Figure: Distribution of Family Status of Defaulters
Figure: Defaulters and Non-Defaulters based on Age Group
Figure: Defaulters and Non-Defaulters based on Income
Figure: Defaulters and Non-Defaulters Ratio of Male and Female
Figure: Defaulters and Non-Defaulters based on Education
Figure: Distribution of Organization Type for Non-Defaulters
Figure: Number of Applicants with respect to Family Member Count
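The segmented charts listed above compare defaulters and non-defaulters within a category. One way to tabulate the same comparison is sketched below, assuming a `TARGET` column where 1 means default; the rows are made up. Showing the count and the rate side by side matters because the group with the most defaulters is not necessarily the group with the highest default rate.

```python
import pandas as pd

# Made-up rows; TARGET = 1 is assumed to mean the applicant defaulted.
app = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F", "F", "F"],
    "TARGET":      [1,   0,   0,   0,   1,   0,   0,   0,   0,   0],
})

# Per-segment volume, number of defaulters, and default rate.
summary = app.groupby("CODE_GENDER")["TARGET"].agg(
    applicants="count",
    defaulters="sum",
    default_rate="mean",
)
print(summary)
```

In this toy frame both groups contain one defaulter, yet the rate is 0.25 for M and about 0.17 for F.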
Bivariate Analysis
Figure: Education vs Income
Figure: Credit Amount vs Education for Target 0
Figure: Top 10 Correlations for Target Value 0
Figure: Top 10 Correlations for Target Value 1
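A top-10 correlation list per target value can be produced by taking the absolute correlation matrix, keeping each pair once, and sorting. The helper `top_correlations` below is a hypothetical name, and the data is made up, so this is a sketch of the technique rather than the code behind the slides.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n strongest absolute pairwise correlations among numeric columns."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and self-correlations are excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Made-up rows: five with TARGET 0 and three with TARGET 1.
app = pd.DataFrame({
    "AMT_CREDIT":      [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0],
    "AMT_GOODS_PRICE": [90.0, 180.0, 270.0, 360.0, 450.0, 500.0, 650.0, 700.0],
    "AMT_ANNUITY":     [10.0, 22.0, 28.0, 41.0, 50.0, 70.0, 60.0, 90.0],
    "CNT_CHILDREN":    [2, 0, 1, 0, 1, 0, 2, 1],
    "TARGET":          [0, 0, 0, 0, 0, 1, 1, 1],
})

top0 = top_correlations(app[app["TARGET"] == 0].drop(columns="TARGET"))
top1 = top_correlations(app[app["TARGET"] == 1].drop(columns="TARGET"))
print(top0)
print(top1)
```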
Previous Application Data: Univariate Analysis (figures, four slides)
Previous Application Data: Bivariate Analysis (figures, two slides)
Figure: Top 10 Correlations in Previous Application Data
Figure: Heat Map for Previous Application Correlation
Merged data Analysis
Ignoring XNA and XNP, we can observe the points below from the graph above:

1. Repairs has the highest volume of loans Approved as well as loans Refused.
2. The Other category has the most Unused offer loan statuses.
3. Where the purpose is "Refusal to name the goal", the bank has refused more applications than it has approved.
4. Paying other loans and Buying a new car have significantly more refusals than approvals.
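The counts behind observations like these can be sketched as a merge followed by a cross-tabulation. The join key `SK_ID_CURR` and the status column `NAME_CONTRACT_STATUS` are assumptions (the deck does not name them), and all rows are made up.

```python
import pandas as pd

# Toy frames; SK_ID_CURR and NAME_CONTRACT_STATUS are assumed column names.
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 3, 3],
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "XNA", "Repairs", "Buying a new car", "XNP", "Repairs"],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Refused", "Refused", "Canceled", "Approved"],
})

# One row per previous application, carrying the current application's TARGET.
merged = app.merge(prev, on="SK_ID_CURR", how="inner")

# Ignore the placeholder purposes, as the observations above do.
named = merged[~merged["NAME_CASH_LOAN_PURPOSE"].isin(["XNA", "XNP"])]

# Count of previous applications per purpose and status.
status_by_purpose = pd.crosstab(
    named["NAME_CASH_LOAN_PURPOSE"], named["NAME_CONTRACT_STATUS"]
)
print(status_by_purpose)
```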
Points to observe from the graph below:
1. Income groups Medium and High have almost the same ratio of Approved, Canceled, Refused and Unused offer loan statuses.
2. Income group VeryLow has the same ratio of Canceled and Refused loan statuses.
3. Income group VeryLow has the highest Approved to Unused offer ratio.
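Comparing status ratios across income groups requires binning income first and then normalising each row. The deck does not give the bin edges, so the edges below are hypothetical, as are the column names `AMT_INCOME_TOTAL` and `NAME_CONTRACT_STATUS` and every value in the frame.

```python
import pandas as pd

# Made-up merged rows; column names are assumptions.
merged = pd.DataFrame({
    "AMT_INCOME_TOTAL": [40_000, 45_000, 70_000, 80_000, 120_000, 130_000, 300_000, 320_000],
    "NAME_CONTRACT_STATUS": ["Canceled", "Refused", "Approved", "Approved",
                             "Approved", "Refused", "Approved", "Refused"],
})

# Hypothetical bin edges; the deck does not state the real ones.
bins = [0, 50_000, 100_000, 200_000, float("inf")]
labels = ["VeryLow", "Low", "Medium", "High"]
merged["INCOME_GROUP"] = pd.cut(merged["AMT_INCOME_TOTAL"], bins=bins, labels=labels)

# Row-normalised table: each income group's statuses sum to 1.
status_share = pd.crosstab(
    merged["INCOME_GROUP"], merged["NAME_CONTRACT_STATUS"], normalize="index"
)
print(status_share)
```

Normalising by row makes groups of different sizes comparable, which is what a statement like "Medium and High have almost the same ratio" relies on.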
Figure: Impact on Loan Status of Owning a Car or Not in Merged Data
Figure: Loan Status by Age Group in Merged Data
Figure: Client Type for Defaulters and Non-Defaulters
Figure: Defaulters vs Non-Defaulters for INCOME_TYPE in Merged Data
Figure: Relationship between Goods Price and AMT_CREDIT for Target 0
Figure: AMT_CREDIT vs AMT_ANNUITY for Merged Data Target 0
Conclusion
• The bank should focus on all age groups.
• The bank should focus on all education levels.
• The bank gets most of its business from INCOME_TYPE Working, but at the same time this group contributes the most defaulters.
• The channel "Credit and cash offices" has acquired the highest volume of clients.
• From the previous application data we observed that Repeater clients are the largest client type, so the bank should keep targeting its non-defaulting clients.
• The bank has more female clients than male clients.
• Most applicants have 3 family members.
• So the bank should focus on clients of any age and education level who have a High or Medium income range and 3 family members.
Thank You !!!
