Exploratory Data Analysis

Uploaded by

farhashaikh.3535

Exploratory Data Analysis for Loan Default Prediction

Project Objective

The main objective of this project is to analyze loan application data and identify patterns that
influence loan default. By understanding these patterns, the company can minimize financial losses
by making informed decisions about loan approvals.

Key Challenges:

1. If the applicant can repay the loan but is not approved, the company loses business.

2. If the applicant cannot repay the loan and is approved, the company faces financial loss.

Data Analysis Tasks

A. Missing Data Analysis

1. Identify Missing Data:

o Checked for missing values in all columns.

o Visualized the proportion of missing data using a bar chart.

2. Action Plan for Missing Data:

o Columns with more than 50% missing values were dropped.

o Columns with 10-50% missing values were imputed using median values.

o Columns with less than 10% missing values were imputed using mode or targeted
methods.

Graph: A bar chart showing the percentage of missing data for each column was created.
• Code Purpose:

o Visualizes the percentage of missing data for each column in the dataset.

• Expected Result:

o A horizontal bar chart showing the percentage of missing values for each column, sorted in
descending order. Columns with the highest percentage of missing data appear at the top.
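
The action plan above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three columns and their values are hypothetical toy data, `None` stands in for a missing cell, and the helper names `missing_pct` and `plan_action` are invented for this sketch. The thresholds are the ones listed in the action plan.

```python
from statistics import median, mode

# Hypothetical toy data (12 rows); None marks a missing value.
columns = {
    "AMT_INCOME_TOTAL": [135000, 145800, None, 90000, 202500, 135000,
                         112500, 157500, 135000, 180000, 67500, 225000],
    "AMT_ANNUITY": [24700.5, None, 6750.0, 29686.5, None, 27517.5,
                    41301.0, None, 33826.5, 20250.0, 9000.0, 16456.5],
    "OWN_CAR_AGE": [None, None, None, 26.0, None, None,
                    17.0, None, None, 8.0, None, 12.0],
}


def missing_pct(values):
    """Percentage of entries that are missing (None)."""
    return 100 * sum(v is None for v in values) / len(values)


def plan_action(pct):
    """Map a missing-value percentage to the action plan's three tiers."""
    if pct > 50:
        return "drop"
    if pct >= 10:
        return "impute median"
    return "impute mode"  # a no-op when nothing is missing


# Sort by missing percentage, highest first (the order used in the bar chart).
report = sorted(((name, missing_pct(vals)) for name, vals in columns.items()),
                key=lambda item: item[1], reverse=True)

cleaned = {}
for name, pct in report:
    action = plan_action(pct)
    print(f"{name:<18}{pct:5.1f}%  {action}")
    if action == "drop":
        continue
    present = [v for v in columns[name] if v is not None]
    fill = median(present) if action == "impute median" else mode(present)
    cleaned[name] = [fill if v is None else v for v in columns[name]]
```

Printing the report before imputing keeps the decision for each column visible, which is the same role the bar chart plays in the report.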

B. Outlier Detection

1. Identify Outliers:

o Used the Interquartile Range (IQR) method to detect outliers in numerical columns.

o Visualized outliers using box plots for key numerical variables (e.g., income, loan
amount).

2. Action Plan for Outliers:

o Analyzed whether outliers were genuine data points or anomalies.

o Retained outliers that represented valid extreme cases; removed anomalies.

Graph: Box plots of numerical variables were used to highlight outliers.

• Code Purpose:

o Detects and visualizes outliers in the AMT_CREDIT column using a box plot.

• Expected Result:

o A box plot where:

▪ The box spans the interquartile range (IQR), from the first quartile (Q1) to the third quartile (Q3).

▪ Whiskers extend to the most extreme data points within 1.5 times the IQR below Q1 and above Q3.

▪ Any points outside the whiskers are considered outliers.
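
A minimal sketch of the IQR rule behind the box plot, assuming a small hypothetical list of AMT_CREDIT values. Quartile conventions differ between tools, so the exact fences may not match a spreadsheet's; `method="inclusive"` is one of the two conventions offered by `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical AMT_CREDIT values; the last one is deliberately extreme.
amt_credit = [45000, 135000, 270000, 450000, 450000, 514777.5, 675000,
              808650, 900000, 1125000, 4050000]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = quantiles(amt_credit, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the box is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in amt_credit if v < lower_fence or v > upper_fence]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Flagging a value this way only marks it for review; whether it is a valid extreme case or an anomaly still has to be decided as described in the action plan.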

C. Data Imbalance Analysis

1. Identify Data Imbalance:

o Analyzed the distribution of the target variable (TARGET), which indicates loan
default.

o Calculated the ratio of defaulted loans to non-defaulted loans.

2. Action Plan for Data Imbalance:

o Highlighted the imbalance in the dataset.

o Suggested techniques such as oversampling or undersampling for modeling.

Graph: A pie chart displaying the proportion of defaulted and non-defaulted loans was created.
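
The imbalance check and one of the suggested remedies can be sketched as follows. The TARGET values are hypothetical (8 defaults in 100 loans), and random undersampling is shown only as one simple option, not as the method used in this report.

```python
import random
from collections import Counter

# Hypothetical TARGET column: 1 = default, 0 = non-default.
target = [0] * 92 + [1] * 8

counts = Counter(target)
default_share = counts[1] / len(target)
imbalance_ratio = counts[0] / counts[1]  # non-defaults per default
print(f"default share: {default_share:.1%}, ratio: {imbalance_ratio:.1f} to 1")

# Random undersampling: keep every default and an equally sized random
# sample of non-defaults. A fixed seed makes the sketch repeatable.
rng = random.Random(42)
majority_idx = [i for i, t in enumerate(target) if t == 0]
minority_idx = [i for i, t in enumerate(target) if t == 1]
kept_idx = rng.sample(majority_idx, k=len(minority_idx)) + minority_idx
balanced = [target[i] for i in kept_idx]
print(Counter(balanced))
```

Undersampling discards majority-class rows, so on a small dataset oversampling the minority class may be the better trade-off.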
D. Univariate, Segmented Univariate, and Bivariate Analysis

1. Univariate Analysis:

o Analyzed individual variables to understand their distribution.

o For numerical variables, used histograms to examine distributions (e.g., income
levels, loan amounts).

o For categorical variables, used bar charts (e.g., loan purposes).

2. Segmented Univariate Analysis:

o Compared distributions of variables for different scenarios (e.g., defaulted vs.
non-defaulted loans).

o Used pivot tables to calculate averages or medians within segments.

3. Bivariate Analysis:

o Explored relationships between variables (e.g., income vs. loan default rate).

o Used scatter plots and cross-tabulations to highlight correlations.

Graphs:

• Histograms and bar charts for univariate analysis.

• Stacked bar charts and scatter plots for segmented and bivariate analysis.

• Code Purpose:

o Explores the distribution of AMT_INCOME_TOTAL to understand its central tendency, spread,
and skewness.

• Expected Result:

o A histogram showing the frequency of income levels. The plot might reveal a skewed
distribution if there are extremely high-income individuals.

• Code Purpose:

o Visualizes the proportion of loan default (TARGET=1) versus non-default (TARGET=0).

• Expected Result:

o A pie chart showing the percentage distribution between the two classes (Non-Default and
Default). For imbalanced data, one class (likely Non-Default) will dominate.

This analysis will guide the company in making informed decisions to minimize financial risks while
optimizing loan approvals. For the visualizations, Excel charts and conditional formatting features
were used to enhance clarity and presentation.

AMT_INCOME_TOTAL
Univariate Analysis

Statistic   Value       Note
Count       49999       no missing value
Median      145800      middle income
Mode        135000      most frequent income value
Maximum     1.17E+08    an extreme high outlier
Minimum     25650       the lowest income
Std Dev     531813.8    high variability in income

The histogram of AMT_INCOME_TOTAL shows a right-skewed distribution.

Interpretation:

1. Right-Skewed Distribution:

o The histogram is highly concentrated at the lower end (close to 0), indicating that
most people in the dataset have lower incomes, while only a small number of
individuals have very high incomes.

o This is typical in income data, where there are many individuals earning below
average and fewer individuals with very high income.

2. Peak at Lower Values:

o The significant frequency at low-income levels (e.g., near 30,000) suggests that the
majority of the dataset lies within these ranges. This is important when considering
statistical models, as the majority of the data points will be clustered in the lower
ranges.

3. Long Tail:

o The long tail on the right side shows the presence of a small number of individuals
with very high incomes. This could indicate outliers or rare cases of extremely high
earnings.

What can be done with this distribution:

• Log Transformation:

o Because of the right skew, consider log-transforming the data to make it more
symmetric for further statistical analysis. Linear regression, for example, assumes
normally distributed residuals, and a heavily skewed variable can make that
assumption harder to satisfy.

• Handling Outliers:

o The outliers in the higher income range (near 5,000,000) may need to be
investigated further. Depending on the context, these could be genuine extreme
values, data-entry errors, or very rare cases. You may want to handle these outliers before
proceeding to other types of analysis.
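
To illustrate the effect of a log transformation on a right-skewed variable, the sketch below compares the skewness of a hypothetical income sample before and after taking natural logarithms. The income values and the `skewness` helper are assumptions made for this example; the helper computes the population (moment) skewness, which is one of several common definitions.

```python
import math
from statistics import fmean, pstdev


def skewness(values):
    """Population (moment) skewness: the mean of the cubed z-scores."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean([((v - mu) / sigma) ** 3 for v in values])


# Hypothetical incomes: most are modest, one is very large.
incomes = [25650, 67500, 90000, 112500, 135000, 135000,
           145800, 180000, 225000, 450000, 5000000]

# The log is defined here because every income is strictly positive.
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A positive skewness indicates a long right tail; on this sample the log-transformed values are still skewed, but less so than the raw incomes.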
Graph: Histogram of AMT_INCOME_TOTAL (x-axis: Bin; y-axis: Frequency).

AMT_CREDIT
Univariate Analysis

Statistic   Value       Note
Count       49999       no missing value
Median      514777.5    middle credit amount
Mode        450000      most frequent credit amount
Maximum     4050000     highest loan amount
Minimum     45000       lowest loan amount
Std Dev     402411.4    moderate variability

Graph: Histogram of AMT_CREDIT (x-axis: Bin; y-axis: Frequency).

AMT_ANNUITY
Univariate Analysis

Statistic   Value        Note
Count       49998        1 missing value
Median      24939        middle annuity amount
Mode        9000         most frequent annuity value
Maximum     258025.5     highest annuity value
Minimum     2052         lowest annuity value
Std Dev     14562.7988   high variability

Graph: Histogram of AMT_ANNUITY (x-axis: Bin, from 0 to 198000 in steps of 22000; y-axis: Frequency).

Bivariate Analysis
Target vs AMT_INCOME_TOTAL
Graph: Bar chart "Average Income By Default Status" (x-axis: Default Status, with bars for
Target = 1 and Target = 0; y-axis: Average Income).

Interpretation

1. Higher Average Income for Defaulters:

o The average income for defaulters (TARGET=1) is visibly higher than that of
non-defaulters (TARGET=0).

o This is counterintuitive since higher income is often expected to correlate
with lower default rates.

2. Potential Implications:

o Overextension of Credit: Higher-income individuals may default due to
overestimating their ability to repay larger loans.

o External Factors: High-income defaulters might face external economic
pressures (e.g., business losses, unexpected expenses) that are unrelated to
income levels.

o Data Skewness: This could also indicate data outliers or skewness within the
income variable that requires further investigation.
Analysis
1. Check Distribution:

o Consider using a box plot to explore whether outliers in the TARGET=1 group
are influencing the result.

o Look for significant spread or extreme values in both groups.

2. Further Investigation:

o Perform segmented analysis, breaking income into ranges (e.g., low, medium,
high) to see default rates within each income group.

o Examine additional variables (e.g., loan amount, credit duration) to identify
other contributing factors.
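
The two checks above can be sketched together: default rates per income band, and a mean-versus-median comparison to see whether a few extreme incomes drive the higher average for defaulters. The records and the band cut-offs (100,000 and 200,000) are hypothetical and chosen only for illustration.

```python
from statistics import fmean, median

# Hypothetical (AMT_INCOME_TOTAL, TARGET) records; TARGET 1 = default.
records = [(45000, 1), (67500, 0), (90000, 0), (99000, 1),
           (112500, 0), (135000, 0), (145800, 0), (180000, 1),
           (202500, 0), (225000, 0), (450000, 0), (1170000, 1)]


def income_band(income):
    """Assign an income to a band; the cut-offs are illustrative only."""
    if income < 100_000:
        return "low"
    if income < 200_000:
        return "medium"
    return "high"


# Default rate per band: the mean of the 0/1 TARGET flags in each band.
flags_by_band = {}
for income, target in records:
    flags_by_band.setdefault(income_band(income), []).append(target)
default_rate = {band: fmean(flags) for band, flags in flags_by_band.items()}

# Mean vs. median income per TARGET group: the mean is pulled up by
# extreme values, while the median is not.
income_by_target = {0: [], 1: []}
for income, target in records:
    income_by_target[target].append(income)
mean_income = {t: fmean(v) for t, v in income_by_target.items()}
median_income = {t: median(v) for t, v in income_by_target.items()}

print("default rate by band:", default_rate)
print("mean income by TARGET:", mean_income)
print("median income by TARGET:", median_income)
```

In this toy sample a single very high income lifts the defaulters' mean well above the non-defaulters', while the two medians are nearly equal. That pattern would point to outliers rather than a real income effect.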

To analyze AMT_CREDIT (loan amount) versus AMT_INCOME_TOTAL (income), we’ll perform a
bivariate analysis to study their relationship. This analysis can help identify trends, such as whether
higher-income individuals take larger loans or if there’s any clear trend related to loan size and
income.

E. Correlation Analysis

1. Identify Correlations:

o Calculated correlation coefficients for numerical variables using the CORREL function
in Excel.

o Created a correlation matrix to identify top correlations with the target variable.

2. Segmented Correlation Analysis:

o Segmented data into scenarios (e.g., customers with payment difficulties vs. others).

o Highlighted the strongest indicators of default for each segment.

Graph: A heatmap was created to visualize correlations between variables and the target.

• Code Purpose:

o Displays the pairwise correlation of numerical variables in the dataset.

• Expected Result:

o A heatmap where:

▪ Strong correlations are represented by intense colors (red for positive, blue for
negative).

▪ Weak correlations are represented by neutral colors (white or light tones).
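
The numbers behind such a heatmap are pairwise Pearson correlation coefficients, the same quantity Excel's CORREL function returns. Below is a minimal sketch on hypothetical data; the `pearson` helper and all values are assumptions for illustration, and the resulting matrix could then be colored as a heatmap in any charting tool.

```python
from math import sqrt
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical sample; the column names follow the dataset, the values do not.
data = {
    "AMT_INCOME_TOTAL": [90000, 112500, 135000, 157500, 180000, 202500, 225000, 450000],
    "AMT_CREDIT": [270000, 312682.5, 450000, 513000, 640080, 675000, 900000, 1350000],
    "AMT_ANNUITY": [13500, 16456.5, 22050, 24939, 29970, 31653, 38686.5, 52452],
    "TARGET": [1, 0, 0, 1, 0, 0, 0, 0],
}

# Full correlation matrix as a nested dict: matrix[row][column].
matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}

# Rank the features by the absolute strength of their correlation with TARGET.
ranked = sorted((name for name in data if name != "TARGET"),
                key=lambda name: abs(matrix[name]["TARGET"]), reverse=True)

for name in ranked:
    print(f"{name:<18}{matrix[name]['TARGET']:+.3f}")
```

Ranking by absolute value treats strong negative and strong positive correlations alike, which matches the heatmap's use of intense colors for both.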

Key Insights

1. Variables like income, loan amount, and credit history are strong predictors of loan default.

2. Customers with incomplete applications or high debt-to-income ratios are more likely to
default.

3. Data imbalance suggests a need for balancing techniques when building predictive models.

Recommended Actions

1. For High-Risk Applicants:

o Reduce loan amounts or charge higher interest rates.

o Request additional documentation for creditworthiness.

2. For Low-Risk Applicants:

o Streamline the approval process to enhance customer experience.

3. For Modeling:

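o Use balanced datasets (e.g., via oversampling or undersampling) so that the model learns the
minority default class, and judge it on more than raw accuracy.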

o Focus on highly correlated variables for feature selection.

Graphs and Visualizations

1. Bar Chart: Missing data percentages for each column.

2. Box Plots: Outliers in numerical variables.

3. Pie Chart: Proportion of defaulted vs. non-defaulted loans.

4. Histograms: Distribution of numerical variables.

5. Heatmap: Correlation between numerical variables and the target.
