Exploratory Data Analysis for Loan Default Prediction
Project Objective
The main objective of this project is to analyze loan application data and identify patterns that
influence loan default. By understanding these patterns, the company can minimize financial losses
by making informed decisions about loan approvals.
Key Challenges:
1. If the applicant can repay the loan but is not approved, the company loses business.
2. If the applicant cannot repay the loan and is approved, the company faces financial loss.
o Columns with 10-50% missing values were imputed using median values.
o Columns with less than 10% missing values were imputed using mode or targeted
methods.
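The thresholds above can be sketched in plain Python. This is a minimal illustration, not the workflow actually used for the report: the column names and values are hypothetical, the median is assumed to apply to numeric columns, and columns above 50% missing are left untouched because the rule for them is not stated here.

```python
from statistics import median, mode

def impute_column(values):
    """Fill None entries based on the column's percentage of missing values."""
    present = [v for v in values if v is not None]
    missing_pct = 100 * (len(values) - len(present)) / len(values)
    if not present or missing_pct == 0:
        return list(values)
    if missing_pct < 10:
        fill = mode(present)      # under 10% missing: most frequent value
    elif missing_pct <= 50:
        fill = median(present)    # 10-50% missing: median (numeric columns)
    else:
        return list(values)       # above 50%: rule not stated in the report
    return [fill if v is None else v for v in values]

# Hypothetical toy columns, not taken from the dataset
contract_type = ["Cash"] * 9 + ["Revolving", "Revolving", None]  # 1 of 12 missing
annuity = [10.0, None, 30.0, None, 50.0]                         # 2 of 5 missing

print(impute_column(contract_type)[-1])
print(impute_column(annuity))
```

Choosing the fill value from the observed entries only keeps the imputation from being influenced by the gaps themselves.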
Graph: A bar chart showing the percentage of missing data for each column was created.
Code Purpose:
Visualizes the percentage of missing data for each column in the dataset.
Expected Result:
A horizontal bar chart showing the percentage of missing values for each column, sorted in
descending order. Columns with the highest percentage of missing data appear at the top.
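A dependency-free sketch of the calculation behind such a chart is shown below. It assumes each record is a dictionary with None marking a missing value; COLUMN_B and COLUMN_C are hypothetical column names, and the text bars only stand in for the real chart.

```python
def missing_percentages(rows, columns):
    """Return (column, percent_missing) pairs, highest percentage first."""
    total = len(rows)
    pct = {
        col: 100 * sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }
    return sorted(pct.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy records; None marks a missing value
rows = [
    {"AMT_ANNUITY": 24939.0, "COLUMN_B": None, "COLUMN_C": None},
    {"AMT_ANNUITY": None, "COLUMN_B": "x", "COLUMN_C": None},
    {"AMT_ANNUITY": 9000.0, "COLUMN_B": "y", "COLUMN_C": None},
    {"AMT_ANNUITY": 12000.0, "COLUMN_B": None, "COLUMN_C": 0.5},
]

ranking = missing_percentages(rows, ["AMT_ANNUITY", "COLUMN_B", "COLUMN_C"])
for col, pct in ranking:
    # One '#' per 5 percentage points as a crude horizontal bar
    print(f"{col:<12} {'#' * round(pct / 5)} {pct:.0f}%")
```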
B. Outlier Detection
1. Identify Outliers:
o Used the Interquartile Range (IQR) method to detect outliers in numerical columns.
o Visualized outliers using box plots for key numerical variables (e.g., income, loan
amount).
o Detects and visualizes outliers in the AMT_CREDIT column using a box plot.
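The IQR rule can be sketched as follows. This is an illustration under assumptions: the conventional multiplier of 1.5 is used (the report does not state one), quartiles are computed with Python's "inclusive" method, and the AMT_CREDIT values are a hypothetical toy sample.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

# Hypothetical toy sample of credit amounts
amt_credit = [45000, 270000, 450000, 450000, 514777.5, 675000, 900000, 4050000]

outliers, fences = iqr_outliers(amt_credit)
print("fences:", fences)
print("outliers:", outliers)
```

The fences correspond to the whisker limits of a standard box plot, so the values returned here are the ones such a plot would draw as individual points.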
Expected Result:
o Analyzed the distribution of the target variable (TARGET), which indicates loan
default.
Graph: A pie chart displaying the proportion of defaulted and non-defaulted loans was created.
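The proportions behind such a pie chart can be computed as in this minimal sketch. The 92/8 split is a hypothetical example, not a figure from the dataset; TARGET is assumed to be coded 1 for default and 0 otherwise.

```python
from collections import Counter

def class_shares(target):
    """Return the percentage of records in each TARGET class."""
    counts = Counter(target)
    total = len(target)
    return {label: 100 * counts[label] / total for label in sorted(counts)}

# Hypothetical split: 92 non-defaulted loans, 8 defaulted loans
target = [0] * 92 + [1] * 8

shares = class_shares(target)
for label, pct in shares.items():
    name = "Default" if label == 1 else "Non-Default"
    print(f"{name}: {pct:.1f}%")
```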
D. Univariate, Segmented Univariate, and Bivariate Analysis
1. Univariate Analysis:
o Compared distributions of variables for different scenarios (e.g., defaulted vs. non-
defaulted loans).
3. Bivariate Analysis:
o Explored relationships between variables (e.g., income vs. loan default rate).
Graphs:
Stacked bar charts and scatter plots for segmented and bivariate analysis.
Code Purpose:
Plots a histogram of applicant income levels.
Expected Result:
A histogram showing the frequency of income levels. The plot might reveal a skewed
distribution if there are extremely high-income individuals.
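The binning step behind a histogram can be sketched in a few lines. The bin width of 50,000 and the income values are hypothetical choices for illustration; each bin is labelled by its lower edge.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by the bin's lower edge."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical toy incomes with one very high value
incomes = [25650, 90000, 135000, 135000, 145800, 180000, 225000, 1350000]

hist = histogram(incomes, 50000)
for lower_edge, count in hist.items():
    print(f"{lower_edge:>9} {'#' * count}")
```

With a single very high income, most bars cluster at the left and one isolated bar sits far to the right, which is the skewed shape described above.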
Code Purpose:
Plots a pie chart of the TARGET class distribution.
Expected Result:
A pie chart showing the percentage distribution between the two classes (Non-Default and
Default). For imbalanced data, one class (likely Non-Default) will dominate.
This analysis will guide the company in making informed decisions to minimize financial risks while
optimizing loan approvals. For the visualizations, Excel charts and conditional formatting features
were used to enhance clarity and presentation.
AMT_INCOME_TOTAL
Univariate Analysis
Statistic   Value       Note
Count       49999       no missing values
Median      145800      middle income
Mode        135000      most frequent income value
Maximum     1.17E+08    an extreme high outlier
Minimum     25650       the lowest income
Std Dev     531813.8    high variability in income
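Summary statistics of this kind can be reproduced with Python's statistics module, as in the sketch below. The input values are a hypothetical toy sample, and the sample standard deviation is an assumption, since the report does not say which variant was used.

```python
from statistics import median, mode, stdev

def describe(values):
    """Summarize a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "median": median(present),
        "mode": mode(present),
        "max": max(present),
        "min": min(present),
        "std_dev": stdev(present),  # sample standard deviation (assumed)
    }

# Hypothetical toy sample with one missing entry
sample = [25650, 135000, 135000, 145800, None, 202500]

summary = describe(sample)
for name, value in summary.items():
    print(f"{name:<8} {value}")
```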
AMT_INCOME_TOTAL
The histogram of the AMT_INCOME_TOTAL variable shows a right-skewed distribution.
Interpretation:
1. Right-Skewed Distribution:
o The histogram is highly concentrated at the lower end (close to 0), indicating that
most people in the dataset have lower incomes, while only a small number of
individuals have very high incomes.
o This is typical in income data, where there are many individuals earning below
average and fewer individuals with very high income.
3. Long Tail:
o The long tail on the right side shows the presence of a small number of individuals
with very high incomes. This could indicate outliers or rare cases of extremely high
earnings.
o Because of the right skew, consider log-transforming the data to make it more
symmetric for further statistical analysis. Methods such as linear regression assume
normally distributed residuals, and a heavily skewed variable with extreme values
can dominate the fit.
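The effect of a log transform on skewness can be checked with a short sketch. The income values are hypothetical, and the moment-based skewness coefficient is one common measure chosen here for illustration; a positive value indicates a longer right tail.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical toy incomes with one very high value
incomes = [25650, 90000, 135000, 145800, 202500, 450000, 5000000]
log_incomes = [math.log(v) for v in incomes]  # natural log; incomes are positive

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after:  {skewness(log_incomes):.2f}")
```

The transform compresses the long right tail, so the skewness of the logged values is lower than that of the raw values, although it does not necessarily reach zero.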
Handling Outliers:
o The outliers in the higher income range (near 5,000,000) may need further
investigation. Depending on the context, they could be genuine extreme values,
data errors, or very rare cases. Handle these outliers before proceeding to
other types of analysis.
Graph: Histogram of AMT_INCOME_TOTAL (x-axis: Bin; y-axis: Frequency, 0 to 35,000).
AMT_CREDIT
Univariate Analysis
Statistic   Value       Note
Count       49999       no missing values
Median      514777.5    middle credit amount
Mode        450000      most frequent credit amount
Maximum     4050000     highest loan amount
Minimum     45000       lowest loan amount
Std Dev     402411.4    moderate variability
Graph: Histogram of AMT_CREDIT (x-axis: Bin; y-axis: Frequency, 0 to 14,000).
AMT_ANNUITY
Univariate Analysis
Statistic   Value        Note
Count       49998        1 missing value
Median      24939        middle annuity amount
Mode        9000         most frequent annuity value
Maximum     258025.5     highest annuity value
Minimum     2052         lowest annuity value
Std Dev     14562.7988   high variability
Graph: Histogram of AMT_ANNUITY (x-axis: Bin, 0 to 198,000 in steps of 22,000; y-axis: Frequency, 0 to 4,000).
Bivariate Analysis
Target vs AMT_INCOME_TOTAL
Graph: Bar chart of Average Income By Default Status (x-axis: Default Status, Target = 1 vs. Target = 0; y-axis: Average Income, 155,000 to 195,000).
Interpretation
o The average income for defaulters (TARGET=1) is visibly higher than that of
non-defaulters (TARGET=0).
2. Potential Implications:
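o Data Skewness: The gap could also reflect outliers or skewness in the income
variable rather than a real difference between the groups, since an average is
sensitive to extreme values (the maximum income is 1.17E+08). This requires
further investigation.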
Analysis
1. Check Distribution:
o Consider using a box plot to explore whether outliers in the TARGET=1 group
are influencing the result.
2. Further Investigation:
o Perform segmented analysis, breaking income into ranges (e.g., low, medium,
high) to see default rates within each income group.
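Both follow-up checks can be sketched in plain Python. Everything in this example is hypothetical: the band cut-offs of 100,000 and 200,000, the records, and the assignment of the extreme income to a defaulter are chosen only to show how one outlier can raise a group's mean while its median stays low.

```python
from statistics import mean, median

def income_band(income, low=100000, high=200000):
    """Assign an income to a band; the cut-offs are hypothetical."""
    if income < low:
        return "low"
    if income < high:
        return "medium"
    return "high"

def default_rate_by_band(records):
    """records: (income, target) pairs with target 1 = default, 0 = no default."""
    totals, defaults = {}, {}
    for income, target in records:
        band = income_band(income)
        totals[band] = totals.get(band, 0) + 1
        defaults[band] = defaults.get(band, 0) + target
    return {band: defaults[band] / totals[band] for band in totals}

# Hypothetical toy records; the last one is an extreme income
records = [
    (60000, 1), (80000, 0), (90000, 1),
    (120000, 0), (150000, 0), (180000, 1),
    (250000, 0), (300000, 0), (117000000, 1),
]

rates = default_rate_by_band(records)
defaulters = [inc for inc, t in records if t == 1]
non_defaulters = [inc for inc, t in records if t == 0]

print("default rate by band:", rates)
print("mean income:  ", mean(defaulters), "vs", mean(non_defaulters))
print("median income:", median(defaulters), "vs", median(non_defaulters))
```

In this toy data the defaulters have the higher mean income but the lower median, so comparing medians or default rates per band is a useful check before drawing conclusions from averages.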
E. Correlation Analysis
1. Identify Correlations:
o Calculated correlation coefficients for numerical variables using the CORREL function
in Excel.
o Created a correlation matrix to identify top correlations with the target variable.
o Segmented data into scenarios (e.g., customers with payment difficulties vs. others).
Graph: A heatmap was created to visualize correlations between variables and the target.
Code Purpose:
Visualizes the correlations between the numerical variables and the target as a heatmap.
Expected Result:
A heatmap where:
o Strong correlations are represented by intense colors (red for positive, blue for
negative).
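The coefficient that Excel's CORREL function returns is the Pearson correlation coefficient, which can be sketched as below. The column values are hypothetical, and ranking by absolute value is an assumed way of picking the "top" correlations with the target.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rank_by_target_correlation(columns, target):
    """Sort columns by the absolute value of their correlation with the target."""
    pairs = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy values, not taken from the dataset
target = [0, 0, 1, 1, 0, 1]
columns = {
    "AMT_ANNUITY": [5, 3, 4, 4, 5, 3],
    "AMT_CREDIT": [1, 2, 5, 6, 2, 7],
}

ranked = rank_by_target_correlation(columns, target)
for name, r in ranked:
    print(f"{name:<12} {r:+.3f}")
```

A full correlation matrix applies the same function to every pair of columns; coloring its cells by sign and magnitude gives the heatmap.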
Key Insights
1. Variables like income, loan amount, and credit history are strong predictors of loan default.
2. Customers with incomplete applications or high debt-to-income ratios are more likely to
default.
3. Data imbalance suggests a need for balancing techniques when building predictive models.
Recommended Actions
3. For Modeling: