Exploratory Data Analysis

Uploaded by

farhashaikh.3535

Exploratory Data Analysis for Loan Default Prediction

Project Objective

The main objective of this project is to analyze loan application data and identify patterns that
influence loan default. By understanding these patterns, the company can minimize financial losses
by making informed decisions about loan approvals.

Key Challenges:

1. If an applicant who can repay the loan is rejected, the company loses business.

2. If an applicant who cannot repay the loan is approved, the company faces a financial loss.

Data Analysis Tasks

A. Missing Data Analysis


1. Identify Missing Data:

o Checked for missing values in all columns.

o Visualized the proportion of missing data using a bar chart.

2. Action Plan for Missing Data:

o Columns with more than 50% missing values were dropped.

o Columns with 10-50% missing values were imputed using median values.

o Columns with less than 10% missing values were imputed using mode or targeted
methods.
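
The action plan above can be sketched as a small rule applied to one column at a time. This is a minimal illustration using only the Python standard library, not the report's own implementation: the function name `plan_for_column` and the sample values are hypothetical, `None` is assumed to mark a missing cell, and the boundary handling (exactly 10% goes to median, exactly 50% is not dropped) is an assumption. Median imputation only makes sense for numerical columns.

```python
from statistics import median, mode

def plan_for_column(values):
    """Apply the missing-data plan to one column; None marks a missing cell."""
    present = [v for v in values if v is not None]
    missing_share = (len(values) - len(present)) / len(values)
    if missing_share == 0:
        return "keep", list(values)          # nothing to impute
    if missing_share > 0.50:
        return "drop", None                  # more than 50% missing
    if missing_share >= 0.10:
        action, fill = "median", median(present)   # 10-50% missing
    else:
        action, fill = "mode", mode(present)       # less than 10% missing
    return action, [fill if v is None else v for v in values]

# Illustrative values only: one missing cell out of twelve (about 8%).
annuity = [24939, None, 9000, 9000, 31500, 9000, 18000, 27000, 9000, 45000, 9000, 13500]
action, cleaned = plan_for_column(annuity)
```

Here the single gap is filled with the most frequent value, because the missing share is below 10%.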

Graph: A bar chart showing the percentage of missing data for each column was created.
 Code Purpose:

 Visualizes the percentage of missing data for each column in the dataset.

 Expected Result:

 A horizontal bar chart showing the percentage of missing values for each column, sorted in
descending order. Columns with the highest percentage of missing data appear at the top.
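
A minimal sketch of the calculation behind such a chart, assuming the data is available as a list of row dictionaries and that `None` or an empty string marks a missing cell. The function name `missing_percentages`, the column `HYPOTHETICAL_COL`, and the sample rows are illustrative; the text bars stand in for the bar chart.

```python
def missing_percentages(rows):
    """Percentage of missing cells per column, highest first."""
    result = {}
    for col in rows[0].keys():
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        result[col] = 100 * missing / len(rows)
    return sorted(result.items(), key=lambda item: item[1], reverse=True)

# Illustrative rows only; HYPOTHETICAL_COL is not a column named in the report.
rows = [
    {"AMT_INCOME_TOTAL": 135000, "AMT_ANNUITY": 9000, "HYPOTHETICAL_COL": None},
    {"AMT_INCOME_TOTAL": 145800, "AMT_ANNUITY": None, "HYPOTHETICAL_COL": None},
    {"AMT_INCOME_TOTAL": 202500, "AMT_ANNUITY": 24939, "HYPOTHETICAL_COL": 12},
    {"AMT_INCOME_TOTAL": 25650, "AMT_ANNUITY": 2052, "HYPOTHETICAL_COL": None},
]

ranked = missing_percentages(rows)
for name, pct in ranked:
    # One '#' per 5 percentage points, as a rough text version of the bar chart.
    print(f"{name:<18}{pct:5.1f}% " + "#" * round(pct / 5))
```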
B. Outlier Detection
1. Identify Outliers:

o Used the Interquartile Range (IQR) method to detect outliers in numerical columns.

o Visualized outliers using box plots for key numerical variables (e.g., income, loan
amount).

2. Action Plan for Outliers:

o Analyzed whether outliers were genuine data points or anomalies.

o Retained outliers that represented valid extreme cases; removed anomalies.

Graph: Box plots of numerical variables were used to highlight outliers.


 Code Purpose:

o Detects and visualizes outliers in the AMT_CREDIT column using a box plot.

 Expected Result:

o A box plot where:

 The box represents the interquartile range (IQR).

 Whiskers extend from the box to the most extreme data points that lie within 1.5 times the IQR of the quartiles.

 Any points outside the whiskers are considered outliers.
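
The IQR rule behind the box plot can be sketched as follows. This is an illustration under stated assumptions: the function name `iqr_outliers` and the sample credit amounts are hypothetical, and the quartiles use the `inclusive` interpolation method of `statistics.quantiles`, which may differ slightly from the quartile convention used for the report's charts.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the lower fence, upper fence, and the values outside them."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]

# Illustrative AMT_CREDIT values only.
credit = [45000, 270000, 450000, 514000, 675000, 808650, 4050000]
lower, upper, outliers = iqr_outliers(credit)
```

With these sample values only the largest loan falls above the upper fence; whether such a point is a genuine extreme case or an anomaly still has to be judged separately.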

C. Data Imbalance Analysis


1. Identify Data Imbalance:

o Analyzed the distribution of the target variable (TARGET), which indicates loan
default.

o Calculated the ratio of defaulted loans to non-defaulted loans.

2. Action Plan for Data Imbalance:

o Highlighted the imbalance in the dataset.

o Suggested techniques such as oversampling or undersampling for modeling.

Graph: A pie chart displaying the proportion of defaulted and non-defaulted loans was created.
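
A minimal sketch of the imbalance check, assuming `TARGET` is coded 1 for default and 0 for non-default as described in this report. The function name `class_balance` and the sample counts are illustrative, and both classes are assumed to be present.

```python
from collections import Counter

def class_balance(targets):
    """Counts, percentage shares, and the default to non-default ratio."""
    counts = Counter(targets)
    total = len(targets)
    shares = {label: 100 * n / total for label, n in counts.items()}
    ratio = counts[1] / counts[0]  # defaults per non-default; assumes counts[0] > 0
    return counts, shares, ratio

# Illustrative sample only: 23 non-defaults and 2 defaults.
targets = [0] * 23 + [1] * 2
counts, shares, ratio = class_balance(targets)
```

The `shares` dictionary holds the two percentages that the pie chart displays.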
D. Univariate, Segmented Univariate, and Bivariate Analysis
1. Univariate Analysis:

o Analyzed individual variables to understand their distribution.

o For numerical variables, used histograms to examine distributions (e.g., income
levels, loan amounts).

o For categorical variables, used bar charts (e.g., loan purposes).

2. Segmented Univariate Analysis:

o Compared distributions of variables for different scenarios (e.g., defaulted vs.
non-defaulted loans).

o Used pivot tables to calculate averages or medians within segments.

3. Bivariate Analysis:

o Explored relationships between variables (e.g., income vs. loan default rate).

o Used scatter plots and cross-tabulations to highlight correlations.

Graphs:

 Histograms and bar charts for univariate analysis.

 Stacked bar charts and scatter plots for segmented and bivariate analysis.
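
The segmented step, in which a pivot table gives an average or median per segment, can be sketched as below. The function name `summarize_by_segment` and the sample rows are hypothetical. The sample is built so that one extreme income sits in the TARGET=1 group, which shows why the median is worth reporting next to the mean.

```python
from collections import defaultdict
from statistics import mean, median

def summarize_by_segment(rows, segment_key, value_key):
    """Group rows by segment and report count, mean, and median of one variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_key]].append(row[value_key])
    return {
        seg: {"count": len(vals), "mean": mean(vals), "median": median(vals)}
        for seg, vals in groups.items()
    }

# Illustrative rows only.
rows = [
    {"TARGET": 0, "AMT_INCOME_TOTAL": 90000},
    {"TARGET": 0, "AMT_INCOME_TOTAL": 135000},
    {"TARGET": 0, "AMT_INCOME_TOTAL": 180000},
    {"TARGET": 0, "AMT_INCOME_TOTAL": 225000},
    {"TARGET": 1, "AMT_INCOME_TOTAL": 112500},
    {"TARGET": 1, "AMT_INCOME_TOTAL": 135000},
    {"TARGET": 1, "AMT_INCOME_TOTAL": 1170000},
]
summary = summarize_by_segment(rows, "TARGET", "AMT_INCOME_TOTAL")
```

In this sample the TARGET=1 mean is pulled far above the TARGET=0 mean by a single value, while its median stays lower.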
 Code Purpose:

 Explores the distribution of AMT_INCOME_TOTAL to understand its central tendency, spread,
and skewness.

 Expected Result:

 A histogram showing the frequency of income levels. The plot might reveal a skewed
distribution if there are extremely high-income individuals.
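
A minimal sketch of the counting step behind a histogram. It assumes each bin is labelled by its upper limit and holds the values greater than the previous limit and up to its own, with one extra slot for anything above the last limit. That convention, the function name `histogram_counts`, and the sample values are assumptions.

```python
from bisect import bisect_left

def histogram_counts(values, bin_limits):
    """Count values per bin; the final slot collects values above the last limit."""
    counts = [0] * (len(bin_limits) + 1)
    for v in values:
        # bisect_left finds the first limit that is >= v.
        counts[bisect_left(bin_limits, v)] += 1
    return counts

# Illustrative incomes and bin limits only.
incomes = [25650, 90000, 135000, 135000, 145800, 202500, 450000, 1170000]
limits = [100000, 200000, 300000]
counts = histogram_counts(incomes, limits)
```

A long run of small counts in the upper bins, or a large overflow slot, is the numerical signature of the right skew described above.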
 Code Purpose:

 Visualizes the proportion of loan default (TARGET=1) versus non-default (TARGET=0).

 Expected Result:

 A pie chart showing the percentage distribution between the two classes (Non-Default and
Default). For imbalanced data, one class (likely Non-Default) will dominate.
This analysis will guide the company in making informed decisions to minimize financial risks while
optimizing loan approvals. For the visualizations, Excel charts and conditional formatting features
were used to enhance clarity and presentation.

AMT_INCOME_TOTAL
Univariate Analysis
Statistic   Value       Note
Count       49999       no missing value
Median      145800      middle income
Mode        135000      most frequent income value
Maximum     1.17E+08    an extreme high outlier
Minimum     25650       the lowest income
Std Dev     531813.8    high variability in income
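
The summary statistics in this kind of table can be reproduced with a short helper. This is a sketch under assumptions: the function name `describe` and the sample values are hypothetical, `None` marks a missing cell, and the sample standard deviation (`statistics.stdev`) is used because the report does not say whether the sample or population formula was applied.

```python
from statistics import median, mode, stdev

def describe(values):
    """Count, missing count, median, mode, extremes, and sample standard deviation."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "median": median(present),
        "mode": mode(present),
        "max": max(present),
        "min": min(present),
        "std_dev": stdev(present),  # sample standard deviation (assumption)
    }

# Illustrative incomes only, with one missing cell.
stats = describe([135000, 25650, 135000, None, 202500, 145800])
```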
AMT_INCOME_TOTAL

The histogram of AMT_INCOME_TOTAL shows a right-skewed distribution.

Interpretation:

1. Right-Skewed Distribution:

o The histogram is highly concentrated at the lower end (close to 0), indicating that
most people in the dataset have lower incomes, while only a small number of
individuals have very high incomes.

o This is typical in income data, where there are many individuals earning below
average and fewer individuals with very high income.

2. Peak at Lower Values:


o The significant frequency at low-income levels (e.g., near 30,000) suggests that the
majority of the dataset lies within these ranges. This is important when considering
statistical models, as the majority of the data points will be clustered in the lower
ranges.

3. Long Tail:

o The long tail on the right side shows the presence of a small number of individuals
with very high incomes. This could indicate outliers or rare cases of extremely high
earnings.

What can be done with this distribution:


 Log Transformation:

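o Because of the right skew, consider log transforming the data to make it more
symmetric for further statistical analysis. Linear regression does not require
normally distributed inputs, but a less skewed variable reduces the leverage of
extreme incomes and often brings the residuals closer to normal.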

 Handling Outliers:

o The outliers on the higher income range (near 5,000,000) may need to be
investigated further. Depending on the context, these could either be extreme
values, errors, or very rare cases. You may want to handle these outliers before
proceeding to other types of analysis.
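
The effect of a log transformation on skew can be checked numerically. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the variance) and `math.log1p`; both choices, the function name `skewness`, and the sample incomes are assumptions made for illustration.

```python
import math

def skewness(values):
    """Moment-based skewness: positive for a long right tail."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Illustrative incomes only, with one very large value.
incomes = [25650, 90000, 135000, 135000, 145800, 202500, 450000, 1170000]
logged = [math.log1p(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(logged)
```

For this sample the skewness of the logged values is much closer to zero than that of the raw values, which is the behaviour the transformation is meant to produce.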
Graph: Histogram of AMT_INCOME_TOTAL (x-axis: Bin; y-axis: Frequency).

AMT_CREDIT
Univariate Analysis
Statistic   Value       Note
Count       49999       no missing value
Median      514777.5    middle credit amount
Mode        450000      most frequent credit amount
Maximum     4050000     highest loan amount
Minimum     45000       lowest loan amount
Std Dev     402411.4    moderate variability
Graph: Histogram of AMT_CREDIT (x-axis: Bin; y-axis: Frequency).

AMT_ANNUITY
Univariate Analysis
Statistic   Value        Note
Count       49998        1 missing value
Median      24939        middle annuity amount
Mode        9000         most frequent annuity value
Maximum     258025.5     highest annuity value
Minimum     2052         lowest annuity value
Std Dev     14562.7988   high variability
Graph: Histogram of AMT_ANNUITY (x-axis: Bin; y-axis: Frequency).

Bivariate Analysis
Target vs AMT_INCOME_TOTAL
Graph: Bar chart "Average Income By Default Status" (x-axis: Default Status, with bars for Target = 1 and Target = 0; y-axis: Average Income).

Interpretation

1. Higher Average Income for Defaulters:

o The average income for defaulters (TARGET=1) is visibly higher than that of
non-defaulters (TARGET=0).

o This is counterintuitive since higher income is often expected to correlate
with lower default rates.

2. Potential Implications:

o Overextension of Credit: Higher-income individuals may default due to
overestimating their ability to repay larger loans.

o External Factors: High-income defaulters might face external economic
pressures (e.g., business losses, unexpected expenses) that are unrelated to
income levels.

o Data Skewness: This could also indicate data outliers or skewness within the
income variable that requires further investigation.
Analysis
1. Check Distribution:

o Consider using a box plot to explore whether outliers in the TARGET=1 group
are influencing the result.

o Look for significant spread or extreme values in both groups.

2. Further Investigation:

o Perform segmented analysis, breaking income into ranges (e.g., low, medium,
high) to see default rates within each income group.

o Examine additional variables (e.g., loan amount, credit duration) to identify
other contributing factors.
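
The segmented check suggested above, default rate within income ranges, can be sketched as a simple cross-tabulation. The band limits (100,000 and 200,000), the function name `default_rate_by_band`, and the sample rows are hypothetical and would need to be chosen from the actual income distribution.

```python
def default_rate_by_band(rows, bands):
    """bands: (label, upper_limit) pairs in ascending order; returns default rate per band."""
    tallies = {label: [0, 0] for label, _ in bands}  # [defaults, applicants]
    for row in rows:
        for label, upper in bands:
            if row["AMT_INCOME_TOTAL"] <= upper:
                tallies[label][0] += row["TARGET"]
                tallies[label][1] += 1
                break
    return {label: (d / n if n else None) for label, (d, n) in tallies.items()}

# Hypothetical band limits and illustrative rows only.
bands = [("low", 100000), ("medium", 200000), ("high", float("inf"))]
incomes = [90000, 67500, 99000, 81000, 135000, 145800, 180000, 157500, 202500, 450000]
targets = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
rows = [{"AMT_INCOME_TOTAL": i, "TARGET": t} for i, t in zip(incomes, targets)]

rates = default_rate_by_band(rows, bands)
```

Comparing rates per band is less sensitive to a single extreme income than comparing group averages.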

To analyze AMT_CREDIT (loan amount) versus AMT_INCOME_TOTAL (income), we'll perform a
bivariate analysis to study their relationship. This analysis can help identify trends, such as whether
higher-income individuals take larger loans.

E. Correlation Analysis

1. Identify Correlations:

o Calculated correlation coefficients for numerical variables using the CORREL function
in Excel.
o Created a correlation matrix to identify top correlations with the target variable.

2. Segmented Correlation Analysis:

o Segmented data into scenarios (e.g., customers with payment difficulties vs. others).

o Highlighted the strongest indicators of default for each segment.

Graph: A heatmap was created to visualize correlations between variables and the target.

 Code Purpose:

 Displays the pairwise correlation of numerical variables in the dataset.

 Expected Result:
 A heatmap where:

o Strong correlations are represented by intense colors (red for positive, blue for
negative).

o Weak correlations are represented by neutral colors (white or light tones).
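
The numbers behind such a heatmap are pairwise Pearson correlation coefficients, the same quantity a spreadsheet CORREL function returns. The sketch below computes them directly; the function names and the sample columns are illustrative, and every column is assumed to be numerical, complete, and not constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

def rank_against_target(columns, target="TARGET"):
    """Other columns sorted by absolute correlation with the target, strongest first."""
    scores = {name: pearson(vals, columns[target])
              for name, vals in columns.items() if name != target}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Illustrative values only.
columns = {
    "AMT_INCOME_TOTAL": [90000, 135000, 145800, 202500, 450000, 67500],
    "AMT_CREDIT": [270000, 450000, 514777.5, 808650, 1350000, 180000],
    "TARGET": [0, 0, 0, 1, 0, 1],
}
matrix = correlation_matrix(columns)
ranked = rank_against_target(columns)
```

Because TARGET is binary, these coefficients only capture linear association with default and can be small even for useful variables.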

Key Insights

1. Variables like income, loan amount, and credit history are strong predictors of loan default.

2. Customers with incomplete applications or high debt-to-income ratios are more likely to
default.

3. Data imbalance suggests a need for balancing techniques when building predictive models.

Recommended Actions

1. For High-Risk Applicants:

o Reduce loan amounts or charge higher interest rates.

o Request additional documentation for creditworthiness.

2. For Low-Risk Applicants:

o Streamline the approval process to enhance customer experience.

3. For Modeling:

o Use balanced datasets so the model does not simply favor the majority (non-default) class.

o Focus on highly correlated variables for feature selection.
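
One of the balancing techniques mentioned in this report, oversampling, can be sketched as random duplication of minority-class rows until every class matches the largest one. The function name `random_oversample`, the fixed seed, and the sample rows are assumptions. In practice this would normally be applied to the training split only, so that duplicated rows do not leak into evaluation data.

```python
import random

def random_oversample(rows, target_key="TARGET", seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[target_key], []).append(row)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=largest - len(group)))
    rng.shuffle(balanced)
    return balanced

# Illustrative sample only: 23 non-defaults and 2 defaults.
rows = [{"TARGET": 0, "id": i} for i in range(23)] + [{"TARGET": 1, "id": 100 + i} for i in range(2)]
balanced = random_oversample(rows)
```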

Graphs and Visualizations

1. Bar Chart: Missing data percentages for each column.

2. Box Plots: Outliers in numerical variables.


3. Pie Chart: Proportion of defaulted vs. non-defaulted loans.

4. Histograms: Distribution of numerical variables.

5. Heatmap: Correlation between numerical variables and the target.
