Loan Approval Model Prediction
Submitted by
DATA ANALYTICS
Department of Mathematics and Computing
Synopsis: Loans are a major requirement of the modern world, and they account for a major
part of a bank's total profit. Loans help students manage their education and living
expenses, and help people buy houses, cars, and other high-value items. Deciding whether an
applicant's profile qualifies for a loan, however, requires a bank to weigh many factors.
We develop a model that predicts whether an applicant's loan will be approved, using
background information such as the applicant's gender, marital status, and income.
No.  Feature             Description
1    Loan                A unique ID
2    Gender              Gender of the applicant (Male/Female)
3    Married             Marital status of the applicant (Yes/No)
4    Dependents          Whether the applicant has any dependents
5    Education           Whether the applicant is a graduate
6    Self-Employed       Whether the applicant is self-employed (Yes/No)
7    Applicant Income    Income of the applicant
8    Coapplicant Income  Income of the co-applicant
9    Loan Amount         Loan amount (in thousands)
10   Loan_Amount_Term    Term of the loan (in months)
11   Credit_History      Record of the individual's repayment of past debts
12   Property_Area       Area of the property (Rural/Urban/Semi-urban)
13   Loan_Status         Whether the loan was approved (Y = Yes, N = No)
Importing Libraries
Pandas: to load the dataset into a DataFrame
Matplotlib: to visualize data features, e.g. with bar plots
Seaborn: to visualize the correlation between features using a heat map
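As a minimal sketch of this step, the three libraries can be imported under their usual
aliases and a small table loaded into a DataFrame. The inline CSV sample and its column
names are assumptions made for illustration; the real project would read the loan dataset
file instead.

```python
import io

import pandas as pd              # load the data into a DataFrame
import matplotlib.pyplot as plt  # bar plots and other charts (used in later steps)
import seaborn as sns            # count plots and heat maps (used in later steps)

# Hypothetical three-row sample standing in for the real loan CSV file.
sample_csv = io.StringIO(
    "Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Loan_Status\n"
    "LP001,Male,No,5849,128,Y\n"
    "LP002,Female,Yes,4583,66,N\n"
    "LP003,Male,Yes,3000,120,Y\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first rows, to check that the columns were parsed as expected
```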
Data Cleaning: Clean the data to handle missing values, outliers, and inconsistencies.
This step is crucial for the model's accuracy and generalization. We may need to impute
missing values, standardize or normalize features, and deal with any data anomalies.
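One possible way to impute missing values is sketched below. The toy data, the column
names, and the choice of mode for categorical columns and median for numeric columns are
assumptions for illustration, not necessarily the strategy used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; column names are assumed.
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

print(df.isna().sum())  # number of missing values per column

# Categorical-like columns: fill gaps with the most frequent value (the mode).
for col in ["Gender", "Credit_History"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Numeric column: fill gaps with the median, which is less sensitive
# to extreme values than the mean.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

print(df)
```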
Outlier: An outlier is a data point that significantly deviates from the other data points in
a dataset. Outliers can be unusually high or low values and can distort statistical analyses
and model training.
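A common rule of thumb, shown here as a sketch rather than as the method used in this
project, flags values lying more than 1.5 times the interquartile range (IQR) outside the
quartiles. The income values are made up for illustration.

```python
import pandas as pd

# Hypothetical applicant incomes; the last value is unusually high.
income = pd.Series(
    [2500, 3000, 3200, 3500, 4000, 4200, 4500, 81000], name="ApplicantIncome"
)

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are treated as outliers.
outliers = income[(income < lower) | (income > upper)]
print(outliers.tolist())  # [81000]

# One way to limit their influence: cap values at the bounds instead of dropping rows.
capped = income.clip(lower=lower, upper=upper)
```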
Data Visualization: Exploratory data analysis is performed using visualizations like
count plots and box plots to gain insights into the distribution and relationships between
features.
The code analyzes loan applications by gender, providing the frequency of each gender and
visualizing the distribution with a count plot. This analysis reveals that there are significantly more
male applicants than female applicants seeking loans.
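The analysis described above could look roughly like the following sketch. The toy data
is hypothetical and only illustrates the calls; it does not reproduce the project's
actual counts.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the real dataset has one row per loan application.
df = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Male", "Female", "Male"]})

# Frequency of each gender.
gender_counts = df["Gender"].value_counts()
print(gender_counts)

# Count plot of the same distribution.
fig, ax = plt.subplots()
sns.countplot(data=df, x="Gender", ax=ax)
ax.set_title("Loan applicants by gender")
# In a notebook the figure renders automatically; in a script, call plt.show().
```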
Next, the unique values of each categorical column are visualized using bar plots. This
shows which value dominates each column in our dataset.
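A sketch of such bar plots, one panel per categorical column, is given below. The toy
data and the selection of columns are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; column names are assumed.
df = pd.DataFrame({
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})
categorical_cols = ["Married", "Education", "Property_Area"]

# One bar plot per column, showing how often each unique value occurs.
fig, axes = plt.subplots(1, len(categorical_cols), figsize=(12, 3))
for ax, col in zip(axes, categorical_cols):
    counts = df[col].value_counts()
    ax.bar(list(counts.index), counts.values)
    ax.set_title(col)
fig.tight_layout()

# The dominating value of each column is the one with the highest count.
dominant = {col: df[col].value_counts().idxmax() for col in categorical_cols}
print(dominant)
```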
The next step, creating a heat map to visualize correlation, also belongs to the
Exploratory Data Analysis (EDA) phase of a machine learning project. EDA focuses on
understanding the data and identifying patterns, relationships, and potential issues
before proceeding with model building.
The code calculates the correlation between the numerical features of the loan
application dataset and visualizes it using a heat map. The heat map shows the strength
and direction of each relationship: darker blue shades indicate stronger positive
correlations, while lighter shades indicate weak or negative correlations. This gives
insight into feature dependencies.
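A correlation heat map of this kind can be sketched as follows. The toy data, the column
names, and the "Blues" colormap (chosen so that darker blue means a higher correlation)
are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; column names are assumed.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 4000, 5200, 6100, 8000],
    "LoanAmount": [60, 75, 100, 130, 150, 210],
    "Credit_History": [1, 0, 1, 1, 0, 1],
})

# Pairwise Pearson correlation between the numerical columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Fix the color scale to [-1, 1] so that shades are comparable between plots.
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="Blues", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between numerical features")
```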
Data Splitting: It prepares data for both model training and evaluation. The project aims
to predict loan approval status using machine learning models trained on a dataset split
into 75% for training and 25% for testing.
test_size=0.25 means 25% of the data is allocated for testing.
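The split can be sketched with scikit-learn's train_test_split. Only test_size=0.25 comes
from the text above; the toy data, random_state, and stratify arguments are assumptions
added to make the example reproducible and to keep the class ratio equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 applications with two features and a binary target.
X = pd.DataFrame({
    "ApplicantIncome": range(2000, 6000, 200),
    "Credit_History": [1, 0] * 10,
})
y = pd.Series([1, 0] * 10, name="Loan_Status")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```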
The Decision Tree model provides a visual representation of the loan approval prediction process,
highlighting key features and decision rules. By analyzing the tree structure, feature importance,
and decision paths, valuable insights can be gained into the factors influencing loan approval
decisions. This interpretability is a significant advantage of Decision Trees, allowing for better
understanding and transparency in the prediction process.
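The kind of inspection described above can be sketched as follows: the learned decision
rules are printed as text and the feature importances are ranked. The toy data, the
column names, and max_depth=3 are assumptions for illustration; sklearn.tree.plot_tree
draws the same tree graphically.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data in which credit history alone separates the classes.
X = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0, 0, 0],
    "ApplicantIncome": [2500, 4000, 6000, 3000, 2800, 5200, 4100, 3600],
})
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = approved, 0 = rejected

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Human-readable decision rules.
print(export_text(tree, feature_names=list(X.columns)))

# Feature importances, highest first.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```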
The KNN model achieved an accuracy of 79.87% in predicting loan approvals. Although
overall performance is good, recall for rejected loans is low (0.41 for class 0), meaning
the model misses more than half of the rejected applications. Tuning the number of
neighbors k may improve performance. The weighted average F1-score of 0.78 indicates
decent overall performance, making KNN a potential candidate for loan approval prediction.
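One way to search for a better k is a cross-validated grid search, sketched below. The
synthetic data, the candidate values of k, the feature scaling step, and the weighted F1
scoring are all assumptions for illustration, so the printed scores will not match the
figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (class 0 is the minority).
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.3, 0.7], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# KNN is distance-based, so features are scaled before fitting.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print("best k:", best_k)

# Per-class precision, recall, and F1 on the held-out test set.
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))
```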
Conclusion
This project explored several machine learning models to predict loan approval status
using the provided dataset. We investigated Logistic Regression, Decision Tree, Random
Forest, K-Nearest Neighbors, and XGBoost. Each model was trained, evaluated using
accuracy, the confusion matrix, and the classification report, and compared with the
others.
While all models demonstrated reasonable performance, XGBoost emerged as a strong
contender with high accuracy and robust predictions. The Decision Tree provided valuable
insights into feature importance and decision rules through its interpretable visualization.
Further improvements could be explored through hyperparameter tuning, feature
engineering, and addressing class imbalance. This project provides a solid foundation for
developing a reliable loan approval prediction system, enabling faster and more informed
decision-making in the loan application process.