Final Project Report - Kelompok 4
The solution our team proposed consists of two major processes. The first is building a robust machine learning model with high accuracy, and the second is deploying the model as a web application. The detailed solution workflow can be seen in figure 1.
Our solution starts with data set research. We found a customer credit data set on Kaggle that is quite complex and contains a large number of data points [1]. Once the data set is loaded, we go through each column to make sure we understand what it represents. The data set itself consists of 12 columns and 32,581 rows.
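As an illustration, a minimal sketch of this loading and inspection step could look as follows (the file name credit_risk_dataset.csv is an assumption):

```python
import pandas as pd

# Load the Kaggle customer credit data set (file name is an assumption)
df = pd.read_csv("credit_risk_dataset.csv")

# Inspect the shape and the columns: 12 columns, 32,581 rows
print(df.shape)    # (32581, 12)
print(df.dtypes)   # data type of each column
print(df.head())   # sample rows to understand what each column represents
```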
During this process, we found that the data set has missing values in a few columns. The columns with missing values are the customer employment length and the loan interest rate, with 2.74% and 9.5% missing values respectively. For the column with less than 5% missing values, we drop the affected rows, as this is common practice in the data science community [2]. For the column with more than 5% missing values, we impute the missing values with the column mean, since the column is not evenly distributed.
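A sketch of this missing-value handling, assuming the two columns are named person_emp_length and loan_int_rate (hypothetical names matching the description above):

```python
# Check the percentage of missing values per column
missing_pct = df.isna().mean() * 100
print(missing_pct)  # person_emp_length ~2.74%, loan_int_rate ~9.5%

# Below 5% missing: drop the affected rows
df = df.dropna(subset=["person_emp_length"])

# Above 5% missing: impute with the column mean
df["loan_int_rate"] = df["loan_int_rate"].fillna(df["loan_int_rate"].mean())
```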
2.1. ML Model Build Up
After the data cleaning process has been completed, we preprocessed the categorical values in the data set. In total there are 4 categorical columns, which are processed using label encoding (for columns with more than 2 unique values) and binary encoding (for columns with exactly 2 unique values). Right after preprocessing the categorical values, Exploratory Data Analysis (EDA) is done to detect outliers in the data set. We found that some columns have extreme outliers, which were handled using the capping method [3].
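A minimal sketch of the encoding and capping steps; the 1.5×IQR capping fences and the specific capped columns are assumptions for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Encode categorical columns: label encoding when a column has more than
# two unique values, a simple 0/1 mapping when it has exactly two
for col in df.select_dtypes(include="object").columns:
    if df[col].nunique() > 2:
        df[col] = LabelEncoder().fit_transform(df[col])
    else:
        values = df[col].unique()
        df[col] = df[col].map({values[0]: 0, values[1]: 1})

# Cap extreme outliers at the 1.5*IQR fences (capping method)
def cap_outliers(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

for col in ["person_age", "person_income"]:  # hypothetical outlier columns
    df[col] = cap_outliers(df[col])
```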
Feature selection is also crucial to developing the ML model. Some articles warn to check the correlation between each pair of columns and to make sure none of them are highly correlated (multicollinearity check) [4], since highly correlated features would affect the accuracy of the model if not handled correctly. Luckily, in our data set there is no pair of features with a high correlation value.
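The multicollinearity check can be sketched as below; the 0.8 threshold is an assumption, as the report does not state the cutoff used:

```python
import numpy as np

# Absolute pairwise correlation between numeric features; keep only the
# upper triangle so each pair is checked once
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

high_corr = [(r, c) for r in upper.index for c in upper.columns
             if upper.loc[r, c] > 0.8]
print(high_corr)  # empty list -> no highly correlated feature pair
```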
After selecting the features and cleaning the data set, the next step is to fit various machine learning algorithms and evaluate which algorithm gives the most accurate results. This step is discussed in detail in section 2.2, and the deployment is discussed in section 2.3. For the tools, we used Google Colaboratory for the ML model build up (red dashed line), utilizing the pandas, numpy, matplotlib, and sklearn libraries. HTML, CSS, and Visual Studio Code with Python's Flask library are used for the deployment (blue dashed line).
2.2. Machine Learning Model Evaluation
To deploy the correct machine learning algorithm, model selection is performed by fitting a few algorithms to the data set and evaluating each with the AUC score, since this is a classification problem. The algorithms we fit to the data set are Random Forest Classifier, Decision Tree Classifier, KNN Classifier, and Logistic Regression.
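A sketch of this model selection step, assuming the target column is named loan_status (1 = bad loan, 0 = good loan):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Target column name is an assumption based on the data set description
X = df.drop(columns=["loan_status"])
y = df["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```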
From figure 2, it can be concluded that Random Forest Classifier is the algorithm that produces the best score. Thus, we evaluate the model using a confusion matrix (figure 3) to see how well it predicts customer creditworthiness. Overall, the model gives good accuracy, with a precision score of 0.97 for class 1 (bad loan customer) and 0.92 for class 0 (good loan customer). Thus, this model is used for the deployment.
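Continuing the sketch above, the confusion matrix and per-class precision can be obtained as follows:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Evaluate the best model from the selection step
best_model = models["Random Forest"]
y_pred = best_model.predict(X_test)

# Confusion matrix and per-class precision (1 = bad loan, 0 = good loan)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```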
2.3. Model Deployment
To deploy the model, we used a reference we found on YouTube, considering the limited time for the project [5]. The deployment flow is quite simple: it starts with saving the model into a pickle file, which is then loaded into the web page using Python's Flask library. Figure 4 shows the user interface of the web page, while figures 5 and 6 show the user interface after the predictor values have been inputted.
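A minimal sketch of this deployment flow; the template name and form fields are assumptions, not the exact pages shown in figures 4 to 6:

```python
import pickle
from flask import Flask, request, render_template

# Save the trained model to a pickle file
with open("model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Minimal Flask app that loads the pickle and serves predictions
app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))

@app.route("/")
def home():
    return render_template("index.html")  # assumed template name

@app.route("/predict", methods=["POST"])
def predict():
    # Assumes the form submits the predictor values as numbers
    features = [float(v) for v in request.form.values()]
    prediction = model.predict([features])[0]
    label = "Bad loan customer" if prediction == 1 else "Good loan customer"
    return render_template("index.html", prediction_text=label)

if __name__ == "__main__":
    app.run(debug=True)
```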
The predicted benefit of changing the loan review system is that the review time needed for each customer can be reduced significantly, while at the same time reducing the workforce needed for manual review. A reduction in manual labor expenses and an increase in income are expected after using the model, as the time needed to review is shortened.
Besides more efficient operations, predicting the right customers will also bring more revenue to the company. From the data set we know that the average bad loan amount is $10,760. Thus, with a precision score of 0.97 for the bad loan class, roughly 97 of every 100 customers flagged as bad are correctly rejected, so the company could save more than $1 million (about 97 × $10,760 ≈ $1.04 million) per 100 flagged customers, and instead gain more revenue by lending that amount to more promising customers.
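This estimate can be reproduced with simple arithmetic from the figures above:

```python
# Back-of-the-envelope savings estimate from the report's figures
avg_bad_loan = 10_760   # average bad-loan amount from the data set ($)
precision = 0.97        # precision for the "bad loan" class

flagged = 100                        # customers flagged as bad
true_bad = precision * flagged       # ~97 correctly rejected customers
savings = true_bad * avg_bad_loan
print(f"Estimated savings per 100 flagged customers: ${savings:,.0f}")
# -> roughly $1,043,720, i.e. more than $1 million
```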
GitHub: https://github.com/AndoniFikri/Credit-Risk-Prediction-with-Deployment
3. References