FRA Project Report
(Milestone I)
Table of Contents
Introduction
1.4 Univariate (4 marks) & Bivariate (6 marks) analysis with proper interpretation (you may choose to include only those variables which were significant in the model building)
1.6 Build Logistic Regression Model (using statsmodels library) on most important variables on Train Dataset and choose the optimum cutoff. Also showcase your model building approach
1.7 Validate the Model on Test Dataset and state the performance metrics. Also state interpretations from the model
Predicting Credit Risk
Problem – I
Executive Summary
Businesses or companies can fall prey to default if they are not able to keep up with their debt obligations. A default leads to a lower credit rating for the company, which in turn reduces its chances of getting credit in the future, and the company may have to pay higher interest on existing debts as well as on any new obligations. From an investor's point of view, an investor would want to invest in a company only if it is capable of handling its financial obligations, can grow quickly, and is able to manage the scale of that growth.
A balance sheet is a financial statement of a company that provides a snapshot of what a company owns, owes,
and the amount invested by the shareholders. Thus, it is an important tool that helps evaluate the performance
of a business.
The available data includes information from the financial statements of the companies for the previous year (2015). Information about the net worth of each company in the following year (2016) is also provided, which can be used to derive the labelled field.
An explanation of the data fields is available in the data dictionary, 'Credit Default Data Dictionary.xlsx'.
We need to create a default variable that takes the value 1 when net worth next year is negative and 0 when net worth next year is positive.
Introduction
This assignment requires us to perform outlier treatment, missing value treatment, transformation of the target variable into 0 and 1, univariate and bivariate analysis, and a train/test split, then to build a logistic regression model on the most important variables of the train dataset, choose the optimum cutoff, and validate the model on the test dataset.
We have 3586 entries and 67 columns. The outcome of this assignment will be to suggest to investors companies with good credit ratings in which to invest their money.
Data Description
S.No | Field Name | Description
28 | Revenue expenses in forex | Expenses due to foreign currency transactions
29 | Capital expenses in forex | Long term investment in forex
30 | Book Value (Unit Curr) | Net asset value
31 | Book Value (Adj.) (Unit Curr) | Book value adjusted to reflect the asset's true fair market value
32 | Market Capitalisation | Product of the total number of a company's outstanding shares and the current market price of one share
33 | CEPS (annualised) (Unit Curr) | Cash Earnings per Share, a profitability ratio that measures the financial performance of a company by calculating cash flows on a per share basis
34 | Cash Flow From Operating Activities | Use of cash from ongoing regular business activities
35 | Cash Flow From Investing Activities | Cash used in the purchase of non-current (long-term) assets that will deliver value in the future
36 | Cash Flow From Financing Activities | Net flows of cash used to fund the company (transactions involving debt, equity, and dividends)
37 | ROG-Net Worth (%) | Rate of Growth - Networth
38 | ROG-Capital Employed (%) | Rate of Growth - Capital Employed
39 | ROG-Gross Block (%) | Rate of Growth - Gross Block
40 | ROG-Gross Sales (%) | Rate of Growth - Gross Sales
41 | ROG-Net Sales (%) | Rate of Growth - Net Sales
42 | ROG-Cost of Production (%) | Rate of Growth - Cost of Production
43 | ROG-Total Assets (%) | Rate of Growth - Total Assets
44 | ROG-PBIDT (%) | Rate of Growth - PBIDT
45 | ROG-PBDT (%) | Rate of Growth - PBDT
46 | ROG-PBIT (%) | Rate of Growth - PBIT
47 | ROG-PBT (%) | Rate of Growth - PBT
48 | ROG-PAT (%) | Rate of Growth - PAT
49 | ROG-CP (%) | Rate of Growth - CP
50 | ROG-Revenue earnings in forex (%) | Rate of Growth - Revenue earnings in forex
51 | ROG-Revenue expenses in forex (%) | Rate of Growth - Revenue expenses in forex
52 | ROG-Market Capitalisation (%) | Rate of Growth - Market Capitalisation
53 | Current Ratio[Latest] | Liquidity ratio: a company's ability to pay short-term obligations or those due within one year
54 | Fixed Assets Ratio[Latest] | Solvency ratio indicating the capacity of a company to discharge its obligations towards long-term lenders
55 | Inventory Ratio[Latest] | Activity ratio: specifies the number of times the stock or inventory has been replaced and sold by the company
56 | Debtors Ratio[Latest] | Measures how quickly debtors are paying cash back to the company
57 | Total Asset Turnover Ratio[Latest] | The value of a company's revenues relative to the value of its assets
58 | Interest Cover Ratio[Latest] | Determines how easily a company can pay interest on its outstanding debt
59 | PBIDTM (%)[Latest] | Profit Before Interest Depreciation and Tax Margin
60 | PBITM (%)[Latest] | Profit Before Interest Tax Margin
61 | PBDTM (%)[Latest] | Profit Before Depreciation Tax Margin
62 | CPM (%)[Latest] | Cash Profit Margin
63 | APATM (%)[Latest] | After Tax Profit Margin
64 | Debtors Velocity (Days) | Average days required for receiving the payments
65 | Creditors Velocity (Days) | Average number of days the company takes to pay suppliers
66 | Inventory Velocity (Days) | Average number of days the company needs to turn its inventory into sales
67 | Value of Output/Total Assets | Ratio of Value of Output (market value) to Total Assets
68 | Value of Output/Gross Block | Ratio of Value of Output (market value) to Gross Block
Checking the top 5 rows again after fixing the messy column names for ease of use:
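A minimal sketch of this clean-up step; the file name and the exact cleaning rules are assumptions, since the report only states that messy column names were fixed:

```python
import pandas as pd

# Load the raw data; the file name here is an assumption for illustration.
df = pd.read_excel("Company_Data_2015.xlsx")

# Replace spaces and special characters in the headers with underscores so the
# columns are easy to reference later (e.g. in statsmodels formulas).
df.columns = (
    df.columns.str.strip()
              .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)
              .str.strip("_")
)

print(df.head())  # check the top 5 rows with the cleaned column names
```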
Inference:
• The number of rows (observations) is 3586 and the number of columns (variables) is 67.
• All the variables are numeric except one (Co_Name), which is of object type.
• There are missing values in 13 of the variables. Missing values will be treated with either the mean or the median of the corresponding variable.
• There are outliers in the dataset. They will be treated for our analysis.
• The problem statement requires us to predict the "default" status of a company, where the company's "Networth Next Year" is used to derive the "default" field. "default" is 1 when "Networth Next Year" is negative and 0 when "Networth Next Year" is positive.
• The "default" field is created and added to the dataset based on the condition above. Subsequently, "Networth Next Year" is not considered further, as it becomes redundant.
• Outliers are present in all of the independent variables. Outlier treatment is necessary for any regression model; in regression, outliers pull the regression line towards themselves, thereby affecting its slope. This distorts reality and leads to faulty predictions.
• Outliers are treated using the interquartile range (IQR) for each numerical column:
Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 - Q1.
An outlier is any value lying more than 1.5 × IQR below Q1 or above Q3.
• Values above the upper bound (Q3 + 1.5 × IQR) are capped at the upper bound.
• Values below the lower bound (Q1 - 1.5 × IQR) are capped at the lower bound.
• The boxplots of the numerical variables before and after outlier treatment are shown below.
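A minimal sketch of this capping step, assuming the cleaned DataFrame is called df and the label column is named default (both names are illustrative):

```python
import numpy as np

# Cap every numeric column at the 1.5 * IQR whiskers, leaving the label alone.
num_cols = df.select_dtypes(include=np.number).columns.drop("default", errors="ignore")

for col in num_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[col] = df[col].clip(lower=lower, upper=upper)  # values beyond the bounds are capped
```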
Before Treating Outliers
Inference:
Given that this is financial data, the outliers may well reflect genuine information, since data is captured for small, medium, and large companies.
There are some missing values in the dataset, which are treated in the subsequent steps. Given the size of the dataset (3586 rows), there were not many missing values to start with.
Before imputing NULL values:
A total of 118 missing values (around 0.05% of all cells) were observed in the entire dataset.
Null values were present in many columns; however, a significant number were in the "Inventory_Vel_Days" column. This is the one we treated.
There are a large number of zeros; these zeros do not add any predictive value, but they can cause a "linear algebra error" when using statsmodels.
Records with a missing value in the "Inventory_Vel_Days" column were imputed with the average (mean) value.
Visual inspection of the missing values in our data: a visual of all these missing values (after dropping) is given below.
We use SimpleImputer from sklearn.impute (from sklearn.impute import SimpleImputer) to treat the null values.
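A minimal sketch of the imputation step, using the column name as written in this report; the mean strategy is assumed to match the "average value" mentioned above:

```python
from sklearn.impute import SimpleImputer

# Impute the missing Inventory_Vel_Days entries with the column mean.
imputer = SimpleImputer(strategy="mean")
df[["Inventory_Vel_Days"]] = imputer.fit_transform(df[["Inventory_Vel_Days"]])

print(df["Inventory_Vel_Days"].isna().sum())  # expect 0 missing values afterwards
```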
Q1.3 Transform Target variable into 0 and 1
• There is no target variable defined in the raw data, but since the objective is to build a model that helps the investor identify companies likely to default, we derive the target from Networth_Next_Year.
• If the company's Networth_Next_Year is greater than 0, the company would continue to return a good investment for the investor and is therefore transformed to 0 (NON-DEFAULT).
• If the company's Networth_Next_Year is equal to or less than 0, the company is transformed to 1 (DEFAULT).
default value counts:
0    3198
1     388
That is, about 11% of the companies in the dataset are likely to default, and these are the ones the investor could avoid investing in.
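A minimal sketch of this transformation, assuming the cleaned column name Networth_Next_Year:

```python
# 1 = DEFAULT (net worth next year <= 0), 0 = NON-DEFAULT (net worth next year > 0).
df["default"] = (df["Networth_Next_Year"] <= 0).astype(int)

# Drop the source column, which is redundant once the label exists.
df = df.drop(columns=["Networth_Next_Year"])

print(df["default"].value_counts())  # expected: 0 -> 3198, 1 -> 388
```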
Q1.4 Univariate & Bivariate analysis with proper interpretation. (We choose to
include only those variables which were significant in the model building).
Univariate Analysis:
We performed a descriptive summary of the company data. Since most of the columns are continuous, we can see the mean, standard deviation, and percentile details for all the columns.
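The summary itself can be produced with a short snippet such as the following (a sketch, assuming the DataFrame is named df):

```python
# Transposed descriptive statistics: count, mean, std, min, quartiles, max per column.
summary = df.describe().T
print(summary)
```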
Box plots of some of the important features
15 significant scaled feature variables: distribution of each column with a displot and box plot:
Fig.6 Scaled feature variables: distribution of each column with a displot and box plot
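A minimal sketch of how these per-variable plots can be generated; the feature list below is illustrative, not the actual set of 15 significant variables:

```python
import matplotlib.pyplot as plt
import seaborn as sns

features = ["Book_Value_Unit_Curr", "PBIDT", "Current_Ratio_Latest"]  # illustrative names

for col in features:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=ax1)  # distribution view (displot-style)
    sns.boxplot(x=df[col], ax=ax2)           # box plot to show skew and outliers
    fig.suptitle(col)
    plt.tight_layout()
    plt.show()
```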
Inference:
• 'Selling Cost' has most companies concentrated around its mean, with a right skew and outliers on the higher side.
• 'Cash Flow from Operating Activities' shows a roughly normal distribution, with most companies lying around the mean.
• 'PBIDT' (Profit Before Interest, Depreciation and Tax): most companies lie around the mean, with a prominent right skew. This indicates that there are still many companies with high PBIDT.
• 'ROG Networth', 'ROG Capital Employed', 'ROG Total Assets', 'ROG PBIDT', 'ROG PBT (Profit Before Tax)', 'ROG CP', 'Current Ratio Latest', 'Interest Cover Ratio Latest', 'Value of Output to Total Assets', 'Net Sales', 'Book Value Adjusted': these variables have the maximum density of companies around their mean with a right skew, indicating outliers on the higher side.
• 'APATM (After Tax Profit Margin)' has its maximum density around the mean and a prominent left skew. This indicates that many companies have their net profit on the lower side of the distribution, a possible indication of default.
• Mostly, it is observed that many companies have good margins and financials before tax and other costs. But after all costs are considered, they slide to the lower half, which shows they need to work on their costs and bottom line.
Bivariate Analysis:
Fig.7 - Gross Sales vs Net Sales
• There exists a linear relationship between these two important variables.
Fig.8 - Net Worth vs Capital Employed
• As capital increases, net worth also increases, but in some cases capital appears to be deployed even for lower net worth.
Correlation Heatmap: using Recursive Feature Elimination (RFE), we obtained the top important variables plus one target variable. These were significant in the model building.
• A correlation heat-map of the top 19 predictors and the target, used to examine the correlations, is given above.
• Fixed_Asset_Ratio(Latest) and Value_of_output_to_gross_block, and Total_Asset_Turnover_Ration (Latest) and Value_of_output_to_Total_Assets: these pairs of features show high correlation, which looks obvious as they appear to be derived from, or direct functions of, each other.
• The target variable 'default' has a high negative correlation with Book_Value_unit_curr. This indicates that as book value rises, the probability of default falls.
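A minimal sketch of how such a correlation heatmap can be produced; the column names in rfe_cols are illustrative stand-ins for the actual 19 RFE-selected predictors:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative subset of predictor names; replace with the RFE-selected columns.
rfe_cols = ["Book_Value_unit_curr", "Current_Ratio_Latest", "PBIDT", "Networth"]
corr = df[rfe_cols + ["default"]].corr()

plt.figure(figsize=(12, 9))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap of RFE-selected predictors and the target")
plt.show()
```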
Inferences from Univariate and Bi-variate analysis:
• Most of the variables have skewed distributions, but we do not treat those distributions with any transformation, since logistic regression does not assume normally distributed predictors.
• All the variables have outliers. These outliers are treated, as we are going to apply logistic regression.
• We also performed multivariate analysis on the data to see if there are any correlations that are significant.
• We observed that net worth and net worth next year were highly correlated. Apart from this, we observed high correlations among several other pairs of variables.
• Overall, high positive and negative correlations between variables can be seen above. This analysis helps in selecting variables for the model building.
Q 1.5 Split the data into Train and Test dataset in a ratio of 67:33
• Split the data into train and test datasets in a ratio of 67:33 and use random_state = 42.
• We split the dataset into X (the independent variables) and y (the dependent/target variable), as sketched below.
• Model building is done on the train dataset and model validation is done on the test dataset.
• Out of the total 3586 observations, after the split the train set has 2402 observations and the test set has 1184 observations.
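A minimal sketch of the split; whether stratification was used is not stated in the report, so it is flagged as an assumption:

```python
from sklearn.model_selection import train_test_split

# Drop the label and the non-numeric company identifier from the predictors.
X = df.drop(columns=["default", "Co_Name"], errors="ignore")
y = df["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y  # stratify=y is an assumption
)

print(X_train.shape, X_test.shape)  # expected: (2402, ...) and (1184, ...)
```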
Training data columns:
Q 1.6 Build Logistic Regression Model (using statsmodels library) on most important variables on Train Dataset and choose the optimum cut-off. Also showcase your model building approach.
For model building, we use Recursive Feature Elimination (RFE) and select the top 21 features (about one third of the total feature variables) that contribute well to the model. RFE assigns a weight to each variable and, based on these weights, rankings are provided.
For modelling we use logistic regression with recursive feature elimination.
Below are the highest contributing independent variables to the model building.
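A minimal sketch of this model-building approach; the RFE estimator settings and the F1-based cut-off search are assumptions about details the report does not spell out:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Recursive Feature Elimination to keep the top 21 predictors.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=21)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]

# Logistic regression on the selected variables using the statsmodels library.
logit = sm.Logit(y_train, sm.add_constant(X_train[selected])).fit()
print(logit.summary())

# Choose the optimum cut-off by scanning thresholds on the training predictions
# (maximising F1 here; another criterion could equally be used).
train_probs = logit.predict(sm.add_constant(X_train[selected]))
cutoffs = np.arange(0.10, 0.91, 0.01)
best_cutoff = max(cutoffs, key=lambda c: f1_score(y_train, (train_probs >= c).astype(int)))
print("optimum cutoff:", round(best_cutoff, 2))
```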
Q 1.7 Validate the Model on Test Dataset and state the performance metrics. Also state interpretations from the model.
➢ We train the model and then validate it on both the training and testing sets. We plot the confusion matrix and classification report for both sets.
➢ We could see high precision and accuracy, but the recall seems to be low on the training data.
➢ We need to improve the recall value, as that gives us more true positives (TP), which in turn means that we correctly identify the defaulters; if we miss a defaulter, the bank ends up bearing higher costs on its existing debts and its cash flow will not be regularised.
Confusion matrix and Classification Report for the training set:
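Continuing the earlier sketch (it reuses logit, selected, train_probs and best_cutoff from the model-building step), these reports could be produced as follows:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Probabilities on the test set from the fitted statsmodels logit.
test_probs = logit.predict(sm.add_constant(X_test[selected]))

y_train_pred = (train_probs >= best_cutoff).astype(int)
y_test_pred = (test_probs >= best_cutoff).astype(int)

print("Train confusion matrix:\n", confusion_matrix(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred))

print("Test confusion matrix:\n", confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred))
```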
Confusion matrix and Classification Report for the test set :
We could see high precision and accuracy, but the recall seems to be low on the testing set as well.
An accuracy of over 95% was achieved, while precision, recall and F1 score were also very high at 96%, 98% and 97% respectively.
For both sets, accuracy and precision (the ratio of true positives to all predicted positives) are on the higher side, but recall is the weak point in both.
This seems to be the case because we had an imbalanced dataset for our model.
➢ So, we balance our default values (the proportion of 1's relative to 0's is increased). In our dataset we only had 11% defaults, so we balance the dataset using the SMOTE technique before fitting it in our model.
➢ After applying the SMOTE technique, we fit the model and predict values on both the training and testing sets.
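A minimal sketch of the SMOTE step using the imbalanced-learn library; it reuses X_train, y_train and selected from the earlier sketches, only the training data is resampled, and the random_state is an assumption:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)  # random_state is an assumption
X_train_sm, y_train_sm = smote.fit_resample(X_train[selected], y_train)

# Refit the logistic regression on the balanced training data.
logit_sm = sm.Logit(y_train_sm, sm.add_constant(X_train_sm)).fit()

print(pd.Series(y_train_sm).value_counts())  # the two classes are now balanced
```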
Classification Report for the Training Set (SMOTE):
We could see that the recall has improved greatly, so the chance of identifying our defaulters has significantly improved and there is less chance of the model missing out on any potential defaulters.
Classification Report for the Testing Set (SMOTE):
An accuracy of 92% was observed on the test set, with precision, recall and F1 score of 99%, 92% and 95% respectively.
Finally, we are able to achieve a decent recall value without overfitting. Considering challenges such as outliers, missing values and correlated features, this is a fairly good model. It can be improved if we get better quality data in which the features explaining default are not missing to this extent. Of course, we can also try other techniques that are less sensitive to missing values and outliers.
Interpretation:
This clearly indicates that the model which has been built is highly efficient and has been able to capture the correct variables for prediction. It has been shown to work on the train as well as the test data.
The recall values for the testing and training sets are close, and this model is well suited to identifying defaulters correctly because of the high recall value in both sets.
Precision seems to be on the lower side for both sets because of the SMOTE technique, as we create additional minority-class samples to balance the defaulter ratio.
But in this model, recall is the more important factor, as we stress identifying the defaulters accurately.
From the multivariate analysis, we observed that many companies had good profit margins before considering taxes, interest and other costs. But once all costs are considered along with taxes and depreciation, the majority of these companies slide to the bottom half in profitability. These companies should focus on optimising their bottom line.
THE END