Assignment 2
Student Name PG ID
Saumya Shukla 62110149
Abhishek Batra 62110322
Samank Singh 62110162
Mukund Bubna 62110646
The p-value (Prob > |t|) is < 0.0001. If the null hypothesis were true, a result at least this
extreme would occur with probability below 0.0001.
Therefore, we can reject the null hypothesis at any conventional significance level. It is highly
likely that the population proportion of repeat customers is not ¼.
2. Does the FICO score of the borrower appear to be normally distributed? Justify your answer.
The distribution of FICO is approximately normal, with a slight positive skewness of 0.2248. In the
quantile box plot, most observations align with the normal line (i.e., they lie within the control
limits and form an approximately straight line), with some skew in the tails. The median is slightly
greater than the mean. Overall, the sample observations approximate a normal distribution.
3. State the 95% confidence interval for the average FICO score and give a brief interpretation of
what this interval means. Be sure to check that it is appropriate to form such a confidence
interval for your data.
The 95% confidence interval for the mean FICO score is 568.728 to 579.8398. This means we are 95%
confident, based on this sample, that the population mean FICO score lies in this interval.
We have assumed that the sample is unbiased and independent and is large enough for the CLT to
apply; a normal distribution has also been assumed so that the t-statistic applies.
4. A manager at the loan operation claimed that the average age (the Years in Business variable)
of all the businesses served by this firm is less than 8 years. Do you agree, based on your data?
Explain briefly.
Let H0: µ ≤ 8, HA: µ > 8,
where H0 is the null hypothesis and µ is the population mean of years in business of loan-taking
companies.
The p-value (Prob > |t|) is < 0.0001. If the null hypothesis were true, a result at least this
extreme would occur with probability below 0.0001.
Therefore, we can reject the null hypothesis at any conventional significance level. It is highly
likely that the population mean of years in business of loan-taking companies is not 8.
[Note: we are assuming that the population is normally distributed.]
5. To identify the average PRSM in the population to 2 decimal places (i.e., to have the margin of
error less than 0.005), how large a sample would you recommend?
DMOE = 0.005
Let n be the sample size; then
$$n \geq \left( \frac{t_{(1-\alpha,\; n-1)} \, S}{\mathrm{DMOE}} \right)^2$$
From the approximately normal distribution of PRSM scores above, we can take S = 0.115201 and a
95% confidence level, and follow the iterative process (as shown below) to arrive at the required
sample size of 2042 loans.
Hence, with a random sample of 2042 loans, we can estimate the population mean PRSM score with a
MOE of 0.005 at 95% confidence.
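The iteration is needed because the t quantile itself depends on n. A minimal sketch of that search follows, using only the Python standard library. Two assumptions are not from the report: the quantile is taken as the two-sided 0.975 point for 95% confidence, and the Student-t quantile is approximated with a Cornish-Fisher expansion around the normal quantile (statistical software would use an exact routine). The function names are illustrative.

```python
import math
from statistics import NormalDist


def t_quantile(p: float, df: int) -> float:
    """Approximate Student-t quantile (Cornish-Fisher expansion; an assumption,
    accurate for the large degrees of freedom that arise here)."""
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96
    return z + g1 / df + g2 / df**2


def required_sample_size(s: float, dmoe: float, conf: float = 0.95) -> int:
    """Smallest n with n >= (t(n-1) * s / dmoe)^2."""
    p = 1 - (1 - conf) / 2  # two-sided quantile, 0.975 for 95% (assumed)
    z = NormalDist().inv_cdf(p)
    # The normal-based size is a lower bound, since t > z for any finite df.
    n = max(2, math.ceil((z * s / dmoe) ** 2))
    # Increase n until the t-based requirement is met.
    while n < (t_quantile(p, n - 1) * s / dmoe) ** 2:
        n += 1
    return n


# S and DMOE are the values quoted in the text.
n_required = required_sample_size(s=0.115201, dmoe=0.005)
print(n_required)
```

Starting from the normal-based size and stepping upward guarantees termination, because the requirement on the right-hand side shrinks as n grows.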
6. Is the population average PRSM score statistically significantly different from 1?
(a) Indicate an answer to the question, with a brief account of your analysis.
Let H0: µ = 1, HA: µ ≠ 1,
where H0 is the null hypothesis and µ is the population mean PRSM score of loans.
The p-value (Prob > |t|) is < 0.0001. If the null hypothesis were true, a result at least this
extreme would occur with probability below 0.0001.
Therefore, we can reject the null hypothesis at any conventional significance level. It is highly
likely that the population mean PRSM score of loans is not 1.
(b) What are the implications of your answer for the business?
The sample mean PRSM score of the 627 loans is 0.7998, and the hypothesis test above confirmed that
the population mean is significantly different from 1. Based on this sample, we can infer that the
population mean PRSM score is less than 1. This is a negative indicator for the company, since the
loans are not being paid back at the required rate. A longer payback time may increase the chance
of default and may reduce the company's working capital.
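The test in part (a) can be checked by hand from summary statistics. The sketch below uses the sample mean (0.7998) and sample size (627) quoted above, and assumes that the standard deviation S = 0.115201 quoted in question 5 is the sample standard deviation of these same loans. The two-sided p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable at this sample size.

```python
import math
from statistics import NormalDist


def one_sample_t(sample_mean: float, mu0: float, s: float, n: int) -> float:
    """t statistic for H0: mu = mu0, computed from summary statistics."""
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


# Values quoted in the text; S is assumed to come from the same 627 loans.
t_stat = one_sample_t(sample_mean=0.7998, mu0=1.0, s=0.115201, n=627)

# Two-sided p-value via the normal approximation (assumption, large n).
p_approx = 2 * NormalDist().cdf(-abs(t_stat))

print(f"t = {t_stat:.2f}, approximate two-sided p = {p_approx:.4g}")
```

The statistic is large and negative, which agrees both with the reported p-value below 0.0001 and with the conclusion that the population mean is below 1.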
7. The Chief Risk Officer is particularly concerned about the percentage of loans that have PRSM
scores of less than 0.7. What can you tell her about this percentage from your data? Approach
this question by creating a confidence interval for the proportion of loans in the population that
have a PRSM score of 0.7 or less.
Based on the sample PRSM data for 627 loans, the population proportion of loans with a PRSM score
of 0.7 or less is between 0.1665 and 0.22903 at the 95% confidence level.
That is, we are 95% confident that the proportion of loans with PRSM scores of 0.7 or less lies in
this interval.
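For reference, a confidence interval for a proportion can be sketched as follows. This is the Wald (normal-approximation) interval, which may differ from the method the software used for the figures above. The count of 124 loans is hypothetical, chosen only for illustration; the report does not state how many of the 627 loans scored 0.7 or less.

```python
import math
from statistics import NormalDist


def wald_interval(successes: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# HYPOTHETICAL count: 124 is not taken from the data; only n = 627 is.
lower, upper = wald_interval(successes=124, n=627)
print(f"95% CI for the proportion: ({lower:.4f}, {upper:.4f})")
```

The Wald interval is symmetric about the sample proportion; score-based intervals are not, so small differences from software output are expected.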
8. Do loans from ISO SPS have significantly different average PRSM scores than those from EZ
Check?
Let H0: µ1 − µ2 = 0, HA: µ1 − µ2 ≠ 0,
where H0 is the null hypothesis, µ1 is the population mean PRSM score of loans from ISO SPS, and µ2
is that of loans from EZ Check.
The p-value (Prob > |t|) is < 0.0001. If the null hypothesis were true, a difference at least this
large would occur with probability below 0.0001.
Therefore, we can reject the null hypothesis at any conventional significance level. It is highly
likely that loans from ISO SPS have a different average PRSM score than those from EZ Check.
9. Is it possible that the results of the previous comparison have been confounded by a lurking
variable? If so, suggest a possible lurking variable that could influence the comparison.
Otherwise, explain briefly why it is not possible.
It is possible that the results have been confounded by a lurking variable if there is a systematic
difference in the type of loans granted through SPS and EZ Check. We have treated SPS and EZ Check
loans as largely similar, i.e., as if similar types of loans were given through both ISOs (loan
amount, rate of interest, repayment terms, etc.). But this assumption may not hold. The terms and
amounts of loans are not necessarily random; the loan amount, rate of interest, repayment terms,
etc. depend on the screening and application procedure. It could therefore be the case that EZ
Check is able to obtain loans at a lower rate of interest (as it is deemed a less risky company)
and hence is able to repay them faster. In that case we may have confounded the effect of a
lurking variable, the loan interest rate, with the effect of the ISO, and we would need more
information about the type of loans.
10. It is the case for most datasets that there are significantly more original loans than repeat
loans. Using two-sample comparisons, do the original/repeat loans look different in any of the
following respects? (Just report a p-value and provide a conclusion for each of the following
variables; no elaborate comparison is necessary.)
In this question, we conduct a two-sample comparison test to check whether the original and repeat
loans look different in each of the following respects.
Null hypothesis H0: µ1 − µ2 = 0
Alternative hypothesis HA: µ1 − µ2 ≠ 0
where µ1 is the mean for original loans and µ2 is the mean for repeat loans, for each of the
variables tested below.
(a) FICO
The p-value in this case is 0.0406, so at significance level α = 0.05 we reject the null
hypothesis that original and repeat loans have the same population mean FICO score. This
supports the alternative hypothesis that original and repeat loans have different population
mean FICO scores. [If the null hypothesis were true, a difference at least this large would
arise with probability 0.0406.]
(b) Years in business
The p-value in this case is 0.1474, so at significance level α = 0.05 we fail to reject the null
hypothesis that original and repeat loans have the same population mean years in business. So,
there is no significant evidence that original and repeat loans differ in this respect. [If the
null hypothesis were true, a difference at least this large would arise with probability 0.1474.]
(c) Satisfied accounts
The p-value in this case is 0.1130, so at significance level α = 0.05 we fail to reject the null
hypothesis that original and repeat loans have the same population mean of satisfied accounts. So,
there is no significant evidence that original and repeat loans differ in this respect. [If the
null hypothesis were true, a difference at least this large would arise with probability 0.1130.]
(d) Current delinquent credit lines
The p-value in this case is 0.0912, so at significance level α = 0.05 we fail to reject the null
hypothesis that original and repeat loans have the same population mean of current delinquent
credit lines. So, there is no significant evidence that original and repeat loans differ in this
respect. [If the null hypothesis were true, a difference at least this large would arise with
probability 0.0912.]
(e) Number of derogatory legal item
The p-value in this case is 0.8108, so at significance level α = 0.05 we fail to reject the null
hypothesis that original and repeat loans have the same population mean number of derogatory
legal items. So, there is no significant evidence that original and repeat loans differ in this
respect. [If the null hypothesis were true, a difference at least this large would arise with
probability 0.8108.]
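The five conclusions above all follow from the same decision rule, which can be sketched as below. The p-values are the ones reported in parts (a) to (e); the dictionary keys and the function name are illustrative.

```python
ALPHA = 0.05  # significance level used throughout this question

# p-values reported in parts (a) to (e)
p_values = {
    "FICO": 0.0406,
    "Years in business": 0.1474,
    "Satisfied accounts": 0.1130,
    "Current delinquent credit lines": 0.0912,
    "Number of derogatory legal items": 0.8108,
}


def decide(p_value: float, alpha: float = ALPHA) -> str:
    """Reject H0 only when the p-value falls below the significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decisions = {name: decide(p) for name, p in p_values.items()}

for name, decision in decisions.items():
    print(f"{name}: p = {p_values[name]:.4f} -> {decision}")
```

Only FICO falls below α = 0.05; current delinquent credit lines (p = 0.0912) would also be flagged at a looser level of α = 0.10.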