RMSC3001 2023-24 PS2
Department of Statistics
The deadline for this Problem Sheet is 2359 on Saturday 17th February. Please submit your solutions via the
link provided on the course Blackboard page - if you must submit your solutions in hard copy, please contact me
at [email protected] in advance. Late submissions will not be accepted and will
receive a mark of zero. Students may discuss set problems with others, but their final submissions must be their
own work. Do show your working - it helps us to give you marks.
Please answer the following problems.
1. “Credit Risk Analytics: Measurement Techniques, Applications, and Examples in SAS” (2016) by Baesens,
Roesch and Schuele uses the HMEQ dataset to illustrate credit scoring models. Access the online version of
the book via the CUHK library website: p39 describes the dataset’s response and characteristics; output of a
SAS modelling procedure applied to the data is shown in Exhibit 5.6 on p134-135.
Read the description of this dataset to understand what the characteristics and possible attributes are. Food
for thought (i.e. no need to write down the answer to this question): how do you expect these characteristics
to affect the probability the applicant is Good?
(a) The original dataset has 12 characteristics and one response, BAD, which equals 1 if the applicant defaulted or is
seriously delinquent and 0 otherwise. How many characteristics does the model presented in Exhibit 5.6
use? Write down those which have been discarded.
(b) The model in Exhibit 5.6 is a logistic regression. Note that, according to the documentation for the LOGISTIC
procedure in SAS, SAS doesn't see the 0s and 1s in the BAD column as numbers but rather as labels,
and by default models the logit of the lowest label, in this case 0. Write out the fitted model in full,
with all the estimated parameters.
(c) Use the model to estimate the probability of borrowers with the following attributes not defaulting:
i. LOAN = 3578, MORTDUE = 102370, VALUE = 120953, REASON = HomeImp, JOB = Office,
YOJ = 2, DEROG = 0, DELINQ = 0, CLAGE = 260.3315, NINQ = 0, CLNO = 13, DEBTINC =
31.5885
ii. LOAN = 65500, MORTDUE = 205156, VALUE = 290239, REASON = DebtCon, JOB = ProfExe,
YOJ = 2, DEROG = 0, DELINQ = 0, CLAGE = 98.8082, NINQ = 1, CLNO = 21, DEBTINC =
130.661.
(d) Exhibit 5.11 on p139-141 provides a scorecard based on the logistic regression model in Exhibit 5.6.
Calculate the scores for both the applicants in the previous question.
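Illustration for part (c), as a rough sketch only: a fitted logistic regression gives a linear predictor eta = intercept + sum of (coefficient x attribute), and the estimated probability is 1 / (1 + exp(-eta)). The Python below shows that arithmetic. Every coefficient in it is a made-up placeholder, not a value from Exhibit 5.6, and the helper name `prob_good` is hypothetical. It also assumes 0/1 reference (dummy) coding for categorical characteristics; SAS may use a different parameterisation (for example effect coding), so check the exhibit before applying the same pattern.

```python
import math

def prob_good(intercept, numeric_coefs, dummy_coefs, applicant):
    """Inverse-logit of a linear predictor (hypothetical helper).

    numeric_coefs: {characteristic: coefficient}
    dummy_coefs:   {(characteristic, level): coefficient}, assuming 0/1
                   reference coding (an assumption, not taken from the book).
    """
    eta = intercept
    for name, beta in numeric_coefs.items():
        eta += beta * applicant[name]
    for (name, level), beta in dummy_coefs.items():
        if applicant[name] == level:
            eta += beta
    return 1.0 / (1.0 + math.exp(-eta))

# Placeholder coefficients, NOT the estimates from Exhibit 5.6.
toy_intercept = 1.0
toy_numeric = {"DEBTINC": -0.05}
toy_dummy = {("REASON", "HomeImp"): -0.2}
toy_applicant = {"DEBTINC": 20.0, "REASON": "HomeImp"}

p = prob_good(toy_intercept, toy_numeric, toy_dummy, toy_applicant)
print(round(p, 4))
```

Here eta = 1.0 - 0.05 x 20 - 0.2 = -0.2, so the printed probability is just below one half.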
2. You test an imaginary scorecard on a dataset in which customers’ data and response (good or bad) are
recorded. Say only six scores (150, 200, 250, 300, 350 and 400) are possible. You observe the following:
Of all the customers with a score of 150, 0 are good, 12 are bad.
Of all the customers with a score of 200, 16 are good, 11 are bad.
Of all the customers with a score of 250, 92 are good, 15 are bad.
Of all the customers with a score of 300, 194 are good, 18 are bad.
Of all the customers with a score of 350, 360 are good, 12 are bad.
Of all the customers with a score of 400, 208 are good, 0 are bad.
(a) Calculate the co-ordinates of the five points on the ROC curve between (0, 0) and (1, 1) as the score
increases.
(b) Use Excel or other software to plot the ROC curve.
(c) Without using R, calculate the AUROC and Gini coefficient.
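Illustration for Question 2, as a rough sketch only: the Python below shows one way to obtain ROC points, the AUROC (by the trapezium rule) and the Gini coefficient (2 x AUROC - 1) from grouped counts. It uses three made-up score bands, not the data above. It also assumes the convention that the horizontal axis is the cumulative proportion of goods at or below each score and the vertical axis is the cumulative proportion of bads; check this against the convention used in lectures. The function names are hypothetical.

```python
from itertools import accumulate

def roc_points(goods, bads):
    """ROC points from counts listed in increasing score order.

    Assumed convention: x = share of goods with score <= cut-off,
                        y = share of bads with score <= cut-off.
    """
    total_g, total_b = sum(goods), sum(bads)
    points = [(0.0, 0.0)]
    for g, b in zip(accumulate(goods), accumulate(bads)):
        points.append((g / total_g, b / total_b))
    return points

def auroc(points):
    """Area under the piecewise-linear ROC curve (trapezium rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy counts for three hypothetical score bands (NOT the data in Question 2).
toy_goods = [10, 30, 60]
toy_bads = [20, 15, 5]

pts = roc_points(toy_goods, toy_bads)
auc = auroc(pts)
gini = 2 * auc - 1
print(pts, auc, gini)
```

The same arithmetic can be laid out by hand or in a spreadsheet: one row per score, cumulative proportions in two columns, and one trapezium area per row.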
3. For this question, please use R and Excel and include your R code and xlsx file in your submission.
Download the “Default.csv” file, originally from Introduction to Statistical Learning, 1st Ed. by James,
Witten, Hastie and Tibshirani, from the Problem Sheet 2 content area on Blackboard. Split the data into
two parts: from Row 2 to Row 7001 inclusive, which we call the training set; from Row 7002 to Row 10001
inclusive, which we call the testing set. The dataset has four columns:
default: binary response for whether the credit card holder defaulted that month or not;
student: binary characteristic, whether the credit card holder was a student or not;
balance: numeric characteristic, the holder’s credit card balance at the end of that month;
income: numeric characteristic, the credit card holder’s annual income.
(a) Use R (or other software) to fit a logistic regression model to the training set, using all three
characteristics. To be clear, you should fit log(Ω(Good|Characteristics)). You don’t need to coarse classify the
characteristics; you don’t need to include interactions between characteristics in your model. Write down
the parameter estimates, their standard errors and their Z-statistics. Write down the fitted equation.
(b) Now treat each credit card holder's fitted logit as a score. Find its AUROC on the testing set.
(c) Investigate three possible cut-off scores by calculating confusion matrices for P(G|x) = 0.5, 0.75, 0.9,
applying your scoring system to the testing set.
(d) The specificity of a prediction is the probability that, given H0 is true, the prediction is correct. The
sensitivity of a prediction is the probability that, given H0 is not true, the prediction is correct. How
are specificity and sensitivity related to Type I and Type II errors?
(e) Taking H0 to be “the applicant is Good”, comment on the sensitivities and specificities resulting from
using the three cut-off scores in part (c).
(f) Calculate the three swap set matrices for the three possible pairings of cut-off scores. Comment.
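Illustration for parts (c) to (e), as a rough sketch only: the Python below builds a confusion matrix at one probability cut-off and then applies the definitions of specificity and sensitivity from part (d) with H0 taken as "the applicant is Good". The six probabilities and outcomes are made up and are not from Default.csv. The rule "accept when P(G|x) is at least the cut-off" and the function names are assumptions made for the example.

```python
def confusion_counts(p_good, is_good, cutoff):
    """Count outcomes when an applicant is accepted if P(G|x) >= cutoff."""
    counts = {"good_accepted": 0, "good_rejected": 0,
              "bad_accepted": 0, "bad_rejected": 0}
    for p, good in zip(p_good, is_good):
        actual = "good" if good else "bad"
        decision = "accepted" if p >= cutoff else "rejected"
        counts[actual + "_" + decision] += 1
    return counts

def specificity_sensitivity(counts):
    """Definitions from part (d), with H0 = "the applicant is Good"."""
    goods = counts["good_accepted"] + counts["good_rejected"]
    bads = counts["bad_accepted"] + counts["bad_rejected"]
    specificity = counts["good_accepted"] / goods  # correct given H0 true
    sensitivity = counts["bad_rejected"] / bads    # correct given H0 false
    return specificity, sensitivity

# Toy fitted probabilities and true outcomes (NOT from Default.csv).
toy_p_good = [0.95, 0.80, 0.60, 0.40, 0.92, 0.70]
toy_is_good = [True, True, False, True, True, False]

counts = confusion_counts(toy_p_good, toy_is_good, cutoff=0.75)
spec, sens = specificity_sensitivity(counts)
print(counts, spec, sens)
```

Re-running `confusion_counts` with a different `cutoff` shows how raising the cut-off moves applicants from the accepted column to the rejected column, which is the comparison a swap set matrix summarises.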
THE END