Logistic Regression
Contents
Objective
About Logistic Regression
CONCEPT
Steps of developing a Logistic Regression Model
Key Metrics Finalization
Rolling Performance Windows
Data Preparation
Data Treatment
Derived variables creation
Data Split
Oversampling
Variable Selection/Reduction
Data Distribution Related Issues
Information Value
WOE Approach
MULTI COLLINEARITY CHECK
Standardization of Variables
Logistic Regression Procedure
Key Model Statistics
Model Fit Statistics
Model description
KS Statistic and Rank Ordering
Gini and Lorentz curves
Divergence Index Test
Clustering checks
Deviance and Residual Test
Hosmer and Lemeshow Test
Model Validation
Objective
The purpose of this document is to guide new joiners, or people new to logistic modelling, through each step of the process, from data collection and preparation to logistic modelling results and validation.
The level of detail at each stage will be kept at a basic level. The document allows the reader to start at the beginning and see how each stage of the process contributes to the overall problem, and how the stages interact and flow together while progressing towards a final solution and its presentation.
The focus will be on the execution of each step of the process and the methods used to verify the integrity of the process.
CONCEPT
Logistic Regression predicts the probability (P) of an event (Y) occurring through the following equation:
Log(P/(1-P)) = β0 + β1X1 + β2X2 + ... + βnXn
where:
P is the probability that the event Y occurs, P(Y=1)
Odds = P/(1-P)
Log(P/(1-P)) is the log of the odds
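Equivalently, solving the equation above for P gives the familiar logistic (sigmoid) form:

P = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}}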
METHOD OF ESTIMATION
Performance Window: the time frame from which the dependent variable (Y) comes.
[Figure: timeline showing the observation window, the observation point, and the performance window.]
In some problems, the target variable needs to be created and possibly defined. For
example, the client may want to build a model to identify potential churn but might
not have a clear definition of attrition. In such situations, it might often help to look
at the data and come up with some set of rules/algorithm to identify the dependent
variable.
Again, the definition of the dependent variable in certain cases may influence the
overall value of the model. For example, say the objective is to predict bankruptcy
of cardholders. We can choose to define the dependent variable to capture
bankruptcy next month or bankruptcy in 3 months. Clearly the latter model is more
useful if the objective of the analysis is to take some pre-emptive action against
those likely to go bankrupt.
In the current sample data, the target variable is defined as:
Exclusion Criteria:
Policy exclusions and any other exclusions need to be applied prior to model development to ensure the data is not biased and the model base is representative of the actual population.
Data Preparation
The goal of this step is to prepare a master dataset to be used in the modelling phase of the problem solution. At a minimum, this dataset should contain:
A key, or set of keys, that identifies each record uniquely
The dependent variable relevant to the problem
All independent variables that are relevant, or may be important, to the problem solution
In the early stages of a solution, it can sometimes be hard to determine an exact set of independent variables. Often, nothing is left out to begin with, and the list of relevant variables is derived and constantly updated as the process unfolds.
If the required master data is spread across several data sets, then the pertinent records and variables will need to be extracted from each dataset and merged together to form the master dataset. If this must be done, it is very important that proper keys are used across the datasets, so that we not only end up with all the needed variables in the final dataset but also merge the datasets correctly. For example, you may have a customer dataset with customer-level information such as name, date of birth, age, sex, address, etc. (a static data set), and another data set, account data, which contains account-level information such as account number, account type (savings/current/mortgage/fixed deposit), total balance, date of opening, last transaction date, etc. This account-level dataset needs to be rolled up to customer level before being merged with the customer dataset to create the master dataset.
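As a hedged illustration of this roll-up and merge (the dataset names cust and acct, the key customer_id, and the summary variables are assumptions, not names from the sample data):

/* Roll up account-level data to one row per customer (assumed names) */
proc sql;
   create table acct_cust as
   select customer_id,
          count(account_number) as num_accounts,
          sum(total_balance)    as total_balance,
          min(date_of_opening)  as first_open_date format=date9.
   from acct
   group by customer_id;
quit;

/* Merge with the static customer-level dataset by the common key */
proc sort data=cust;      by customer_id; run;
proc sort data=acct_cust; by customer_id; run;

data master;
   merge cust (in=a) acct_cust (in=b);
   by customer_id;
   if a;   /* keep every customer, with or without account records */
run;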
PS: If you try to merge two datasets by a common numeric variable whose lengths were defined differently in each dataset, you may see a warning in the log file similar to:
WARNING: Multiple lengths were specified for the BY variable by_var by input data sets. This may
cause unexpected results.
It is generally not wise to overlook log file warnings unless you have a very good
reason to. A short data step redefining the length of the shorter variable in one of
the datasets before merging will suffice to get rid of the warning, and could reveal
important data problems, such as information being truncated from some values of
the BY variables in the data set with the shorter length.
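A minimal sketch of that fix, assuming the shared key is called by_var and that a length of 8 bytes is appropriate:

/* Redefine the BY variable to a common length before merging, so the */
/* warning does not hide key values being truncated in one dataset.   */
data dataset_a;
   length by_var 8;   /* the LENGTH statement must come before the SET */
   set dataset_a;
run;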
Data Treatment
Once the master dataset has been created, a univariate macro (if available) should be run to understand the data. Certain characteristics of the data that need to be looked at are:
Variable name
Format
Number of unique values
Number of missing values
Distribution (PROC MEANS output for numeric variables; highest and lowest frequency categories for a categorical variable)
o Numeric variables: standard numerical distribution including the mean, min, max, and percentiles 1, 5, 10, 25, 50, 75, 90, 95, and 99
o Categorical variables: number of times the variable takes each categorical value
[Sample univariate macro output for variables Var1-Var8 (columns: Obs, name, type, var_length, n_pos, numobs, nmiss, unique, mean_or_top1, min_or_top2, p1_or_top3, ..., p99_or_bot2, max_or_bot1); some columns were deleted in the original and the table is not reproduced here.]
Univariate_Macro.txt
Put the library path location (where the dataset exists) and the dataset name (on which the univariate will run) in place of XXX at the bottom of the univariate code before running it.
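If the univariate macro is not available, a rough equivalent can be produced with base SAS procedures; a minimal sketch (the library, dataset and variable names are placeholders):

/* Numeric variables: counts, missing values and key percentiles */
proc means data=mylib.master n nmiss mean min p1 p5 p10 p25 p50 p75 p90 p95 p99 max;
run;

/* Categorical variables: frequency of each level, including missing */
proc freq data=mylib.master;
   tables var8 / missing;
run;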
Basic things that should be looked for when first assessing the data:
Are data formats correct?
o Are numerical variables stored as text? Do date variables need to be
converted in order to be useful?
Which variables have missing values?
Data Outliers?
Do any variables exhibit invalid values (many 9999999, 101010101, 0/1
values, etc)?
o If you have a data dictionary provided by the client, there may be
information on invalid values, so this would be the first thing to check
Are any distributions clearly out of line with expectations?
for the bin. The variable is then imputed based on the bin into which it falls. This is a good method, but it needs a very high correlation among the predictors and a very high (close to 100%) fill rate.
Do not impute missing values
A predictor variable that could be important to the model but has a large percentage of missing values should be handled either by excluding the records with missing values or by excluding the variable from the model (imputing such a variable with the above techniques and including it in the model would result in either inflated performance statistics or a model reflecting data manipulation rather than the original source data).
Outlier Treatment
It is very important to eliminate outliers from the dataset before any analysis can be
performed. Outliers can be detected using Proc Univariate output.
Comparing the P99 and max values (or the P1 and min values), we can identify the variables having possible outliers.
Here are some common ways of dealing with outliers:
The first and second methods are easier to implement but lose the ordinality of the data. The fourth method takes care of the outlier problem without losing the ordinality of the data.
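The original list of methods is not reproduced above; as a hedged sketch, one common approach is capping (winsorizing) values at the 1st and 99th percentiles, which preserves ordinality (the dataset and variable names are placeholders):

/* Get the P1 and P99 cut-offs for the variable suspected of outliers */
proc univariate data=mylib.master noprint;
   var var5;
   output out=cutoffs pctlpts=1 99 pctlpre=p_;
run;

/* Cap values outside the P1-P99 range */
data master_capped;
   if _n_ = 1 then set cutoffs;   /* brings in p_1 and p_99 */
   set mylib.master;
   if var5 > p_99 then var5 = p_99;
   else if var5 < p_1 then var5 = p_1;
run;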
Derived variables creation
Derived variables are created in order to capture all underlying trends and aspects
of the data.
Rather than just using the raw variables in the model, taking proportions and ratios and constructing indexes sometimes helps reduce bias and also helps in identifying new trends in the data.
For example, taking average monthly spend instead of total spend over the last 12 months is more insightful because it neutralizes the effect of new customers having lower spend due to their shorter tenure. The normalized average value provides a more accurate comparison amongst customers.
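A minimal sketch of the spend example (the variable names total_spend_12m and tenure_months are assumptions):

data master_derived;
   set master;
   /* Average monthly spend, normalising for the shorter tenure of new customers */
   if tenure_months > 0 then
      avg_monthly_spend = total_spend_12m / min(tenure_months, 12);
   else avg_monthly_spend = .;
run;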
Data Split
Development dataset: the model is fitted on this dataset.
Validation dataset: the fitted model is evaluated on this held-out dataset.
The development and validation samples can be split in any ratio, with 50-80% of records in the development sample. Sample code for doing a 70-30 split is below:

data temp;
   set xxx;
   n = ranuni(8);
run;

proc sort data=temp;
   by n;
run;

data training validation;
   set temp nobs=nobs;
   if _n_ <= 0.7*nobs then output training;
   else output validation;
run;
In many situations, data is scarce and it is not possible to generate separate validation datasets. In such cases, sampling techniques are used, as explained below:
Bootstrapping
This technique involves re-estimating the model over numerous randomly drawn sub-samples. It is used in several ways. Often the model coefficients are taken to be the average of the coefficients in the sub-sample models. The final model is then validated.
In other instances, bootstrapping is used as a variable reduction technique.
Here are some steps to perform the task:
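The original steps are not reproduced above; a minimal sketch of one common way to bootstrap the coefficients in SAS (the dataset name, seed, number of replicates and variable list are assumptions):

/* Draw 100 bootstrap samples, sampling with replacement */
proc surveyselect data=development out=boot_samples
     method=urs samprate=1 outhits reps=100 seed=1234;
run;

/* Fit the logistic model on each bootstrap replicate */
proc logistic data=boot_samples descending outest=boot_est noprint;
   by replicate;
   model dependent_variable = var1 var2 var3;
run;

/* Average the coefficients across replicates */
proc means data=boot_est mean std;
   var intercept var1 var2 var3;
run;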
Oversampling
When the target event is rare (less than 2-5%), it is common to oversample the rare event, that is, to take a disproportionately large number of event cases. Oversampling rare events is generally believed to lead to better predictions.
In the case of logistic regressions, oversampling only affects the intercept term and
the coefficients are left unaffected. Therefore, the rank order produced by the model
estimated on oversampled data holds true and will not be changed even if the
intercept is corrected for oversampling.
Therefore, if the objective of the logistic regression is to produce a scorecard, then
no correction is required for oversampling.
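For reference, if the intercept ever does need to be corrected (for example, to report true probabilities rather than just ranks), a commonly used prior-correction adjustment is the following, where \bar{y} is the event rate in the oversampled data and \tau is the true population event rate; this is a standard textbook adjustment, not a formula taken from this document:

\beta_0^{\text{corrected}} = \hat{\beta}_0 - \ln\!\left(\frac{1-\tau}{\tau}\cdot\frac{\bar{y}}{1-\bar{y}}\right)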
Variable Selection/Reduction
We might get client data with hundreds or even thousands of variables. It is important to reduce the dimension of the dataset by eliminating redundant and irrelevant variables before meaningful analysis can be performed.
Irrelevant variables carry no meaningful information, while redundant variables carry little or no additional information. An example of a redundant variable could be a variable C that is a linear combination of variables A and B. An irrelevant variable, on the other hand, would be something like a variable X that has very low correlation with the dependent variable Y.
Various techniques are used for variable reduction.
Data Distribution Related Issues
The following categories of variables are generally not considered for the model development exercise:
Date variables
Information Value
For logistic regression, the raw variable is replaced by its Weight of Evidence (WOE), which has higher predictability than the raw variable.
The Weight of Evidence is the log of (event rate / non-event rate) for each category (bin) of the variable considered for analysis.
Methodology to calculate Weight of Evidence:
Fine Classing: Fine classing is a method of dividing the population into categories that differentiate events from non-events. Customers belonging to categories that differentiate events from non-events have a high likelihood of having objective = positive in the final model, and hence higher predictability. To fine class the population, divide it into buckets of 5%-10% each.
1. Calculate the proportion of events in each category
Illustrative Example
csv1.sas
fclassc1_n2.sas
fclassd1_n2.sas
Fineclassing.sas
fc_code_postModificatoin.xls
The information value gives a measure of how well the characteristic can discriminate between good and bad and whether it should be considered for modelling. As a rule of thumb, apply the following cut-offs; however, these can be changed based on the data.
< 0.03: not predictive; don't consider for modelling
0.03 to 0.1: predictive; consider for modelling
> 0.1: very predictive; use in modelling
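For reference, the Information Value referred to above is computed from the same quantities as the WOE; the usual formulas, consistent with the WOE definition used later in this document, are:

\mathrm{WOE}_i = \ln\!\left(\frac{g_i / G}{b_i / B}\right), \qquad
\mathrm{IV} = \sum_i \left(\frac{g_i}{G} - \frac{b_i}{B}\right)\,\mathrm{WOE}_i

where g_i and b_i are the numbers of goods and bads in bin i, and G and B are the total numbers of goods and bads.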
Coarse Classing:
Coarse Classing is a method of grouping similar categories. To coarse class the population, group categories with similar log odds and the same sign, then calculate the log odds for the grouped category. The new log odds is the Weight of Evidence of the variable.
Each of the characteristics deemed to be predictive (information value > 0.03)
should be grouped (normally performed using fine class output and a ruler) into
larger more robust groups such that the underlying trend in the characteristic is
preserved. A rule of thumb suggests that at least 3% of the goods and 3% of the
bads should fall within a group.
For continuous characteristics 10% bands are used to get an initial indication of the
predictive patterns and the strength of the characteristic. However, for the grouping
of attributes more detailed reports are produced using 5% bands.
Try to make classes with around 5-10% of the population. Classes with less than 5% might not give a true picture of the data distribution and might lead to model instability.
The trend post coarse classing should be monotonically increasing, monotonically decreasing, a parabola or an inverted parabola. Polytonic trends are usually not acceptable.
Business inputs from the SMEs in the markets are essential for the coarse classing process, as fluctuations in variables can be better explained and the classes make business sense.
WOE Approach
Concept:
In the standard WOE approach every variable is replaced by its binned counterpart. The binned variable is created by assigning a value equal to the WOE of each of the bins formed during coarse classing.
WOE = ln (% Good / % Bad)
WOE = 0 is imputed for the bins containing missing records and for bins that consist of less than 2% of the population.
Advantage:
Every attribute of the variable is weighted differently, which takes care of the neutral weight assignment required in the dummy approach.
Disadvantage:
Fewer degrees of freedom, hence the chance of the variable being represented in the final model is lower in comparison to the dummy approach.
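A minimal sketch of the replacement step for a single variable, assuming its coarse-classed cut points and WOE values have already been worked out (the cut points and WOE values below are purely illustrative):

data master_woe;
   set master;
   /* Replace raw var3 with the WOE of its coarse-classed bin           */
   /* (bin boundaries and WOE values are illustrative assumptions only) */
   if missing(var3)    then var3_woe = 0;     /* missing bin gets WOE = 0 */
   else if var3 < 650  then var3_woe = -0.42;
   else if var3 < 756  then var3_woe =  0.05;
   else                     var3_woe =  0.61;
run;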
MULTI COLLINEARITY CHECK
Points to note:
1. If the factor loadings on a particular eigenvector are not above the cut-off, that vector is ignored and the next eigenvector is looked at.
2. Not more than 3 variables could be dropped.
3. Not more than 250 variables could be used, because of the Excel limitation on the number of rows.
4. Clear the contents in columns M and P of the Multicol8.xls sheet before starting to use the macro for each new project.
Multi Collinearity.sas
1. SAS Program:
Inputs required, apart from the library and dataset name, are: 1. the list of variables for the REG and LOGISTIC procedures; 2. the response variable name.
Output: one Excel sheet, "mc.xls", created in directory C (you can change the location and name of the file).
Please go through the program and input the values at the appropriate places (COMMENTS will guide you in doing that).
MultiCol8.ZIP
3. Outputs:
The lists of variables retained (column M) and removed (column P) will get pasted into the same Excel sheet, "MultiCol8.xls".
Tracking of variables removed from the first iteration to the last iteration. The name of the tracking file is to be specified each time the macro is run (e.g. c:\log.txt). You can use the same file name throughout the project. This file will hold the history of variables removed and the corresponding correlated variables. Please make sure that you change the file name when you are working on a new project; otherwise the existing file "log.txt" gets appended.
The idea of having this tracking file is to find the replacement variables for any variable that was dropped at any point of time. Open this txt file in Excel as Delimited, selecting Comma and Other "-" as delimiters.
Each row gives the variables that are correlated.
Columns B, C and D give the variables removed at a point. Variables from column F onwards are correlated with the removed variables and were retained at that respective point.
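If the Excel macro is not available, a rough multicollinearity check can also be run directly in SAS using variance inflation factors; a minimal sketch (the dataset and variable names are placeholders, and the VIF > 5-10 rule of thumb is a common convention rather than a value from this document):

/* VIF and condition-index diagnostics on the candidate predictors.     */
/* The VIF depends only on the predictors, so a linear regression on    */
/* the binary dependent variable is sufficient for this check.          */
proc reg data=development;
   model dependent_variable = var1 var2 var3 / vif collin;
run;
quit;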
Clustering of variables
Varclus is a SAS procedure that groups variables into clusters based on the variance of the data they explain. It is unsupervised, in that it does not depend on the dependent variable. In the background it performs a principal-component-like analysis and then makes the components orthogonal so that they contain distinct sets of variables.
Here are some practical implementation steps for running varclus:
A selected variable should have a high R-squared with its own cluster and a low R-squared with the next closest cluster.
The 1-R-squared ratio is a good selection criterion: (1 - R-squared with own cluster) / (1 - R-squared with next closest cluster).
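A minimal sketch of running the procedure (the maxeigen threshold and variable list are common illustrative choices, not values from this document); the table below shows the kind of R-squared output it produces:

/* Cluster candidate predictors; clusters are split until the second  */
/* eigenvalue within each cluster falls below the maxeigen cut-off.   */
proc varclus data=development maxeigen=0.7 short;
   var var1-var50;
run;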
Cluster   Variable   R-squared with Own Cluster   R-squared with Next Closest Cluster   1-R-squared Ratio
69        bb         0.5804                       0.468                                 0.788722
69        cc         0.443                        0.1362                                0.644825
70        kk         0.8057                       0.2993                                0.277294
70        ll         0.6345                       0.2918                                0.516097
71        mm         0.5625                       0.0013                                0.438069
72        nn         0.7797                       0.2811                                0.30644
Logistic Regression Procedure
By default, PROC LOGISTIC will drop any observation that has a missing value for any of the variables the user specifies to be considered for the model, so it is necessary that general modelling data preparation and missing imputation have occurred. Also, PROC LOGISTIC will only accept numeric variables as predictors, unless categorical-level character variables are specified in the CLASS statement. You must also be careful with categoricals that are coded with numerals: the program will treat these as if they were continuous numerics unless they are specified in the CLASS statement.
To run a logistic regression in SAS, and estimate the coefficients of the model, one
would run code similar to the below:
proc logistic data = <libname>.<modeling dataset> DESCENDING;
model dependent_variable = var1 var2 var3;
run;
The DESCENDING option tells SAS that the value of the dependent variable we wish to predict is 1, and not 0.
With no other options selected, SAS will estimate the full model, meaning all variables will be included, regardless of whether their coefficients are significantly different from 0.
PROC LOGISTIC also permits:
forward selection
backward elimination
forward stepwise
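For example, a stepwise run might look like the following sketch (the entry and stay significance levels shown are common choices, not values from this document):

proc logistic data = <libname>.<modeling dataset> DESCENDING;
   model dependent_variable = var1 var2 var3 var4 var5
         / selection=stepwise slentry=0.05 slstay=0.05;
run;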
Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         501.977          470.517
SC          505.968          494.466
-2 Log L    499.977          458.517

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   41.4590           <.0001
Score              40.1603           <.0001
Wald               36.1390           <.0001

Type 3 Analysis of Effects
Effect   DF   Wald Chi-Square   Pr > ChiSq
GRE           4.2842            0.0385
GPA           5.8714            0.0154
RANK          20.8949           0.0001
The portion of the output labeled Model Fit Statistics describes and tests the overall
fit of the model. The -2 Log L (499.977) can be used in comparisons of nested
models, but we won't show an example of that here.
In the next section of output, the likelihood ratio chi-square of 41.4590 with a p-value of <.0001 tells us that our model as a whole fits significantly better than an empty model. The Score and Wald tests are asymptotically equivalent tests of the same hypothesis tested by the likelihood ratio test; not surprisingly, these tests also indicate that the model is statistically significant.
The section labeled Type 3 Analysis of Effects shows the hypothesis tests for each of the variables in the model individually. The chi-square test statistics and associated p-values shown in the table indicate that each of the three variables in the model significantly improves the model fit. For gre and gpa, this test duplicates the test of the coefficients shown below. However, for class variables (e.g., rank), this table gives the multiple degree of freedom test for the overall effect of the variable.
Model description
Variables used in the model:
Analysis of Maximum Likelihood Estimates
Parameter   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   -5.5414    1.1381           23.7081           <.0001
GRE         0.00226    0.00109          4.2842            0.0385
GPA         0.8040     0.3318           5.8714            0.0154
RANK 1      1.5514     0.4178           13.7870           0.0002
RANK 2      0.8760     0.3667           5.7056            0.0169
RANK 3      0.2112     0.3929           0.2891            0.5908
Odds Ratio Estimates
Effect        Point Estimate   95% Wald Confidence Limits
GRE           1.002            1.000    1.004
GPA           2.235            1.166    4.282
RANK 1 vs 4   4.718            2.080    10.701
RANK 2 vs 4   2.401            1.170    4.927
RANK 3 vs 4   1.235            0.572    2.668
Association of Predicted Probabilities and Observed Responses
Percent Concordant   69.1     Somers' D   0.386
Percent Discordant   30.6     Gamma       0.387
Percent Tied         0.3      Tau-a       0.168
Pairs                34671    c           0.693
KS Statistic and Rank Ordering
[Section content largely lost in extraction: the KS statistic compares the cumulative score distributions of the goods, G(s), and the bads, B(s), against a critical value of approximately 1.36/√n.]
Gini and Lorentz curves
The Gini coefficient is based on the difference between the areas under the G(s) and B(s) curves. If G(s) and B(s) are identical, the Lorentz curve is the straight line between (0, 0) and (1, 1), and the required integral is 0.
The Lorentz curves above show the plot of the cumulative percentage of bads against the cumulative percentage of goods in the development and validation samples. The dark blue line shows the distribution of goods and bads under random scoring, whereas the brown curve (development sample) and green curve (validation sample) show the lift provided by the Conversion Rate Model over and above random selection. The model exhibits a similar level of performance across the development and validation samples, as can be seen from the almost overlapping Lorentz curves.
Divergence Index Test
The divergence index D measures the separation between the mean score of the goods (x̄g) and the mean score of the bads (x̄b), relative to the spread of the scores (s).
Null Hypothesis (H0): the mean score of the goods in the population is less than or equal to the mean score of the bads in the population. A robust model implies that the mean score for goods will be significantly greater than the mean score for bads, i.e. the null hypothesis needs to be rejected. As shown by the p-value in Table 4.2.4, the null hypothesis is rejected at the 1% level of significance.
Clustering checks
A good model should not have significant clustering of the population at any
particular score and the population must be well scattered across score points.
Deviance and Residual Test
Both the Pearson and deviance statistics test whether there is sufficient evidence that the observed data do not fit the model. The null hypothesis is that the data fit the model. If the statistics are not significant, it suggests there is no reason to assume that the model is not correct, i.e. we accept that the model generally fits the data. For this model both the Pearson and deviance tests come out insignificant, further confirming that the model fits the data.
Hosmer and Lemeshow Test
The Hosmer-Lemeshow goodness-of-fit test tells us whether we have constructed a valid overall model or not. If the model is a good fit to the data, then the Hosmer-Lemeshow goodness-of-fit test should have an associated p-value greater than 0.05. In our case the associated p-value is high for both the development and the validation samples, signalling that the model is a good fit for the data.
(1) Score is defined as the probability of being a responder (as per the Conversion Rate Model) multiplied by 1000.
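A minimal sketch of how these goodness-of-fit statistics can be requested in PROC LOGISTIC (the AGGREGATE and SCALE=NONE options produce the deviance and Pearson tests, and LACKFIT produces the Hosmer-Lemeshow test; the variable list is a placeholder):

proc logistic data = <libname>.<modeling dataset> DESCENDING;
   model dependent_variable = var1 var2 var3
         / aggregate scale=none lackfit;
run;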
Model Validation
1) Re-estimation on Hold out sample