Logistic Regression
Ramasubramanian V.
I.A.S.R.I., Library Avenue, Pusa New Delhi - 110 012
[email protected]
1. Introduction
Regression analysis is a method for investigating functional relationships among variables.
The relationship is expressed in the form of an equation or model connecting the response
(dependent) variable with one or more explanatory (predictor) variables. Most of the
variables in such a model are quantitative in nature. Estimation of the parameters of this
regression model rests on four basic assumptions. First, the response variable is linearly
related to the explanatory variables. Second, the model errors are independently and
identically distributed as normal variables with mean zero and common variance. Third, the
explanatory variables are measured without error. Fourth, the observations are equally
reliable.
If the response variable in the model is qualitative in nature, then the probabilities of the
response falling in the various categories can be modelled in place of the response itself
using the same model, but several of the assumptions of the multiple regression model are
then violated. First, a probability lies between 0 and 1, whereas the right-hand side of a
multiple regression model is unbounded. Second, the error term of the model can take only a
limited number of values, and the error variance is not constant but depends on the
probability of the response falling in a particular category.
Conventional multiple linear regression (MLR) analysis is generally applied to a quantitative
response variable; for a qualitative response variable, and more specifically for a binary
response variable, it is better to consider alternative models. Consider, for example, the
following scenarios:

A pathologist may be interested in whether the probability of a particular disease can be
predicted using tillage practice, soil texture, date of sowing, weather variables etc. as
predictor or independent variables.

An economist may be interested in determining the probability that an agro-based industry
will fail, given a number of financial ratios and the size of the firm (i.e. large or small).
Discriminant analysis could usually be used for addressing each of the above problems.
However, because the independent variables are a mixture of categorical and continuous
variables, the multivariate normality assumption may not hold. Structural relationships
among various qualitative variables in the population can be quantified using a number of
alternative techniques in which primary interest lies in a dependent factor, known as the
response factor, that depends on other independent factors. In such cases the most preferable
technique is either probit or logistic regression analysis, as neither makes any assumption
about the distribution of the independent variables. In this model building process, various
log odds related to the response factor are modelled. As a special case, if the response factor
has only two categories with probabilities p1 and p2 respectively, then the odds of getting
category one is (p1/p2). If log (p1/p2) is modelled using an ANalysis Of VAriance (ANOVA)
type of model, it is called a logit model; if the same model is treated as a regression-type
model, it is called a logistic regression model. In a real sense, logit and logistic are names of
transformations. The logit transformation maps a number p between 0 and 1 to
log {p/(1-p)}, whereas the logistic transformation maps a number x between -∞ and +∞ to
e^x/(1 + e^x). These two transformations are the inverses of each other: if the logit
transformation is applied to the logistic transformation it recovers the value x and, similarly,
if the logistic transformation is applied to the logit transformation it recovers the value p.
Apart from logit or logistic regression models, other techniques such as CART (Classification
and Regression Trees) can also be used to address such classification problems. A good
account of the literature on logistic regression is available in, to cite a few, Fox (1984) and
Kleinbaum (1994).
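The inverse relationship between the two transformations can be verified numerically; the
following SAS data step is a minimal sketch (the dataset and variable names are illustrative
only):

data logit_demo;
   p = 0.7;                        /* any probability in (0,1)             */
   x = log(p/(1-p));               /* logit transformation of p            */
   p_back = exp(x)/(1 + exp(x));   /* logistic transformation of x         */
   put x= p_back=;                 /* p_back recovers the original p = 0.7 */
run;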
If the response Yi is binary, writing the linear probability model as Yi = β0 + β1Xi + εi with
πi = P(Yi = 1), the error terms can take on only two values, namely,

εi = 1 - πi   when Yi = 1
εi = -πi      when Yi = 0

Because the error is dichotomous (discrete), the normality assumption is violated. Moreover,
the error variance is given by:

V(εi) = πi(1 - πi)² + (1 - πi)(-πi)² = πi(1 - πi)

It can be seen that the variance is a function of the πi's and is not constant. Therefore the
assumption of homoscedasticity (equal variance) does not hold.
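A quick numerical check makes this heteroscedasticity visible; the following sketch (dataset
name illustrative) tabulates the error variance π(1-π) for a few values of π:

data errvar;
   do p = 0.1, 0.3, 0.5, 0.7, 0.9;
      v = p*(1-p);   /* error variance depends on p, so it is not constant */
      output;
   end;
run;
proc print data=errvar; run;   /* v is 0.09, 0.21, 0.25, 0.21, 0.09 */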
Logistic regression models are more appropriate when the response variable is qualitative
and a non-linear relationship can be established between the response variable and the
qualitative and quantitative factors affecting it. Logistic regression addresses the same
questions as discriminant function analysis and multiple regression, but with no distributional
assumptions on the predictors: the predictors need not be normally distributed, the
relationship between response and predictors need not be linear, and the observations need
not have equal variance in each group. A good account of logistic regression can be found in
Fox (1984) and Kleinbaum (1994).
The problems of non-normality and heteroscedasticity noted above render least squares
estimation inapplicable for the linear probability model. Weighted least squares estimation,
when used as an alternative, does not constrain the fitted values to the interval (0, 1), so they
cannot be interpreted as probabilities; moreover, some of the estimated error variances may
come out negative. One solution to this problem is simply to constrain π to the unit interval
while retaining the linear relation between π and the regressor X within the interval. Thus

π = 0             when β0 + β1X < 0
π = β0 + β1X      when 0 ≤ β0 + β1X ≤ 1
π = 1             when β0 + β1X > 1
However, this constrained linear probability model has certain unattractive features, such as
abrupt changes in slope at the extremes 0 and 1, which make it hard to fit to data. A smoother
relation between π and X is generally more sensible. To correct this problem, a positive
monotone (i.e. non-decreasing) function is required to map (β0 + β1xi) to the unit interval.
Any cumulative distribution function (CDF) P meets this requirement; that is, respecify the
model as πi = P(β0 + β1xi). Moreover, it is advantageous if P is strictly increasing, for then
the transformation is one-to-one, so that the model can be rewritten as P⁻¹(πi) = β0 + β1xi,
where P⁻¹ is the inverse of the CDF P. The non-linear model for πi itself will then be both
smooth and symmetric, approaching π = 0 and π = 1 as asymptotes. Thereafter the maximum
likelihood method of estimation can be employed for model fitting.
[Figure: S-shaped logistic curve of π plotted against X]
The shape of the S-curve can be reproduced if the probabilities can be modeled with only
one predictor variable as follows:
π = P(Y=1 | X = x) = 1/(1 + e^(-z))

where z = β0 + β1x, and e is the base of the natural logarithm. Thus for more than one (say r)
explanatory variables, the probability π is modeled as

π = P(Y=1 | X1 = x1, …, Xr = xr) = 1/(1 + e^(-z)), where now z = β0 + β1x1 + … + βrxr.
To explain the popularity of logistic regression, let us consider the mathematical form on
which the logistic model is based. This function, called f(z), is given by

f(z) = 1/(1 + e^(-z)),   -∞ < z < ∞

As z → -∞, f(z) → 0, and as z → ∞, f(z) → 1; thus the range of f(z) is 0 to 1. The logistic
model is popular because the logistic function, on which the model is based, provides:

- estimates that lie in the range between zero and one; and
- an appealing S-shaped description of the combined effect of several explanatory variables
  on the probability of an event.
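A small sketch confirms these bounds numerically (dataset name illustrative):

data scurve;
   do z = -6 to 6 by 2;
      p = 1/(1 + exp(-z));   /* the logistic function stays strictly within (0,1) */
      output;
   end;
run;
proc print data=scurve; run;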
Turning to estimation: for n independent binary observations Y1, …, Yn with πi = P(Yi = 1),
the joint probability function is

g(Y1, …, Yn) = ∏(i=1 to n) πi^Yi (1 - πi)^(1-Yi) = L(β0, β1)

where L(β0, β1) replaces g(Y1, …, Yn) to show explicitly that the function can now be viewed
as the likelihood function of the parameters to be estimated, given the sample observations.
The maximum likelihood estimates of β0 and β1 in the simple logistic regression model are
those values that maximize the log-likelihood function. No closed-form solution exists, so
computer-intensive numerical search procedures are required to find the maximum likelihood
estimates β̂0 and β̂1. Standard statistical software programs such as SAS (PROC LOGISTIC)
and SPSS (Analyze - Regression - Binary Logistic) provide maximum likelihood estimates for
logistic regression. Once the estimates β̂0 and β̂1 are found, substituting them into the
response function yields the fitted response function:

π̂i = 1/(1 + e^(-(β̂0 + β̂1Xi)))
When the log of the odds of occurrence of an event is modelled using a logistic regression
model, it becomes a case of logit analysis; the logit model thus formed has a linear regression
equation on its right-hand side.
4. Model Validation
Model validation can be done by employing various tests on a fitted logistic regression
model. Tests relating to the significance of the estimated parameters, goodness of fit and
predictive ability of the model are discussed subsequently.
A widely used goodness-of-fit measure is the Hosmer-Lemeshow statistic, computed by
grouping the observations (usually into about ten groups based on their estimated
probabilities) and comparing observed and expected event counts:

Ĉ = Σ(i=1 to g) (Oi - Ni π̄i)² / [Ni π̄i (1 - π̄i)]

where
Oi = the observed number of events in the ith group,
Ni = the number of subjects in the ith group, and
π̄i = the average estimated probability of an event in the ith group.
Somers’ D is a simple modification of gamma. Unlike gamma, Somers’ D includes tied pairs
in one way or another. It is defined as

D = (Ns - Nd)/(Ns + Nd + Ty)

where Ns and Nd are the numbers of concordant and discordant pairs respectively, and Ty is
the number of pairs tied on the dependent variable Y. Somers’ D ranges from -1.0 (for
negative relationships) to 1.0.
Hit rate: the number of correct predictions divided by the sample size. The hit rate for the
model should be compared with the hit rate from the classification table for the constant-only
model.

Sensitivity: the percentage of correct predictions in the reference category (usually 1) of the
dependent variable; it refers to the ability of the model to classify an event correctly.

Specificity: the percentage of correct predictions in the given category (usually 0) of the
dependent variable; it refers to the ability of the model to classify a nonevent correctly.

False positive rate: the proportion of predicted event responses that were observed as
nonevents.

False negative rate: the proportion of predicted nonevent responses that were observed as
events.

The higher the sensitivity and specificity, and the lower the false positive and false negative
rates, the better the classificatory ability of the model.
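These measures can be computed directly from a 2x2 classification table; the following
sketch uses purely hypothetical counts for illustration:

data classmeasures;
   tp = 30; fn = 10; fp = 8; tn = 52;   /* hypothetical counts: tp/fn observed events, fp/tn observed nonevents */
   hitrate     = (tp + tn)/(tp + fn + fp + tn);
   sensitivity = tp/(tp + fn);          /* events predicted correctly           */
   specificity = tn/(tn + fp);          /* nonevents predicted correctly        */
   falsepos    = fp/(tp + fp);          /* predicted events that are nonevents  */
   falseneg    = fn/(fn + tn);          /* predicted nonevents that are events  */
   put hitrate= sensitivity= specificity= falsepos= falseneg=;
run;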
As an illustration of odds, consider the following two-way table of weather and the mood of
the boss:

                 Mood of boss
Weather        Good       Bad
Rain             82        18
Shine            60        40
The odds of an event is the ratio of the probability of the event occurring to the probability of
it not occurring. That is,

Odds = P(event)/{1 - P(event)} = P(event = 1)/P(event = 0)

In the above table, there is an 82% probability that the mood of the boss will be ‘Good’ in
case of ‘Rain’. The odds of ‘Good mood’ in the ‘Rain’ category = 0.82/0.18 ≈ 4.5, and the
odds of ‘Good mood’ in the ‘Shine’ category = 0.60/0.40 = 1.5. The odds ratio of ‘Rain’ to
‘Shine’ equals (4.5/1.5) = 3, indicating that the odds of the boss being in a good mood during
‘Rain’ are three times those during ‘Shine’. Also, there is an 18% probability that the mood
of the boss will be ‘Bad’ in case of ‘Rain’; the odds of ‘Bad mood’ in ‘Rain’ = 0.18/0.82
≈ 0.22. Thus, when the probability is very small (0.18 in this case), there is no appreciable
difference between quoting the probability and quoting the odds.
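The same arithmetic in a short SAS sketch (names illustrative):

data odds_demo;
   p_rain = 0.82; p_shine = 0.60;
   odds_rain  = p_rain/(1 - p_rain);     /* about 4.5 */
   odds_shine = p_shine/(1 - p_shine);   /* 1.5       */
   oddsratio  = odds_rain/odds_shine;    /* about 3   */
   put odds_rain= odds_shine= oddsratio=;
run;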
The importance of the odds ratio in logistic regression modelling can be further explained by
taking the simple case of the influence of an attribute “Gender” X with two levels (Male or
Female) on another attribute “opinion towards legalized abortion” Y with two levels
(Yes = 1, No = 0). Logistic regression, when written in its linearised form, takes the
following ‘logit’ form:

logit(π) = log{π/(1 - π)} = α + β'x

where α is the intercept parameter and β is a vector of slope parameters. If the response
variable has ordinal categories, say 1, 2, …, I, I+1, then generally a logistic model with a
common slope is fitted based on the cumulative probabilities of the response categories
instead of the individual probabilities. This provides a ‘parallel lines’ regression model of the
following form:

logit[Prob(Y ≤ i | x)] = αi + β'x,   1 ≤ i ≤ I

where α1, α2, …, αI are the I intercept parameters and β is the common vector of slope
parameters.
Multinomial logistic regression (taking, for simplicity, a qualitative response variable with
three ordered categories) is given by

logit[Prob(Y ≤ j | X)] = αj + βT X,   j = 1, 2

where the αj are two intercept parameters (α1 < α2), βT = (β1, β2, …, βk) is the slope
parameter vector not including the intercept terms, and XT = (X1, X2, …, Xk) is the vector of
explanatory variables. This model fits a common-slope cumulative model, i.e. a ‘parallel
lines’ regression model based on the cumulative probabilities of the response categories.
In terms of the individual category probabilities π1, π2, π3 (with π1 + π2 + π3 = 1), the two
fitted equations are:

logit(π1) = log[π1/(1 - π1)] = α1 + β1X1 + β2X2 + … + βkXk

logit(π1 + π2) = log[(π1 + π2)/(1 - π1 - π2)] = α2 + β1X1 + β2X2 + … + βkXk

where

π1(X) = e^(α1 + βT X)/(1 + e^(α1 + βT X))

π1(X) + π2(X) = e^(α2 + βT X)/(1 + e^(α2 + βT X))
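In SAS, PROC LOGISTIC fits this cumulative (‘parallel lines’) logit model by default when
the response has more than two ordered levels; a minimal sketch, with hypothetical dataset
and variable names:

proc logistic data=mydata;
   model y = x1 x2 x3;   /* y has three ordered levels: two intercepts, one common slope vector */
run;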
Consider the dataset given in the table below. Weather data during 1987-97 in the Kakori and
Malihabad mango (Mangifera indica L.) belt (Lucknow) of Uttar Pradesh are used here to
develop logistic regression models for forewarning powdery mildew caused by Oidium
mangiferae Berthet, and the models are validated using data of recent years. The forewarning
system thus obtained forewarns satisfactorily, with the results comparing well with the
observed year-wise responses. The status of the powdery mildew (its epidemic and spread)
during 1987-97 is given in the table, with the occurrence of an epidemic denoted by 1 and 0
otherwise. The variables used were maximum temperature (X1) and relative humidity (X2).
The model is given by

P(Y = 1) = 1/[1 + exp{-(β0 + β1x1 + β2x2)}]

Table: Epidemic status (Y) of powdery mildew fungal disease in mango in U.P.
Logistic regression models were developed using the maximum likelihood estimation
procedure in SAS. Consider the 1987-96 model based on second week of March average
weather data, from which the forewarning probability is obtained for the year 1997. The
parameter estimates corresponding to the intercept, X1 and X2 are obtained as β̂0 = -72.47,
β̂1 = 1.845 and β̂2 = 0.22.
Plugging in the values X1 = 31.50 and X2 = 68.29 for the year 1997, it can be seen that
P(Y=1) = 0.66. This is the forewarning probability of occurrence of powdery mildew in
mango using logistic regression modelling for 1997. The logistic regression model yielded
good results. If P(Y=1) < 0.5, the probability that an epidemic will occur is minimal;
otherwise there is a greater chance of an epidemic occurring, and this can be taken as an
objective procedure for forewarning the disease. Since there was in fact an epidemic during
the year 1997, it can be seen that the logistic regression model forewarns the actual status
correctly.
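The forewarning probability can be reproduced with a one-line computation (dataset name
illustrative):

data forewarn;
   b0 = -72.47; b1 = 1.845; b2 = 0.22;   /* estimates from the 1987-96 fit             */
   x1 = 31.50;  x2 = 68.29;              /* 1997 second-week-of-March weather values   */
   p = 1/(1 + exp(-(b0 + b1*x1 + b2*x2)));
   put p=;                               /* p is about 0.66 */
run;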
The model was fitted in SAS as follows (the data step creating PMildew is not shown):

proc logistic data=PMildew;                        /* PMildew holds the year-wise Y, MaxT and RH values */
   model epidemic(event='1') = MaxT RH / lackfit;  /* lackfit requests the Hosmer-Lemeshow test         */
run;
[SAS output: Analysis of Maximum Likelihood Estimates table (Parameter, DF, Estimate,
Standard Error, Wald Chi-Square, Pr > ChiSq); the estimates are those quoted above.]
Consider, as another example, data from the field of medical sciences relating to the
occurrence or non-occurrence of coronary heart disease (CHD) in human beings, as given in
the data step below.
data medical;
   input age n CHD;   /* age, number of subjects examined, number with CHD */
cards;
25 10 1
30 15 2
35 12 3
40 15 5
45 13 6
50 8 5
55 17 13
60 10 8
;
proc logistic data=medical;
   /* events/trials syntax; the model statement is assumed here, with lackfit and ctable
      requesting the goodness-of-fit test and classification table discussed below */
   model CHD/n = age / lackfit ctable pprob=0.5;
run;
Response Profile
Ordered Binary Total
Value Outcome Frequency
1 Event 43
2 Nonevent 57
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Intercept 1 -5.1092 1.0852 22.1641 <.0001
age 1 0.1116 0.0241 21.4281 <.0001
Testing the overall null hypothesis that BETA = 0 using the likelihood ratio and other tests
indicates that the coefficients are highly significant, and hence age has a considerable effect
on CHD.

The Hosmer-Lemeshow goodness-of-fit test with 6 degrees of freedom suggests that the fitted
model is adequate; here one looks for a large p-value (> 0.05) in order to infer that the model
fits well.
Here "Correct" columns list the numbers of subjects that are correctly predicted as events and
nonevents. Also "Incorrect" columns list both the number of nonevents incorrectly predicted as
events and the number of events incorrectly predicted as nonevents.
Logistic Regression
FALSE positive and FALSE negative rates are low, sensitivity (the ability of the model to
predict an event correctly) (84.2%) and specificity (the ability of the model to predict a
nonevent correctly) (60.5%) of the model are high enough and hence the fitted model is very
effective for prediction/ classification.
Title "Logistic Regression for Vellore District with three levels of dependent variable";
ods html;
proc logistic data=work.vellore outest = betas covout;
weight WeightLevel2;
output out=work.outputfile/*p=phat lower=lcl upper=ucl*/
predprob=(individual);
run;
Response Profile
Ordered Group Total Total
Value Code Frequency Weight
1 1 117 261942.85
2 2 111 208105.93
3 3 12 22104.90
Class Level Information

Class          Value    Design Variables
HHSize         1          1    0
               2          0    1
               3         -1   -1
HHType         1          1    0
               2          0    1
               3         -1   -1
Religion       1          1    0
               2          0    1
               3         -1   -1
SocialGroup    1          1
               2         -1
Here it can be seen that two fitted equations of the logistic regression are obtained when
there are three levels of the response variable.
References:
Fox, J. (1984). Linear Statistical Models and Related Methods with Application to Social
Research. Wiley, New York.
Kleinbaum, D.G. (1994). Logistic Regression: A Self-Learning Text. Springer, New York.