
Int. J. Data Analysis Techniques and Strategies, Vol. 3, No. 3, 2011

Logistic regression in data analysis: an overview

Maher Maalouf
School of Industrial Engineering,
University of Oklahoma,
202 W. Boyd St., Room 124,
Norman, OK, 73019, USA
E-mail: [email protected]

Abstract: Logistic regression (LR) continues to be one of the most widely used
methods in data mining in general and binary data classification in particular.
This paper is focused on providing an overview of the most important aspects
of LR when used in data analysis, specifically from an algorithmic and machine
learning perspective and how LR can be applied to imbalanced and rare events
data.

Keywords: data mining; logistic regression; LR; classification; rare events; imbalanced data.

Reference to this paper should be made as follows: Maalouf, M. (2011) ‘Logistic regression in data analysis: an overview’, Int. J. Data Analysis Techniques and Strategies, Vol. 3, No. 3, pp.281–299.

Biographical notes: Maher Maalouf received his PhD in Industrial Engineering in 2009 from the University of Oklahoma, OK. He is a Postdoctoral Research Associate in the Department of Industrial Engineering at
the University of Oklahoma. His research interests include operations research,
data mining and machine learning methods, multivariate statistics, optimisation
methods and robust regression.

1 Introduction

Logistic regression (LR) is one of the most important statistical and data mining
techniques employed by statisticians and researchers for the analysis and classification
of binary and proportional response datasets (Agresti, 2007; Hastie et al., 2009;
Hilbe, 2009; Kleinbaum et al., 2007). Some of the main advantages of LR are that it
can naturally provide probabilities and extend to multi-class classification problems
(Hastie et al., 2009; Karsmakers et al., 2007). Another advantage is that most of
the methods used in LR model analysis follow the same principles used in linear
regression (Hosmer and Lemeshow, 2000). What’s more, most of the unconstrained
optimisation techniques can be applied to LR (Lin et al., 2007). Recently, there has
been a revival of interest in LR through the implementation of methods such as the truncated Newton method. Truncated Newton methods have been effectively applied to solve large-scale optimisation problems. Komarek and Moore (2005b) were the first to show that truncated-regularised iteratively-reweighted least squares (TR-IRLS) can be effectively implemented on LR to classify large datasets, and that it can outperform support vector machines (SVM) (Vapnik, 1995), which are considered state-of-the-art. Later on, the trust region Newton method (Lin et al., 2007), which is a type of truncated Newton method, and truncated Newton interior-point methods (Koh et al., 2007) were applied to large-scale LR problems.
With regard to imbalanced and rare events data, and/or small samples as well
as certain sampling strategies (such as choice-based sampling), however, the standard
binary methods, including LR, are inconsistent unless certain corrections are applied.
The most common correction techniques are prior correction and weighting (King and
Zeng, 2001). King and Zeng (2001) applied these corrections to the LR model, and
showed that they can make a difference when the population probability of interest is
low.
This paper provides an overview of some of the algorithms and the corrections that
enable LR to be both fast and accurate from a machine learning point of view. It is
by no means an exhaustive survey of all the LR techniques in data mining. Rather, the
objective is to enable researchers to choose the right techniques based on the type of data
they deal with, while at the same time exposing them to the flexibility and effectiveness of
the LR method. In fact, binary LR models are the foundation from which more complex
models are constructed (Long and Freese, 2005). The techniques presented in this paper
apply mainly to continuous, large-scale categorical response as well as imbalanced and
rare-event datasets with no missing values.
Section 2 provides a description of the LR method. Section 3 discusses fitting
LR with the iteratively re-weighted least squares (IRLS) technique. In Section 4, we
discuss LR in imbalanced and rare events data. Section 5 provides a brief survey of LR
pertaining to other common data mining challenges and Section 6 states the conclusion.

2 The LR model

Let X ∈ Rn×d be a data matrix where n is the number of instances (examples) and d
is the number of features (parameters or attributes), and y be a binary outcomes vector.
For every instance xi ∈ Rd (a row vector in X), where i = 1 . . . n, the outcome is either
yi = 1 or yi = 0. Let the instances with outcomes of yi = 1 belong to the positive class
(occurrence of an event), and the instances with outcomes yi = 0 belong to the negative
class (non-occurrence of an event). The goal is to classify the instance xi as positive
or negative. An instance can be thought of as a Bernoulli trial (the random component)
with an expected value E[yi ] or probability pi .
A linear regression model to describe such a problem would have the matrix form

y = Xβ + ε, (1)

where ε is the error vector, and where


       
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_d \end{bmatrix}, \quad \text{and} \quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.   (2)

The vector β contains the unknown parameters, with the convention x_i ← [1, x_i] and β ← [β_0, β^T]^T. From now on, the assumption is that the intercept is included in the
vector β. Now, since y is a Bernoulli random variable with a probability distribution
P(y_i) = \begin{cases} p_i, & \text{if } y_i = 1; \\ 1 - p_i, & \text{if } y_i = 0; \end{cases}   (3)

then, the expected value of the response is

E[yi ] = 1(pi ) + 0(1 − pi ) = pi = xi β, (4)

with a variance

V (yi ) = pi (1 − pi ). (5)

It follows from the linear model

yi = xi β + εi (6)

that
\varepsilon_i = \begin{cases} 1 - p_i, & \text{if } y_i = 1, \text{ with probability } p_i; \\ -p_i, & \text{if } y_i = 0, \text{ with probability } 1 - p_i; \end{cases}   (7)

Therefore, εi has a binomial distribution with an expected value

E[\varepsilon_i] = (1 - p_i)(p_i) + (-p_i)(1 - p_i) = 0,   (8)

and a variance
V(\varepsilon_i) = E[\varepsilon_i^2] - (E[\varepsilon_i])^2 = (1 - p_i)^2 (p_i) + (-p_i)^2 (1 - p_i) - 0   (9)
= p_i (1 - p_i).   (10)

Since the expected value and variance of both the response and the error are not constant
(heteroskedastic), and the errors are not normally distributed, the least squares approach
cannot be applied. In addition, since yi ∈ {0, 1}, then a linear regression model would
lead to values above one or below zero. Thus, when the response vector is binary, the
logistic response function, as shown in Figure 1, is the appropriate one.
The logistic function commonly used to model each positive instance xi with its
expected binary outcome is given by

E[y_i = 1 \mid x_i, \beta] = p_i = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{-x_i \beta}}, \quad \text{for } i = 1, \dots, n.   (11)

The logistic (logit) transformation is the logarithm of the odds of the positive response,
and is defined as
\eta_i = g(p_i) = \ln\left(\frac{p_i}{1 - p_i}\right) = x_i \beta.   (12)

In matrix form, the logit function is expressed as

η = Xβ. (13)

The logit transformation function is important in the sense that it is linear and hence it
has many of the properties of the linear regression model. In LR, this function is also
called the canonical link function, which relates the linear predictor ηi to E[yi ] = pi
through g(pi ). In other words, the function g(.) links E[yi ] to xi through the linear
combination of xi and β (the systematic component). Furthermore, the logit function
implicitly places a separating hyperplane, β0 + ⟨x, β⟩ = 0, in the input space between
the positive and non-positive instances.

Figure 1 Logistic response function (S-shaped curve of the probability p against x, rising from 0 to 1 over the range −10 to 10)
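As a quick numerical illustration of equations (11) to (13) (not part of the original paper; the data and coefficient values below are made up), the following Python sketch computes the logistic probabilities and verifies that the logit transformation recovers the linear predictor:

import numpy as np

# Hypothetical design matrix with an intercept column and two features,
# and an illustrative coefficient vector beta = [beta_0, beta_1, beta_2].
X = np.array([[1.0,  0.5, -1.2],
              [1.0, -0.3,  0.8],
              [1.0,  2.1,  0.1]])
beta = np.array([-0.4, 1.3, 0.7])

eta = X @ beta                     # linear predictor eta_i = x_i beta, equation (13)
p = 1.0 / (1.0 + np.exp(-eta))     # logistic response, equation (11)
logit = np.log(p / (1.0 - p))      # logit transformation, equation (12)

print(p)                           # probabilities strictly between 0 and 1
print(np.allclose(logit, eta))     # True: the logit recovers the linear predictor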

The most widely used general method of estimation is the method of maximum
likelihood (ML). The ML method is based on the joint probability density of the
observed data, and acts as a function of the unknown parameters in the model
(Garthwaite et al., 2002).
Now, with the assumption that the observations are independent, the likelihood
function is
L(\beta) = \prod_{i=1}^{n} (p_i)^{y_i} (1 - p_i)^{1 - y_i} = \prod_{i=1}^{n} \left(\frac{e^{x_i \beta}}{1 + e^{x_i \beta}}\right)^{y_i} \left(\frac{1}{1 + e^{x_i \beta}}\right)^{1 - y_i},   (14)

and hence, the log-likelihood is then,


\ln L(\beta) = \sum_{i=1}^{n} \left( y_i \ln\left(\frac{e^{x_i \beta}}{1 + e^{x_i \beta}}\right) + (1 - y_i) \ln\left(\frac{1}{1 + e^{x_i \beta}}\right) \right).   (15)

Amemiya (1985) provides formal proofs that the ML estimator for LR satisfies the
ML estimators’ desirable properties. Unfortunately, there is no closed form solution to

maximise ln L(β) with respect to β. The LR maximum likelihood estimates (MLE) are
therefore obtained using numerical optimisation methods, which start with a guess and
iterate to improve on that guess. One of the most commonly used numerical methods
is the Newton-Raphson method, for which, both the gradient vector and the Hessian
matrix are needed:
\frac{\partial}{\partial \beta_j} \ln L(\beta) = \sum_{i=1}^{n} \left( y_i \frac{x_{ij}}{1 + e^{x_i \beta}} + (1 - y_i) \frac{-x_{ij} e^{x_i \beta}}{1 + e^{x_i \beta}} \right)   (16)

= \sum_{i=1}^{n} \left( y_i x_{ij} \frac{1}{1 + e^{x_i \beta}} - (1 - y_i) x_{ij} \frac{e^{x_i \beta}}{1 + e^{x_i \beta}} \right)   (17)

= \sum_{i=1}^{n} \left( y_i x_{ij} (1 - p_i) - (1 - y_i) x_{ij} p_i \right)   (18)

= \sum_{i=1}^{n} x_{ij} (y_i - p_i) = 0,   (19)

where j = 0, \dots, d and d is the number of parameters. Each of the partial derivatives is
then set to zero. In matrix form, equation (19) is written as
g(β) = ∇β ln L(β) = XT (y − p) = 0. (20)
Now, the second derivatives with respect to β are given by
\frac{\partial^2}{\partial \beta_j \partial \beta_k} \ln L(\beta) = \sum_{i=1}^{n} \frac{-x_{ij} x_{ik} e^{x_i \beta}}{(1 + e^{x_i \beta})(1 + e^{x_i \beta})}   (21)

= \sum_{i=1}^{n} \left( -x_{ij} x_{ik} \, p_i (1 - p_i) \right).   (22)

If v_i is defined as p_i(1 - p_i) and V = diag(v_1, \dots, v_n), then the Hessian matrix can be expressed as

H(\beta) = \nabla^2_{\beta} \ln L(\beta) = -X^T V X.   (23)
Since the Hessian matrix is negative definite, then the objective function is strictly
concave, with one global maximum. The LR information matrix is given by
I(β) = −E[H(β)] = XT VX. (24)
The variance of \beta is then V(\beta) = I(\beta)^{-1} = (X^T V X)^{-1}.
Over-fitting the training data may arise in LR (Hosmer and Lemeshow, 2000),
especially when the data are very high dimensional and/or sparse. One of the
approaches to reduce over-fitting is through quadratic regularisation, known also as
ridge regression, which introduces a penalty for large values of β and to obtain better
generalisation (Bishop, 2006). The regularised log-likelihood can be defined as
\ln L(\beta) = \sum_{i=1}^{n} \left( y_i \ln\left(\frac{e^{x_i \beta}}{1 + e^{x_i \beta}}\right) + (1 - y_i) \ln\left(\frac{1}{1 + e^{x_i \beta}}\right) \right) - \frac{\lambda}{2} \|\beta\|^2   (25)

= \sum_{i=1}^{n} \ln\left(\frac{e^{y_i x_i \beta}}{1 + e^{x_i \beta}}\right) - \frac{\lambda}{2} \|\beta\|^2,   (26)

where \lambda > 0 is the regularisation parameter and \frac{\lambda}{2}\|\beta\|^2 is the regularisation (penalty)
term. For binary outputs, the loss function or the deviance (DEV), also useful for
measuring the goodness-of-fit of the model, is the negative log-likelihood and is given
by the formula (Hosmer and Lemeshow, 2000; Komarek, 2004):
\mathrm{DEV}(\hat{\beta}) = -2 \ln L(\hat{\beta}).   (27)

Minimising the deviance DEV(\hat{\beta}) given in (27) is equivalent to maximising the log-likelihood (Hosmer and Lemeshow, 2000). Recent studies showed that the conjugate gradient (CG) method, when applied within the IRLS method, provides better estimates of \beta than other numerical methods (Malouf, 2002; Minka, 2003).
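To make these quantities concrete, a minimal Python sketch (an illustration added here, not code from the paper) of the regularised log-likelihood (26) and the deviance (27) is:

import numpy as np

def regularised_deviance(beta, X, y, lam):
    # Log-likelihood of equation (26): sum_i [ y_i x_i beta - ln(1 + e^{x_i beta}) ],
    # minus the ridge penalty (lambda/2) ||beta||^2 of equation (25).
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta)) - 0.5 * lam * (beta @ beta)
    # Deviance of equation (27): DEV(beta) = -2 ln L(beta).
    return -2.0 * loglik

Here X is assumed to already contain the intercept column, y is the 0/1 response vector, and lam is the regularisation parameter λ.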

3 Iteratively re-weighted least squares

One of the most popular techniques used to find the MLE of β is the IRLS method,
which uses Newton-Raphson algorithm to solve LR score equations. Each iteration finds
the weighted least squares (WLS) estimates for a given set of weights, which are used
to construct a new set of weights (Garthwaite et al., 2002). The gradient and the Hessian
are obtained by differentiating the regularised likelihood in (26) with respect to β,
obtaining, in matrix form

∇β ln L(β) = XT (y − p) − λβ = 0, (28)

∇2β ln L(β) = −XT VX − λI, (29)

where I is a d × d identity matrix. Now that the first and second derivatives are
obtained, the Newton-Raphson update formula at the (c+1)th iteration is given by

β̂ (c+1) = β̂ (c) + (XT VX + λI)−1 (XT (y − p) − λβ̂ (c) ). (30)

Since β̂ (c) = (XT VX + λI)−1 (XT VX + λI)β̂ (c) , then (30) can be rewritten as

\hat{\beta}^{(c+1)} = (X^T V X + \lambda I)^{-1} X^T \left( V X \hat{\beta}^{(c)} + (y - p) \right)   (31)

= (X^T V X + \lambda I)^{-1} X^T V z^{(c)},   (32)

where z(c) = Xβ̂ (c) + V−1 (y − p) and is referred to as the adjusted response (Hastie
et al., 2009).
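A minimal Python rendering of this regularised IRLS iteration, equations (30) to (32), is sketched below. It is an illustration under simplifying assumptions: the WLS subproblem is solved directly with numpy.linalg.solve rather than with the truncated CG inner loop discussed next, and the stopping rule follows the relative change in deviance.

import numpy as np

def irls_logistic(X, y, lam=10.0, max_iter=30, tol=1e-2):
    # Regularised IRLS for LR; X is assumed to include an intercept column.
    n, d = X.shape
    beta = np.zeros(d)
    dev_old = np.inf
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))                 # probabilities p_i
        v = np.clip(p * (1.0 - p), 1e-10, None)        # weights v_i = p_i (1 - p_i)
        z = eta + (y - p) / v                          # adjusted response z_i
        A = X.T @ (v[:, None] * X) + lam * np.eye(d)   # X^T V X + lambda I
        b = X.T @ (v * z)                              # X^T V z
        beta = np.linalg.solve(A, b)                   # WLS step, equation (32)
        eta = X @ beta
        dev = -2.0 * np.sum(y * eta - np.logaddexp(0.0, eta))
        if abs(dev_old - dev) / max(abs(dev), 1e-10) < tol:
            break
        dev_old = dev
    return beta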
Despite the advantage of the regularisation parameter, λ, in forcing positive
definiteness, if the matrix (XT VX + λI) were dense, the iterative computation could
become unacceptably slow (Komarek, 2004). This necessitates a ‘trade-off’ between convergence speed and an accurate Newton direction (Lewis et al., 2006). The method that provides such a trade-off is known as the truncated Newton method.

3.1 TR-IRLS algorithm

The WLS subproblem, (X^T V X + \lambda I)\hat{\beta}^{(c+1)} = X^T V z^{(c)}, is a linear system of d equations and variables, and solving it is equivalent to minimising the quadratic function

\frac{1}{2} \hat{\beta}^{(c+1)T} (X^T V X + \lambda I) \hat{\beta}^{(c+1)} - \hat{\beta}^{(c+1)T} X^T V z^{(c)}.   (33)

Komarek and Moore (2005a) were the first to implement a modified linear CG to
approximate the Newton direction in solving the IRLS for LR. This technique is called
TR-IRLS. The main advantage of the CG method is that it guarantees convergence in
at most d steps (Lewis et al., 2006). The TR-IRLS algorithm consists of two loops.
Algorithm 1 represents the outer loop which finds the solution to the WLS problem
and is terminated when the relative difference of deviance between two consecutive
iterations is no larger than a specified threshold ε1 . Algorithm 2 represents the inner
loop, which solves the WLS subproblems in Algorithm 1 through the linear CG method, which approximates the Newton direction. Algorithm 2 is terminated when the residual

r(c+1) = (XT VX + λI)β̂ (c+1) − XT Vz(c)

is no greater than a specified threshold ε2 . For more details on the TR-IRLS algorithm
and implementation, see Komarek (2004).
Default parameter values are given for both algorithms (Komarek and Moore, 2005a)
and are shown to provide adequate accuracy on very large datasets. For Algorithm 1,
the maximum number of iterations is set to 30 and the relative difference of deviance
threshold, ε1 , is set to 0.01. For Algorithm 2, the ridge regression parameter, λ, is set
to ten and the maximum number of iterations for the CG is set to 200 iterations. In
addition, the CG convergence threshold, ε2 , is set to 0.005, and no more than three
non-improving iterations are allowed on the CG algorithm.
Algorithm 1 LR MLE using IRLS

Data: X, y, β̂^(0)
Result: β̂
begin
    c = 0
    while |(DEV^(c) − DEV^(c+1)) / DEV^(c+1)| > ε_1 and c ≤ Max IRLS Iterations do
        for i ← 1 to n do
            p̂_i = 1 / (1 + e^(−x_i β̂^(c)))                   /* Compute probabilities */
            v_i = p̂_i (1 − p̂_i)                              /* Compute weights */
            z_i = x_i β̂^(c) + (y_i − p̂_i) / (p̂_i (1 − p̂_i))  /* Compute the adjusted response */
        V = diag(v_1, ..., v_n)
        Solve (X^T V X + λI) β̂^(c+1) = X^T V z^(c)            /* Compute β̂ via WLS (Algorithm 2) */
        c = c + 1
end

Algorithm 2 Linear CG, with A = X^T V X + λI and b = X^T V z

Data: A, b, β̂^(0)
Result: β̂ such that A β̂ = b
begin
    r^(0) = b − A β̂^(0)                                   /* Initialise the residual */
    c = 0
    while ||r^(c)||_2 > ε_2 and c ≤ Max CG Iterations do
        if c = 0 then
            ζ^(c) = 0
        else
            ζ^(c) = (r^(c)T r^(c)) / (r^(c−1)T r^(c−1))   /* Update the A-conjugacy enforcer */
        d^(c) = r^(c) + ζ^(c) d^(c−1)                     /* Update the search direction */
        s^(c) = (r^(c)T r^(c)) / (d^(c)T A d^(c))         /* Compute the optimal step length */
        β̂^(c+1) = β̂^(c) + s^(c) d^(c)                     /* Obtain the approximate solution */
        r^(c+1) = r^(c) − s^(c) A d^(c)                   /* Update the residual */
        c = c + 1
end
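For illustration, a plain Python version of the linear CG inner loop might look as follows. This is a sketch written from the standard CG recursions rather than the original TR-IRLS implementation, and it omits the cap on non-improving iterations mentioned above.

import numpy as np

def conjugate_gradient(A, b, beta0=None, eps=0.005, max_iter=200):
    # Solve A beta = b; in TR-IRLS, A = X^T V X + lambda I and b = X^T V z.
    beta = np.zeros_like(b) if beta0 is None else beta0.copy()
    r = b - A @ beta                       # initial residual
    d = r.copy()                           # initial search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        Ad = A @ d
        s = (r @ r) / (d @ Ad)             # optimal step length
        beta = beta + s * d                # update the approximate solution
        r_new = r - s * Ad                 # update the residual
        zeta = (r_new @ r_new) / (r @ r)   # A-conjugacy enforcer
        d = r_new + zeta * d               # update the search direction
        r = r_new
    return beta

Within the outer loop of Algorithm 1, such a routine would be called once per IRLS iteration with A = X^T V X + λI and b = X^T V z^(c).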

Once the optimal MLE for β̂ are found, classification of any given ith instance, xi , is
carried out according to the following rules

\hat{y}_i = \begin{cases} 1, & \text{if } \hat{\eta}_i \ge 0 \text{ or } \hat{p}_i \ge 0.5; \\ 0, & \text{otherwise}. \end{cases}   (34)
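Given a fitted coefficient vector, the rule in (34) can be sketched in Python as follows (illustrative only):

import numpy as np

def predict(X, beta_hat, threshold=0.5):
    # Classify instances by equation (34): y_hat = 1 if p_hat >= 0.5, else 0.
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    return (p_hat >= threshold).astype(int)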

Aside from the implementation simplicity of TR-IRLS, the main advantage of the
algorithm is that it can process and classify large datasets in little time compared to other methods such as SVM. In addition, TR-IRLS is robust to linear dependencies
and data scaling, and its accuracy is comparable to that of SVM. Furthermore, the
algorithm does not require parameter tuning. This is an important characteristic when
the goal is to classify large and balanced datasets.
Despite all of the aforementioned advantages of TR-IRLS, the algorithm is not
designed to handle rare events data, and it is not designed to handle small-to-medium
size datasets that are highly non-linearly separable (Maalouf and Trafalis, 2008).

4 LR in imbalanced and rare events data

4.1 Endogenous (choice-based) sampling

Almost all of the conventional classification methods are based on the assumption
that the training data consist of examples drawn from the same distribution as the
testing data (or real-life data) (Visa and Ralescu, 2005; Zadrozny, 2004). Likewise in
generalised linear models (GLM), likelihood functions solved by methods such as LR
are based on the concepts of random sampling or exogenous sampling (King and Zeng,
2001; Xie and Manski, 1989). To see why this is the case (Amemiya, 1985; Cameron

and Trivedi, 2005), under random sampling, the true joint distribution of y and X is
P (y|X)P (X), and the likelihood function based on n binary observations is given by

L_{\mathrm{Random}} = \prod_{i=1}^{n} P(y_i \mid x_i, \beta) P(x_i).   (35)

Under exogenous sampling, the sampling is on X according to a distribution f (X),


which may not reflect the actual distribution P (X), and then y is sampled according to
its true distribution probability P (y|X). The likelihood function would then be

L_{\mathrm{Exogenous}} = \prod_{i=1}^{n} P(y_i \mid x_i, \beta) f(x_i).   (36)

As long as the ML estimator is not related to P (X) or f (X), then maximising LRandom
or LExogenous is equivalent to maximising

L = \prod_{i=1}^{n} P(y_i \mid x_i, \beta),   (37)

which is exactly the likelihood maximised by LR in (14) (Zadrozny, 2004).


While the ML method is the most important method of estimation and enjoys wide applicability, it is well known that the MLE of the unknown parameters, with the exception of the normal distribution, are biased in small samples. The ML properties are satisfied mainly asymptotically, that is, under the assumption of large samples (Garthwaite et al., 2002; Collett, 2003). In addition, while it is ideal that sampling be either random or exogenous, since it then reflects the population or the testing data distribution, this sampling strategy has three major disadvantages when applied to rare events (REs). First, in data collection surveys, it would be very time consuming and costly to collect data on events that occur rarely. Second, in data mining, the data to be analysed could need to be very large in order to contain enough REs, and hence the computational cost could be high. Third, while the ML estimator is consistent in analysing such data, it is biased in the sense that the probabilities it generates underestimate the actual probabilities of occurrence. Cox and Hinkley (1979) provided a general
rough approximation for the asymptotic bias, developed originally by Cox and Snell
(1968), such that
E[\hat{\beta} - \beta] = -\frac{1}{2n} \frac{i_{30} + i_{11}}{i_{20}^2},   (38)

where i_{30} = E\left[\left(\frac{\partial L}{\partial \beta}\right)^3\right], i_{11} = E\left[\left(\frac{\partial L}{\partial \beta}\right)\left(\frac{\partial^2 L}{\partial \beta^2}\right)\right], and i_{20} = E\left[\left(\frac{\partial L}{\partial \beta}\right)^2\right], all evaluated at \hat{\beta}. Following King and Zeng (2001), if p_i = \frac{1}{1 + e^{-\beta_0 + x_i}}, then the asymptotic
bias is
E[\hat{\beta}_0 - \beta_0] = -\frac{1}{n} \frac{E\left[(0.5 - \hat{p}_i)\left((1 - \hat{p}_i)^2 y_i + \hat{p}_i^2 (1 - y_i)\right)\right]}{\left(E\left[(1 - \hat{p}_i)^2 y_i + \hat{p}_i^2 (1 - y_i)\right]\right)^2}   (39)

\approx \frac{p - 0.5}{n p (1 - p)},   (40)

where p is the proportion of events in the sample. Therefore, as long as p is less than 0.5
and/or n is small, the bias in (40) will not be equal to zero. Furthermore, the variance
would be large. To see this mathematically, consider the variance matrix of the LR
estimator, β̂, given by

V(\hat{\beta}) = \left[ \sum_{i=1}^{n} p_i (1 - p_i) x_i^T x_i \right]^{-1}.   (41)

The variance given in (41) is smallest when the term p_i(1 − p_i), which is affected by rare events, is at its largest, that is, when p_i is close to 0.5. This occurs when the number of ones is large enough in the
sample. However, the estimate of pi with observations related to rare events is usually
small, and hence additional ones would cause the variance to drop while additional zeros
at the expense of events would cause the variance to increase (King and Zeng, 2001;
Palepu, 1986). The strategy is to select on y by collecting observations for which yi = 1
(the cases), and then selecting random observations for which yi = 0 (the controls). The
objective then is to keep the variance as small as possible by keeping a balance between
the number of events (ones) and non-events (zeros) in the sample under study. This
is achieved through endogenous sampling or choice-based sampling. Endogenous
sampling occurs whenever sample selection is based on the dependent variable (y),
rather than on the independent (exogenous) variable (X).
However, since the objective is to derive inferences about the population from
the sample, the estimates obtained by the common likelihood using pure endogenous
sampling are inconsistent. King and Zeng (2001) recommend two methods of estimation
for choice-based sampling, prior correction and weighting.
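As a quick numeric illustration of the approximation in (40) (the values of n and p below are made up), the following Python sketch shows how the intercept bias grows as events become rarer and the sample becomes smaller:

def intercept_bias(n, p):
    # Rough intercept-bias approximation of equation (40): (p - 0.5) / (n p (1 - p)).
    return (p - 0.5) / (n * p * (1.0 - p))

for n in (100, 1000, 10000):
    for p in (0.5, 0.1, 0.01):
        print(f"n={n:6d}  p={p:5.2f}  bias ~ {intercept_bias(n, p):+.4f}")

For a balanced sample (p = 0.5) the approximate bias vanishes, while for p = 0.01 and n = 100 it is roughly −0.49.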

4.2 Correcting estimates under endogenous sampling

4.2.1 Prior correction


Consider a population of N examples, with τ the proportion of events and (1 − τ) the proportion of non-events. Let the event of interest be y = 1 in the population, with probability p̃. Let n be the sample size, with ȳ and (1 − ȳ) representing the proportions of events and non-events in the sample, respectively. Then, let p̂ be the probability of the event in the sample, and let s = 1 denote a selected event. By Bayes' formula (Palepu, 1986; Cramer, 2003),

\hat{p} = P(y = 1 \mid s = 1) = \frac{P(s = 1 \mid y = 1) P(y = 1)}{P(s = 1 \mid y = 1) P(y = 1) + P(s = 1 \mid y = 0) P(y = 0)}   (42)

= \frac{\left(\frac{\bar{y}}{\tau}\right) \tilde{p}}{\left(\frac{\bar{y}}{\tau}\right)\tilde{p} + \left(\frac{1 - \bar{y}}{1 - \tau}\right)(1 - \tilde{p})}.   (43)

If the sample is random, then ȳ = τ and 1 − ȳ = 1 − τ, hence p̂ = p̃ and there is no inconsistency. When endogenous sampling is used to analyse imbalanced or rare events data, τ < (1 − τ) while ȳ ≈ (1 − ȳ), and hence p̂ ≠ p̃, regardless of the sample size.
Now, assuming that p̂ and p̃ are logit probabilities, and letting \nu_1 = \frac{\bar{y}}{\tau} and \nu_0 = \frac{1 - \bar{y}}{1 - \tau}, equation (43) can be rewritten as

\hat{p} = \frac{\nu_1 \tilde{p}}{\nu_1 \tilde{p} + \nu_0 (1 - \tilde{p})}.   (44)

The odds from (44) are then

O = \frac{\hat{p}}{1 - \hat{p}} = \frac{\nu_1 \tilde{p}}{\nu_0 (1 - \tilde{p})},   (45)

and the log odds is

\ln(O) = \ln\left(\frac{\nu_1}{\nu_0}\right) + \ln(\tilde{p}) - \ln(1 - \tilde{p}),   (46)

which implies that

x\hat{\beta} = \ln\left[\left(\frac{1 - \tau}{\tau}\right)\left(\frac{\bar{y}}{1 - \bar{y}}\right)\right] + x\tilde{\beta}.   (47)

Prior correction is therefore easy to apply as it involves only correcting the intercept
(King and Zeng, 2001; Cramer, 2003), β0 , such that
\tilde{\beta}_0 = \hat{\beta}_0 - \ln\left[\left(\frac{1 - \tau}{\tau}\right)\left(\frac{\bar{y}}{1 - \bar{y}}\right)\right],   (48)

thereby making the corrected logit probability

\tilde{p}_i = \frac{1}{1 + e^{\ln\left[\left(\frac{1 - \tau}{\tau}\right)\left(\frac{\bar{y}}{1 - \bar{y}}\right)\right] - x_i \hat{\beta}}}, \quad \text{for } i = 1, \dots, n.   (49)

Prior correction requires knowledge of the fraction of events in the population, τ . The
advantage of prior correction is its simplicity. However, the main disadvantage of this
correction is that if the model is misspecified, then the estimates of both \hat{\beta}_0 and \hat{\beta} are less robust than with weighting (King and Zeng, 2001; Xie and Manski, 1989).
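A small Python sketch of prior correction, following equation (48) (illustrative only; the coefficients, τ and ȳ below are made up, and τ is assumed to be known):

import numpy as np

def prior_correct_intercept(beta_hat, tau, y_bar):
    # Shift only the intercept (stored in position 0) by equation (48),
    # given the population proportion of events tau and the sample proportion y_bar.
    beta_tilde = beta_hat.copy()
    beta_tilde[0] -= np.log(((1.0 - tau) / tau) * (y_bar / (1.0 - y_bar)))
    return beta_tilde

# Example: 2% events in the population, 50% events in the choice-based sample.
beta_hat = np.array([0.3, 1.2, -0.7])
print(prior_correct_intercept(beta_hat, tau=0.02, y_bar=0.5))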

4.2.2 Weighting
Under pure endogenous sampling, the conditioning is on X rather than y (Cameron
and Trivedi, 2005; Milgate et al., 1990), and the joint distribution of y and X in the
sample is

fs (y, X|β) = Ps (X|y, β)Ps (y), (50)

where β is the unknown parameter to be estimated. Yet, since X is a matrix of


exogenous variables, then the conditional probability of X in the sample is equal to that

in the population, or Ps (X|y, β) = P (X|y, β). However, the conditional probability in


the population is

P(X \mid y, \beta) = \frac{f(y, X \mid \beta)}{P(y)},   (51)

but

f (y, X|β) = P (y|X, β)P (X), (52)

and hence, substituting and rearranging yields


f_s(y, X \mid \beta) = \frac{P_s(y)}{P(y)} P(y \mid X, \beta) P(X)   (53)

= \frac{H}{Q} P(y \mid X, \beta) P(X),   (54)

where \frac{H}{Q} = \frac{P_s(y)}{P(y)}. The likelihood is then


L_{\mathrm{Endogenous}} = \prod_{i=1}^{n} \frac{H_i}{Q_i} P(y_i \mid x_i, \beta) P(x_i),   (55)
where \frac{H_i}{Q_i} = \left(\frac{\bar{y}}{\tau}\right) y_i + \left(\frac{1 - \bar{y}}{1 - \tau}\right)(1 - y_i). Therefore, when dealing with REs and
imbalanced data, it is the likelihood in (55) that needs to be maximised (Amemiya, 1985;
Xie and Manski, 1989; Cameron and Trivedi, 2005; Manski and Lerman, 1977; Imbens
and Lancaster, 1996). Several consistent estimators of this type of likelihood have been
proposed in the literature. Amemiya (1985) and Ben Akiva and Lerman (1985) provide
an excellent survey of these methods.
Manski and Lerman (1977) proposed the weighted exogenous sampling maximum
likelihood (WESML), and proved that WESML yields a consistent and asymptotically
normal estimator so long as knowledge of the population probability is available. More
recently, Ramalho and Ramalho (2007) extended the work of Manski and Lerman
(1977) to cases where such knowledge may not be available. Knowledge of population
probability or proportions, however, can be acquired from previous surveys or existing
databases. The log-likelihood for LR can then be rewritten as
\ln L(\beta \mid y, X) = \sum_{i=1}^{n} \frac{Q_i}{H_i} \ln P(y_i \mid x_i, \beta)   (56)

= \sum_{i=1}^{n} \frac{Q_i}{H_i} \ln\left(\frac{e^{y_i x_i \beta}}{1 + e^{x_i \beta}}\right)   (57)

= \sum_{i=1}^{n} w_i \ln\left(\frac{e^{y_i x_i \beta}}{1 + e^{x_i \beta}}\right),   (58)

where w_i = \frac{Q_i}{H_i}. Thus, in order to obtain consistent estimators, the likelihood is
multiplied by the inverse of the fractions. The intuition behind weighting is that if the proportion of events in the sample is greater than that in the population, then the ratio \frac{Q}{H} < 1 and hence the events are given less weight, while the non-events are given more weight if their proportion in the sample is less than that in the population.
This estimator, however, is not fully efficient, because the information matrix equality
does not hold. This is demonstrated as
-E\left[\frac{Q}{H} \nabla^2_{\beta} \ln P(y \mid X, \beta)\right] \ne E\left[\left(\frac{Q}{H} \nabla_{\beta} \ln P(y \mid X, \beta)\right)\left(\frac{Q}{H} \nabla_{\beta} \ln P(y \mid X, \beta)\right)^T\right],   (59)

and for the LR model it is


-\left[\frac{1}{n}\sum_{i=1}^{n} \left(\frac{Q_i}{H_i}\right) p_i (1 - p_i) X_i X_j\right] \ne \left[\frac{1}{n}\sum_{i=1}^{n} \left(\frac{Q_i}{H_i}\right)^2 p_i (1 - p_i) X_i X_j\right].   (60)

Let A = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Q_i}{H_i}\right) p_i (1 - p_i) X_i X_j and B = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Q_i}{H_i}\right)^2 p_i (1 - p_i) X_i X_j; then the asymptotic variance matrix of the estimator \beta is given by the sandwich estimate, V(\beta) = A^{-1} B A^{-1} (Amemiya, 1985; Xie and Manski, 1989; Manski and Lerman, 1977).
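The weights w_i = Q_i/H_i in (58) depend only on the population proportion τ and the sample proportion ȳ. A brief Python sketch (illustrative, with τ assumed known) of the weights and of the weighted log-likelihood is:

import numpy as np

def wesml_weights(y, tau, y_bar):
    # w_i = Q_i / H_i: tau / y_bar for events, (1 - tau) / (1 - y_bar) for non-events.
    return np.where(y == 1, tau / y_bar, (1.0 - tau) / (1.0 - y_bar))

def weighted_loglik(beta, X, y, w):
    # Weighted LR log-likelihood of equation (58).
    eta = X @ beta
    return np.sum(w * (y * eta - np.logaddexp(0.0, eta)))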
Now that consistent estimators are obtained, finite-sample/rare-event bias corrections
could be applied. King and Zeng (2001) extended the small-sample bias corrections, as
described by McCullagh and Nelder (1989), to include the weighted likelihood (58),
and demonstrated that even with choice-based sampling, these corrections can make a
difference when the population probability of the event of interest is low. According
to McCullagh and Nelder (1989), and later Cordeiro and McCullagh (1991), the bias
vector is given by

bias(β̂) = (XT VX)−1 XT Vξ, (61)

where \xi_i = Q_{ii}(\hat{p}_i - \frac{1}{2}), and Q_{ii} are the diagonal elements of Q = X(X^T V X)^{-1} X^T, which is the approximate covariance matrix of the logistic link function \eta. The
second-order bias-corrected estimator is then,

β̃ = β̂ − bias(β̂). (62)

As for the variance matrix V(β̃) of β̃, it is estimated using


V(\tilde{\beta}) = \left(\frac{n}{n + d}\right)^2 V(\hat{\beta}).   (63)

Since \left(\frac{n}{n + d}\right)^2 < 1, then V(\tilde{\beta}) < V(\hat{\beta}), and hence both the variance and the bias are now reduced.
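A Python sketch of this second-order correction, equations (61) to (63), written under the simplifying assumption that the uncorrected estimate is finite, is:

import numpy as np

def bias_corrected_estimate(beta_hat, X):
    # beta_hat is the (possibly weighted) MLE; X is assumed to include the intercept column.
    eta = X @ beta_hat
    p = 1.0 / (1.0 + np.exp(-eta))
    V = np.diag(p * (1.0 - p))
    XtVX_inv = np.linalg.inv(X.T @ V @ X)
    Q = X @ XtVX_inv @ X.T                  # approximate covariance of the linear predictor
    xi = np.diag(Q) * (p - 0.5)             # xi_i = Q_ii (p_i - 1/2)
    bias = XtVX_inv @ X.T @ V @ xi          # equation (61)
    beta_tilde = beta_hat - bias            # equation (62)
    n, d = X.shape
    var_scale = (n / (n + d)) ** 2          # variance shrinkage factor of equation (63)
    return beta_tilde, var_scale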
The main advantage then of the bias correction method proposed by McCullagh
and Nelder (1989) is that it reduces both the bias and the variance (King and Zeng,
2001). The disadvantage of this bias correction method is that it is corrective and not

preventive, since it is applied after the estimation is complete, and hence it does not
protect against infinite parameter values that arise from perfect separation between the
classes (Heinze and Schemper, 2001; Wang and Wang, 2001). Hence, this bias correction
method can only be applied if the estimator, β̂, has finite values. Firth (1993) proposed
a preventive second-order bias correction method by penalising the log-likelihood such
that


\ln L(\beta \mid y, X) = \sum_{i=1}^{n} \ln\left(\frac{e^{y_i x_i \beta}}{1 + e^{x_i \beta}}\right) + \frac{1}{2}\ln|I(\beta)|,   (64)

which leads to a modified score equation given by

\frac{\partial}{\partial \beta_j}\ln L(\beta) = \sum_{i=1}^{n}\left(x_{ij}\left(y_i - p_i + h_i(0.5 - p_i)\right)\right) = 0,   (65)

where hi is the ith diagonal element of the hat matrix

H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}.   (66)

A recent comparative simulation study by Maiti and Pradhan (2008) showed that the
bias correction of McCullagh and Nelder (1989), provides the smallest mean squared
error (MSE) when compared to that of Firth (1993) and others using LR. Cordeiro and
Barroso (2007) more recently derived a third-order bias corrected estimator and showed
that in some cases it could deliver improvements in terms of bias and MSE over the
usual ML estimator and that of Cordeiro and McCullagh (1991).
Now, as mentioned earlier, LR regularisation is used in the form of the ridge penalty \frac{\lambda}{2}\|\beta\|^2. When regularisation is introduced, none of the coefficients is set to
zero (Park and Hastie, 2008), and hence the problem of infinite parameter values
is avoided. In addition, the importance of the parameter λ lies in determining the
bias-variance trade-off of an estimator (Cowan, 1998; Maimon and Rokach, 2005).
When λ is very small, there is less bias but more variance. On the other hand, larger
values of λ would lead to more bias but less variance (Berk, 2008). Therefore, the
inclusion of regularisation in the LR model is very important to reduce any potential
inefficiency. However, as regularisation carries the risk of a non-negligible bias, even
asymptotically (Berk, 2008), the need for bias correction becomes inevitable (Maalouf
and Trafalis, 2011). In sum, bias correction is needed to account for any bias resulting
from regularisation, small samples, and rare events.
The challenge remains on finding the best class distribution in the training dataset.
First, when both the events and non-events are easy to collect and both are available,
then a sample with equal number of ones and zeros would be generally optimum
(Cosslett, 1981; Imbens, 1992). Second, when the number of events in the population
is very small, the decision is then how many more non-events to collect in addition
to the events. If collecting more non-events is inexpensive, then the general judgment
is to collect as many non-events as possible. However, as the number of non-events
exceeds the number of events, the marginal contribution to the explanatory variables’
information content starts to drop, and hence the number of zeros should be no more
than two to five times the number of ones (King and Zeng, 2001).

Applying the above corrections, offered by King and Zeng (2001), along with the
recommended sampling strategies, such as collecting all of the available events and only
a matching proportion of non-events, could:
1 significantly decrease the sample size under study
2 cut data collection costs
3 increase the rare event probability
4 enable researchers to focus more on analysing the variables.

5 LR and other data mining challenges

Data quality problems, such as noise, missing data, collinearity, redundant attributes,
and over-fitting, are common in data mining. Several strategies and LR models have
been proposed in the literature to deal with those issues. Noise is defined as any random
error in the data mining process (Mitsa, 2010). Bi and Jeske (2010) compared LR to
the normal discriminant analysis (NDA) method and showed that LR is more efficient
and deteriorates less than NDA in the presence of class-conditional classification noise (CCC-noise). With regard to missing data, Horton and Kleinman (2007) described
different methods to fit LR models with missing data and reviewed their implementation
using general-purpose statistical software. Applying LR models in missing data has also
been addressed by Agresti (2002).
The problem of LR in the presence of collinearity (when two or more variables are highly correlated) has been addressed by Rossi (2009). The stepwise LR method, well described by Hosmer and Lemeshow (2000), is an effective method for feature selection and for reducing the number of redundant and/or irrelevant variables. The method is relatively easy to apply, as it includes or excludes variables based on the deviance of the fitted model associated with those variables. Active learning, a recent and popular method for reducing the size of the training data based on incremental learning, is very useful for mining massive datasets, especially when labelling is expensive (Pal and Mitra, 2004). Schein and Ungar (2007) provide an excellent evaluation of different active learning techniques for LR. To achieve good generalisation and to avoid the problem of over-fitting, an n-fold cross-validation (ten folds are usually adequate) is performed (Efron and Tibshirani, 1994). The n-fold cross-validation divides the dataset into n folds, retaining n − 1 folds for training while the remaining fold is used for testing. This is done iteratively until every fold has served as the testing dataset (Berthold and Hand, 2010), and the results are then averaged to produce the final accuracy.
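A minimal Python sketch of such a ten-fold cross-validation for a regularised LR model, using scikit-learn and synthetic data purely for illustration, is:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # synthetic features
y = (X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3])
     + rng.normal(size=500) > 0).astype(int)       # synthetic binary response

model = LogisticRegression(C=0.1, max_iter=1000)   # C is the inverse of lambda
scores = cross_val_score(model, X, y, cv=10)       # ten-fold cross-validation
print(scores.mean())                               # averaged accuracy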
For data with dichotomous, polychotomous (multinomial), and continuous independent variables, Agresti (2007) and Hosmer and Lemeshow (2000) provide the appropriate analyses and LR models.
Boosting (Brazdil et al., 2008) is an important machine-learning method and is used
to improve the accuracy of any given classifier. Iterative algorithms such as AdaBoost
(Freund and Schapire, 1999) assign different weights to the training distribution in each
iteration. After each iteration, boosting increases the weights associated with incorrectly
classified examples and decreases the weights associated with the correctly classified
ones. A variant of AdaBoost, AdaCost (Fan et al., 1999), has been shown to be useful in addressing the problem of rarity and imbalance in data. Analyses of boosting techniques, however, showed that boosting performance is tied to the choice of the base learning algorithm (Joshi et al., 2001, 2002). Thus, if the base learning algorithm is a good classifier without boosting, then boosting would be useful when that base learner is applied to REs. With
regard to LR, Friedman et al. (2000) suggested using the logistic function to estimate
the probabilities of the AdaBoost output. The authors showed that the properties of
their proposed method, LogitBoost, are identical to those of AdaBoost. Collins et al.
(2002) proposed a more direct modification of AdaBoost for the logistic loss function
by deriving their algorithm using a unification of LR and boosting based on Bregman
distances. Furthermore, they also generalised the algorithm for multiple classes.
LR linearity may be an obstacle to handling highly non-linearly separable
small-to-medium size datasets (Komarek, 2004). When analysing highly non-linear
datasets, the underlying assumption of linearity in the LR model, as evident in its logit
function, is often violated (Hastie et al., 2009). With the advancement of kernel methods,
the search for an effective non-parametric LR model, capable of classifying non-linearly
separable data, has become possible. Like LR, kernel logistic regression (KLR) can
naturally provide probabilities and extend to multi-class classification problems (Hastie
et al., 2009; Karsmakers et al., 2007). Interested readers should consult papers written
on KLR (Canu and Smola, 2005; Jaakkola and Haussler, 1999; Hastie et al., 2009;
Keerthi et al., 2005) and how some of the methods described in this overview can be
effectively extended to KLR (Roth, 2001; Maalouf and Trafalis, 2008, 2011).

6 Conclusions

LR provides an excellent means of modelling the dependence of a binary or multi-class response variable on one or more independent variables, which may be categorical or continuous, or both. The resulting model can be fitted by a number of methods, the most important of which is the IRLS method, which in turn is best solved using the CG method in the form of the truncated Newton method. Furthermore, with regard to imbalanced and rare events datasets,
certain sampling strategies and appropriate corrections should be applied to the LR
method. The most common correction techniques are prior correction and weighting.
In addition, LR is adaptable to handle other data mining challenges, such as the
problems of collinearity, missing data, redundant attributes and non-linear separability,
among others, making LR a powerful and resilient data mining method. It is our hope
that this overview of the LR method, and of the state-of-the-art techniques developed for it in the literature, will shed further light on this method as well as encourage and direct future theoretical and applied research in it.

Acknowledgements

The author would like to thank Dr. S. Lakshmivarahan, Dr. Hillel Kumin and
Dr. Theodore B. Trafalis of the University of Oklahoma for their valuable input,
comments and suggestions.

References
Agresti, A. (2002) Categorical Data Analysis, 2nd ed., Wiley-Interscience.
Agresti, A. (2007) An Introduction to Categorical Data Analysis, Wiley-Interscience.
Amemiya, T. (1985) Advanced Econometrics, Harvard University Press.
Ben-Akiva, M. and Lerman, S. (1985) Discrete Choice Analysis: Theory and Application to Travel
Demand, The MIT Press.
Berk, R. (2008) Statistical Learning from a Regression Perspective, 1st ed., Springer.
Berthold, M.R. and Hand, D.J. (Eds.) (2010) Intelligent Data Analysis, 2nd ed., Springer.
Bi, Y. and Jeske, D.R. (2010) ‘The efficiency of logistic regression compared to normal discriminant
analysis under class-conditional classification noise’, Journal of Multivariate Analysis, Vol. 101,
No. 7, pp.1622–1637.
Bishop, C.M. (2006) Pattern Recognition and Machine Learning, Springer.
Brazdil, P., Giraud-Carrier, C., Soares, C. and Vilalta, R. (2008) Metalearning: Applications to Data
Mining, 1st ed., Springer.
Cameron, A.C. and Trivedi, P.K. (2005) Microeconometrics: Methods and Applications, Cambridge
University Press.
Canu, S. and Smola, A.J. (2005) ‘Kernel methods and the exponential family’, in ESANN, pp.447–454.
Collett, D. (2003) Modelling Binary Data, 2nd ed., Chapman & Hall/CRC.
Collins, M., Schapire, R.E. and Singer, Y. (2002) ‘Logistic regression, adaboost and bregman
distances’, Machine Learning, Vol. 48, pp.253–285.
Cordeiro, G. and Barroso, L. (2007) ‘A third-order bias corrected estimate in generalized linear
models’, TEST: An Official Journal of the Spanish Society of Statistics and Operations Research,
Vol. 16, No. 1, pp.76–89.
Cordeiro, G.M. and McCullagh, P. (1991) ‘Bias correction in generalized linear models’, Journal of
Royal Statistical Society, Vol. 53, No. 3, pp.629–643.
Cosslett, S.R. (1981) Structural Analysis of Discrete Data and Econometric Applications, The MIT
Press, Cambridge.
Cowan, G. (1998) Statistical Data Analysis, Oxford University Press.
Cox, D.R. and Hinkley, D.V. (1979) Theoretical Statistics, Chapman & Hall/CRC.
Cox, D.R. and Snell, E.J. (1968) ‘A general definition of residuals’, Journal of the Royal Statistical
Society, Vol. 30, No. 2, pp.248–275.
Cramer, J.S. (2003) Logit Models From Economics and other Fields, Cambridge University Press.
Efron, B. and Tibshirani, R.J. (1994) An Introduction to the Bootstrap, Chapman & Hall/CRC.
Fan, W., Stolfo, S.J., Zhang, J. and Chan, P.K. (1999) ‘Adacost: misclassification cost-sensitive
boosting’, in Proceedings to the 16th International Conference on Machine Learning, Morgan
Kaufmann, pp.97–105.
Firth, D. (1993) ‘Bias reduction of maximum likelihood estimates’, Biometrika, Vol. 80, pp.27–38.
Freund, Y. and Schapire, R.E. (1999) ‘A brief introduction to boosting’, in Proceedings of the 16th
International Joint Conference on Artificial Intelligence, Morgan Kaufmann, pp.1401–1406.
Friedman, J., Hastie, T. and Tibshirani, R. (2000) ‘Additive logistic regression: a statistical view of
boosting’, The Annals of Statistics, Vol. 38, No. 2, pp.337–374.
Garthwaite, P., Jolliffe, I. and Byron, J. (2002) Statistical Inference, Oxford University Press.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning, 2nd ed.,
Springer Verlag.
Heinze, G. and Schemper, M. (2001) ‘A solution to the problem of monotone likelihood in Cox
regression’, Biometrics, Vol. 57, pp.114–119.
Hilbe, J.M. (2009) Logistic Regression Models, Chapman & Hall/CRC.

Horton, N.J. and Kleinman, K.P. (2007) ‘Much ado about nothing: a comparison of missing data
methods and software to fit incomplete data regression models’, The American Statistician,
Vol. 61, No. 1, pp.79–90.
Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd ed., Wiley.
Imbens, G.W. (1992) ‘An efficient method of moments estimator for discrete choice models with
choice-based sampling’, Econometrica, September, Vol. 60, No. 5, pp.1187–1214.
Imbens, G.W. and Lancaster, T. (1996) ‘Efficient estimation and stratified sampling’, Journal of
Econometrics, Vol. 74, pp.289–318.
Jaakkola, T. and Haussler, D. (1999) Probabilistic Kernel Regression Models.
Joshi, M.V., Agarwal, R.C. and Kumar, V. (2002) ‘Predicting rare classes: can boosting make any
weak learner strong?’, in KDD ’02: Proceedings of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM, pp.297–306.
Joshi, M.V., Kumar, V. and Agarwal, R.C. (2001) ‘Evaluating boosting algorithms to classify
rare classes: comparison and improvements’, in ICDM ’01: Proceedings of the 2001 IEEE
International Conference on Data Mining, Washington, DC, USA, pp.257–264.
Karsmakers, P., Pelckmans, K. and Suykens, J.A.K. (2007) ‘Multi-class kernel logistic regression: a
fixed-size implementation’, International Joint Conference on Neural Networks, pp.1756–1761.
Keerthi, S.S., Duan, K.B., Shevade, S.K. and Poo, A.N. (2005) ‘A fast dual algorithm for kernel
logistic regression’, Machine Learning, Vol. 61, Nos. 1–3, pp.151–165.
King, G. and Zeng, L. (2001) ‘Logistic regression in rare events data’, Political Analysis, Vol. 9,
pp.137–163.
Kleinbaum, D.G., Kupper, L.L., Nizam, A. and Muller, K.E. (2007) Applied Regression Analysis and
Multivariable Methods, 4th ed., Duxbury Press.
Koh, K., Kim, S. and Boyd, S. (2007) ‘An interior-point method for large-scale ℓ1-regularized logistic
regression’, Journal of Machine Learning Research, Vol. 8, pp.1519–1555.
Komarek, P. (2004) ‘Logistic regression for data mining and high-dimensional classification’,
PhD thesis, Carnegie Mellon University.
Komarek, P. and Moore, A. (2005a) ‘Making logistic regression a core data mining tool: a practical
investigation of accuracy, speed, and simplicity’, Technical report, Carnegie Mellon University.
Komarek, P. and Moore, A. (2005b) ‘Making logistic regression a core data mining tool with
TR-IRLS’, Proceedings of the Fifth IEEE Conference on Data Mining.
Lewis, J.M., Lakshmivarahan, S. and Dhall, S. (2006) Dynamic Data Assimilation: A Least Squares
Approach, Cambridge University Press.
Lin, C., Weng, R.C. and Keerthi, S.S. (2007) ‘Trust region Newton methods for large-scale logistic
regression’, Proceedings of the 24th International Conference on Machine Learning.
Long, J.S. and Freese, J. (2005) Regression Models for Categorical Dependent Variables Using Stata,
2nd ed., Stata Press.
Maalouf, M. and Trafalis, T.B. (2008) ‘Kernel logistic regression using truncated Newton method’,
in C.H. Dagli, D.L. Enke, K.M. Bryden, H. Ceylan and M. Gen (Eds.): Intelligent Engineering
Systems through Artificial Neural Networks, Vol. 18, pp.455–462, ASME Press, New York, NY, USA.
Maalouf, M. and Trafalis, T.B. (2011) ‘Robust weighted kernel logistic regression in imbalanced and
rare events data’, Computational Statistics & Data Analysis, Vol. 55, No. 1, pp.168–183.
Maimon, O. and Rokach, L. (Eds.) (2005) Data Mining and Knowledge Discovery Handbook,
Springer.
Maiti, T. and Pradhan, V. (2008) ‘A comparative study of the bias corrected estimates in logistic
regression’, Statistical Methods in Medical Research, Vol. 17, No. 6, pp.621–634.
Malouf, R. (2002) ‘A comparison of algorithms for maximum entropy parameter estimation’,
in Proceedings of Conference on Natural Language Learning, Vol. 6.

Manski, C.F. and Lerman, S.R. (1977) ‘The estimation of choice probabilities from choice based
samples’, Econometrica, Vol. 45, No. 8, pp.1977–1988.
McCullagh, P. and Nelder, J. (1989) Generalized Linear Model, Chapman and Hall/CRC.
Milgate, M., Eatwell, J. and Newman, P.K. (Eds.) (1990) Econometrics, W.W. Norton & Company.
Minka, T.P. (2003) ‘A comparison of numerical optimizers for logistic regression’, Technical report,
Department of Statistics, Carnegie Mellon University.
Mitsa, T. (2010) Temporal Data Mining, Chapman & Hall/CRC.
Pal, S.K. and Mitra, P. (2004) Pattern Recognition Algorithms for Data Mining, 1st ed., Chapman and
Hall/CRC.
Palepu, K. (1986) ‘Predicting takeover targets: a methodological and empirical analysis’, Journal of
Accounting and Economics, Vol. 8, pp.3–35.
Park, M.Y. and Hastie, T. (2008) ‘Penalized logistic regression for detecting gene interactions’,
Biostatistics, Vol. 9, No. 1, pp.30–50.
Ramalho, E.A. and Ramalho, J.J.S. (2007) ‘On the weighted maximum likelihood estimator for
endogenous stratified samples when the population strata probabilities are unknown’, Applied
Economics Letters, Vol. 14, pp.171–174.
Rossi, R.J. (2009) Applied Biostatistics for the Health Sciences, 1st ed., Wiley.
Roth, V. (2001) ‘Probabilistic discriminative kernel classifiers for multi-class problems’, in
Proceedings of the 23rd DAGM-Symposium on Pattern Recognition, Springer-Verlag, London,
UK, pp.246–253.
Schein, A.I. and Ungar, L.H. (2007) ‘Active learning for logistic regression: an evaluation’, Machine
Learning, Vol. 68, No. 3, pp.235–265.
Vapnik, V. (1995) The Nature of Statistical Learning, Springer, NY.
Visa, S. and Ralescu, A. (2005) ‘Issues in mining imbalanced data sets – a review paper’,
in Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference,
pp.67–73.
Wang, S. and Wang, T. (2001) ‘Precision of warm’s weighted likelihood for a polytomous model in
computerized adaptive testing’, Applied Psychological Measurement, Vol. 25, No. 4, pp.317–331.
Xie, Y. and Manski, C.F. (1989) ‘The logit model and response-based samples’, Sociological Methods
& Research, Vol. 17, pp.283–302.
Zadrozny, B. (2004) ‘Learning and evaluating classifiers under sample selection bias’, in ICML ‘04:
Proceedings of the 21st International Conference on Machine Learning, ACM, New York, NY,
USA, p.114.
