Journal of Econometrics
A latent class Cox model for heterogeneous time-to-event data
Youquan Pei a, Heng Peng b, Jinfeng Xu c,∗

a School of Economics, Shandong University, China
b Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
c Department of Biostatistics, City University of Hong Kong, Hong Kong, China
Article history: Received 2 June 2021; Received in revised form 6 April 2022; Accepted 31 August 2022; Available online xxxx

Keywords: Default risk; EM algorithm; Heterogeneous covariate effects; Latent class model; Penalized likelihood; Survival data

Abstract

Credit risk plays a vital role in the era of digital finance, and it is of primary interest to identify customers with similar types of risk so that personalized financial services can be offered accordingly. Motivated by the burgeoning need for default risk modeling in finance, we propose herein a latent class Cox model for heterogeneous time-to-event data. The proposed model naturally extends the Cox proportional hazards model to flexibly take into account the heterogeneity of covariate effects as often manifested in real data. Without a priori specification of the number of latent classes, it simultaneously incorporates the commonalities and disparities of individual customers' risk behaviors and provides a more refined modeling technique than existing approaches. We further propose a penalized maximum likelihood approach to identify the number of latent classes and estimate the model parameters. A modified expectation–maximization algorithm is then developed for its numerical implementation. Simulation studies are conducted to assess the finite-sample performance of the proposed approach. An illustration with a real credit card data set is also provided.

© 2022 Elsevier B.V. All rights reserved.
1. Introduction
Following the Global Financial Crisis (GFC), a growing volume of regulation has been designed to reduce risk in the
financial system and risk management has gained more prominence. Credit risk modeling or grading now plays a vital
role in finance and has been widely used by most financial institutions all over the world. The calibration of default
risk is one of the major tasks to be accomplished by banks and other financial institutions. The main objective in credit
default analysis is to build statistical models, fit them to historical data, and then classify (future) borrowers into high- or low-risk groups based on their individual risk factors. Traditionally, machine learning-driven classification methods
(e.g., logistic regression, discriminant analysis, decision tree, support vector machine) are used where the binary variable
indicating whether the customer defaults during the follow-up period is taken as the response [see Shi et al. (2022) for a
comprehensive review]. However, the original time-to-default information is not fully utilized.
Survival analysis is a branch of statistics for analyzing the expected duration of time until an event occurs. In the context of credit scoring, the event of interest is usually default. The idea of applying survival analysis techniques to credit scoring was first introduced by Thomas et al. (2002). Recently, several studies [see, e.g., Bellotti and Crook (2009), Cao et al. (2009), Dirick et al. (2017, 2019), Bai et al. (2022), and references therein] have shown that survival analysis often performs better than classification methods, since it models not only the occurrence of default but also the time at which the default occurs.
∗ Corresponding author. E-mail address: jinfenxu@[Link] (J. Xu).
Fig. 1. Density as a function of duration time for the German credit card data.
Despite the above-mentioned advantages and recent progress in using survival analysis for credit scoring, it is con-
ventionally assumed that the conditional hazard function given the covariates for all borrowers can be well characterized
by a single regression model such as the Cox proportional hazards model. However, this assumption may be unmet in
practice. For example, in the presence of heterogeneity as often manifested by covariate effects, the subjects naturally form different subgroups, for which separate regression models instead of one identical regression model should
be employed. In such situations, the classical Cox regression analysis will lead to biased estimates and erroneous
conclusions. The penalization-based subgroup analysis approach can be used to accommodate heterogeneous covariate
effects. For example, a concave fusion penalty is used for subgroup identification and parameter estimation with censored
data (Yan et al., 2021). However, this method does not yield a posterior probability for the subgroup membership and
hence cannot be used for prediction purposes. Alternatively, a structured mixture model is often used to overcome this
drawback (McLachlan and McGiffin, 1994). Recently, Shen and He (2015) proposed a structured logistic-normal mixture model where the subgroup membership is modeled by logistic regression and the response by a normal linear regression. Wu et al. (2016) further extended it to a logistic-Cox mixture model to accommodate censored outcomes. You et al. (2018)
studied the variable selection problem in the finite-mixture Cox model. Nagpal et al. (2021) discussed some numerical
challenges in fitting the finite-mixture Cox model, used spline interpolation for the baselines, and suggested a default
set of 3 latent subgroups for its application (p. 15). However, these mixture model approaches all require the a priori
specification of the number of mixing components, which may be unrealistic in practice.
To motivate our work, we look at the German credit card data which is publicly available at the UCI repository.1 The
original data set consists of 1000 observations and 21 variables. It is of interest to examine the heterogeneity present in
this data set. Fig. 1 plots the density as a function of duration time for 1000 customers. We observe that there are several
peaks. This indicates that the distribution of duration time may be better represented by the mixing of a few distinct sub-
populations. This motivates us to consider a finite-mixture model to characterize the heterogeneity. A natural question then arises: how many groups are there?
In fact, a key challenge in finite-mixture modeling is the selection of the number of components. On one hand, a
mixture model with too few components may not well accommodate the heterogeneity and lead to biased estimates and
incorrect conclusions. On the other hand, a mixture model with too many components may overfit the data and lack good
interpretation.
In this paper, we make contributions on three fronts. First, we propose a latent class or finite mixture Cox model
for analyzing heterogeneous survival data without a priori specification of the number of mixing components. Secondly,
1 [Link]
we develop a new penalized likelihood approach for identifying the number of latent classes and estimating the model
parameters. A modified expectation–maximization (EM) algorithm is also proposed to implement the new method.
Thirdly, we establish the asymptotic properties of the proposed approach such as the consistent identification of the
number of components and the limiting distributions of the proposed estimators.
The rest of the paper is organized as follows. In Section 2, we first introduce the latent class or finite mixture Cox
model. A penalized likelihood approach for the identification and estimation of the model is proposed and a modified
EM algorithm is used to implement it. A BIC-type criterion is also proposed to select the tuning parameters. We also give
the generalization of the model to allow the subgroup membership probability to depend on the covariates. Section 3
obtains asymptotic consistency results for identifying the number of latent classes and also establishes the asymptotic
normality of the proposed estimators of the model parameters. Numerical results, including simulation studies and a real data application, are presented in Section 4. Section 5 provides some concluding remarks and discusses possible future
work. Additional simulation studies and detailed proofs of the theorems are all relegated to the online Appendix.
2. The latent class Cox model for calibrating credit risk
In credit risk modeling, the event of interest is loan default, and non-default events such as loan maturity or early repayment induce censoring. Let T_i denote the default time of subject i (i.e., the time from the moment of borrowing to the occurrence of default) and C_i the censoring time. The survival function S_i(t) = P(T_i > t)
quantifies the probability that subject i has not yet defaulted by time t. For each subject i, we observe (Xi , Yi , δi ), where
Yi = min(Ti , Ci ) and Xi = (Xi1 , . . . , Xip )⊤ denotes the characteristics of each subject, such as age, gender, and education
level. The censoring indicator δi = I(Ti ≤ Ci ) denotes whether subject i experiences the event of interest during the
observation period (δi = 1) or not (δi = 0). The classical Cox proportional hazards model (Cox, 1972) postulates that
λ_i(t) = λ_0(t) exp(β^⊤ X_i), i = 1, . . . , n,

where λ_i(t) is the hazard function of T_i given the covariates X_i, λ_0(t) is the baseline hazard function, and β = (β_1, . . . , β_p)^⊤ is the vector of covariate effects, quantifying the effect of the covariates X_i on the time to event T_i through the conditional hazard
function. Note that the covariate effects are assumed to be the same for all subjects in the population. In practice, however,
subjects may come from different subgroups and the covariate effects may differ. Therefore, it is more appropriate to
consider the following Cox proportional hazards model with group-specific covariate effects:

λ_k(t) = λ_{0k}(t) exp(β_k^⊤ X_i), k = 1, . . . , K,

where β_k = (β_{k,1}, . . . , β_{k,p})^⊤ are the covariate effects for group k, and λ_{0k}(t) are the group-specific baseline hazard functions. We are interested in simultaneously estimating the number K of groups and the group-specific coefficients β_k.
2.1. Model specification and identification
The latent class or survival mixture model assumes that each data point (Xi , Yi , δi ) is drawn from a mixture density of
K (unspecified) components whose mixing proportions are (π1 , . . . , πK ), where πk = p (zi = k) is the prior probability of
subject i belonging to component k, and zi ∈ {1, . . . , K } is the hidden class label of subject i.
The survival mixture model assumes that each mixture component k is a conditional component density fk (y, δ|x).
Thus, the conditional mixture density function can be written as

f(y, δ | x) = ∑_{k=1}^{K} π_k f_k(y, δ | x). (2.1)
For the Cox proportional hazards model, the conditional density function of group k is given by

f_k(y, δ | x) = [λ_{0k}(y) exp(β_k^⊤ x)]^δ exp{ −exp(β_k^⊤ x) ∫_0^y λ_{0k}(u) du }. (2.2)
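To make Eq. (2.2) concrete, here is a minimal R sketch of the component density; the baseline hazard lambda0 and its cumulative version Lambda0 are assumed to be supplied as functions (in the algorithm of Section 2.2 they are instead profiled out and smoothed).

    # Sketch: component density f_k(y, delta | x) from Eq. (2.2).
    # lambda0 and Lambda0 (cumulative baseline hazard) are assumed given here;
    # in Section 2.2 they are profiled out and kernel-smoothed.
    fk_density <- function(y, delta, x, beta, lambda0, Lambda0) {
      eta <- exp(sum(beta * x))                          # exp(beta' x)
      (lambda0(y) * eta)^delta * exp(-eta * Lambda0(y))
    }
    # Example with an exponential baseline, lambda0(t) = 1, Lambda0(t) = t:
    fk_density(y = 0.8, delta = 1, x = c(0.2, -0.5), beta = c(1, 1),
               lambda0 = function(t) 1, Lambda0 = function(t) t)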
It is well known that identifiability is a major issue in mixture models. We first give the identifiability condition for the
proposed finite-mixture Cox PH model.
i. Assume that 0 < π_k < 1 for k = 1, . . . , K and ∑_{k=1}^{K} π_k = 1.
ii. Assume that β_1, β_2, . . . , β_K are distinct vectors; that is, for any k ≠ s, β_k and β_s differ in at least one component.
We will denote Y = (Y_1, . . . , Y_n)^⊤, ∆ = (δ_1, . . . , δ_n)^⊤, X = (X_1, . . . , X_n)^⊤, π = (π_1, . . . , π_K)^⊤, and β = (β_1^⊤, . . . , β_K^⊤)^⊤. The unknown parameter vector θ = (π, β)^⊤ is generally estimated by maximizing the observed log-likelihood

ℓ_n(θ; Y, ∆ | X) = ∑_{i=1}^{n} log{ ∑_{k=1}^{K} π_k f_k(Y_i, δ_i | X_i) }. (2.3)
Let zik denote the latent indicator variable according to whether sample i comes from subgroup k, and let Z = (zik )n×K .
The complete log-likelihood is given by

ℓ_c(θ; Y, ∆, Z | X) = ∑_{i=1}^{n} ∑_{k=1}^{K} z_{ik} [ log(π_k) + log f_k(Y_i, δ_i | X_i) ]. (2.4)
The maximum likelihood estimates of π and β are denoted by π̂ and β̂, respectively. Given π̂ and β̂, the conditional probabilities of z_{ik} can be computed via the EM algorithm.
2.2. Penalized likelihood estimation via modified expectation–maximization algorithm
For a fixed number K of components, we can maximize the likelihood function (2.3) by using the EM algorithm, which, in the E step, computes the posterior probabilities of the class memberships and, in the M step, estimates the mixing proportions and the unknown parameters. However, in practice, the number of components is usually unknown and needs to be inferred from the data itself.
For the finite mixture model, the selection of the number of mixing components can be viewed as a model-selection
problem. Various conventional methods have been proposed based on the likelihood function and some information
criteria. In particular, the Bayesian information criterion (BIC) or Akaike information criterion are recommended as useful
tools for selecting the number of components [see, e.g., Dirick et al. (2017)]. Therefore, a natural idea is to propose a BIC-
type criterion for selecting the number of mixing components. However, the information-criterion approach is a two-stage scheme, and the simulation study in the online appendix shows that it cannot consistently select the true number of mixing components.
In this paper, we apply a penalization technique for model selection similar to that used by Huang et al. (2017). Based
on Eq. (2.4), the kth component would intuitively be eliminated if πk = 0. We start by fitting a survival-mixture model
with a large number of clusters and discard invalid clusters as the learning proceeds. However, in implementing (2.4), the
likelihood function for the complete data (zik , Xi , Yi , δi ) involves log(πk ) rather than πk . Therefore, it is natural to penalize
the logarithm of mixture proportions log(πk ). Moreover, note that the gradient of log(πk ) increases very fast when πk
is close to zero, and it would dominate the gradient of πl > 0. Consequently, the popular Lq types of penalties may
not suffice to set insignificant πk to zero. In the spirit of penalization of Huang et al. (2017), we propose the following
penalized likelihood function:

ℓ^P(γ, θ) = ∑_{i=1}^{n} ∑_{k=1}^{K} z_{ik} log[π_k f_k(Y_i, δ_i | X_i)] − nγ ∑_{k=1}^{K} [log(ϵ + π_k) − log(ϵ)], (2.5)
where γ is a tuning parameter and ϵ is a very small positive constant, say 10^{−6}. Note that log(ϵ + π_k) − log(ϵ) is an increasing function of π_k and shrinks to zero as the mixing proportion π_k approaches zero. Therefore, by maximizing the penalized likelihood function in Eq. (2.5), we can simultaneously determine the number of mixture components and estimate the mixing proportions and regression coefficients. The proposed penalized estimation procedure is given as follows.
2.2.1. Expectation step
This step computes the expectation of the penalized complete-data log-likelihood (2.5) involving the hidden class labels Z, given the observed data D = (X, Y, ∆) and the current parameter estimate θ^{(m)}, where m is the current iteration number. Denote

Q(γ, θ; θ^{(m)}) = E[ℓ^P(γ, θ) | D; θ^{(m)}]
  = ∑_{i=1}^{n} ∑_{k=1}^{K} E[z_{ik} | D; θ^{(m)}] log[π_k f_k(Y_i, δ_i | X_i)] − nγ ∑_{k=1}^{K} [log(ϵ + π_k) − log(ϵ)]
  = ∑_{i=1}^{n} ∑_{k=1}^{K} z_{ik}^{(m)} log[π_k f_k(Y_i, δ_i | X_i)] − nγ ∑_{k=1}^{K} [log(ϵ + π_k) − log(ϵ)],

where

z_{ik}^{(m)} = π_k^{(m)} f_k(Y_i, δ_i | X_i) / ∑_{l=1}^{K} π_l^{(m)} f_l(Y_i, δ_i | X_i) (2.6)
is the posterior probability that observation i is generated from cluster k. This step therefore only requires the computation of the posterior cluster probabilities z_{ik}^{(m)} for each of the K clusters.
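A minimal R sketch of this E step follows; dens is assumed to be an n × K matrix whose (i, k) entry is the component density f_k(Y_i, δ_i | X_i) evaluated at the current parameter values, and prob is the current vector of mixing proportions.

    # Sketch of the E step in Eq. (2.6): posterior class probabilities.
    e_step <- function(dens, prob) {
      num <- sweep(dens, 2, prob, `*`)   # pi_k^(m) * f_k(Y_i, delta_i | X_i)
      num / rowSums(num)                 # normalize rows to obtain z_ik^(m)
    }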
2.2.2. Maximization step
This step updates the parameter vector θ by maximizing the Q function, that is, θ^{(m+1)} = arg max_θ Q(γ, θ; θ^{(m)}). By decomposing the Q function as

Q(γ, θ; θ^{(m)}) = Q_1(γ, π; θ^{(m)}) + ∑_{k=1}^{K} Q_{2,k}(β_k; θ^{(m)}), (2.7)
where

Q_1(γ, π; θ^{(m)}) = ∑_{i=1}^{n} ∑_{k=1}^{K} z_{ik}^{(m)} log(π_k) − nγ ∑_{k=1}^{K} [log(ϵ + π_k) − log(ϵ)], (2.8)

and

Q_{2,k}(β_k; θ^{(m)}) = ∑_{i=1}^{n} z_{ik}^{(m)} log[f_k(Y_i, δ_i | X_i)], (2.9)

it follows that the maximization of the Q function can be done by separately maximizing Q_1(γ, π; θ^{(m)}) with respect to the mixing proportions and, for each component k, maximizing Q_{2,k}(β_k; θ^{(m)}) with respect to the regression parameters β_k.
The mixing proportions are updated by maximizing Eq. (2.8) with respect to π subject to the constraint ∑_{k=1}^{K} π_k = 1. Hence, we introduce a Lagrange multiplier α to take the constraint into account and aim to solve the following set of equations:

∂/∂π_k { ∑_{i=1}^{n} ∑_{k=1}^{K} z_{ik}^{(m)} log(π_k) − nγ ∑_{k=1}^{K} log(ϵ + π_k) − α( ∑_{k=1}^{K} π_k − 1 ) } = 0.

Given that ϵ is very close to zero, a straightforward calculation gives

π̂_k = max{ 0, (1/(1 − Kγ)) [ (1/n) ∑_{i=1}^{n} z_{ik}^{(m)} − γ ] }. (2.10)
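A minimal R sketch of this proportion update follows; z is the n × K posterior matrix from the E step. The renormalization over surviving components is an implementation detail for the case where some proportions are truncated to zero.

    # Sketch of the M-step update for the mixing proportions, Eq. (2.10).
    # Components whose proportion is driven to zero are discarded.
    update_pi <- function(z, gamma) {
      K <- ncol(z)
      p_new <- pmax(0, (colMeans(z) - gamma) / (1 - K * gamma))
      p_new / sum(p_new)   # renormalize over the surviving components
    }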
The (m + 1)st M step then updates β_k by maximizing Eq. (2.9) separately for k = 1, 2, . . . , K. Since the density function in Eq. (2.9) contains the nonparametric function λ_{0k}(·), we adopt the profile likelihood approach (Johansen, 1983) to update it. Denote λ_{0k}(Y_i) by λ_{i,k}. Maximizing over λ_{i,k}, we obtain the profile estimates of the hazards² as a function of β_k:

λ̂_{i,k} = z_{ik}^{(m)} / ∑_{j: Y_j ≥ Y_i} z_{jk}^{(m)} exp(β_k^⊤ X_j). (2.11)
Replacing λ_{0k}(Y_i) in Eq. (2.9) with λ̂_{i,k} yields the following weighted partial likelihood:

ℓ_{c,k}(β_k; D, Z^{(m)}) = ∑_{i=1}^{n} δ_i z_{ik}^{(m)} [ β_k^⊤ X_i − log{ ∑_{j: Y_j ≥ Y_i} z_{jk}^{(m)} exp(β_k^⊤ X_j) } ]. (2.12)
Moreover, the estimates in Eq. (2.11) are discrete and cannot be used directly to estimate the density function in Eq. (2.2). Thus, following Bordes and Chauveau (2016), we apply a kernel smoothing technique to obtain a smooth estimator of the baseline hazard function λ_{0k}(·). Suppose that K(·) is a kernel function and h = h_n is a bandwidth; then λ_{0k}(t) is estimated by

λ̂_{0k}^{(m+1)}(t) = (1/h) ∑_{i=1}^{n} K((t − Y_i)/h) λ̂_{i,k}. (2.13)
λ0k (·) is then updated by Eq. (2.13). Algorithm 1 summarizes the proposed modified EM algorithm.
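As an illustration, a one-line R sketch of Eq. (2.13) with a Gaussian kernel is given below; Y holds the observed times and lam the discrete profile hazards λ̂_{i,k} from Eq. (2.11).

    # Sketch of the kernel-smoothed baseline hazard in Eq. (2.13), taking
    # K(.) to be the Gaussian kernel dnorm(); h is the bandwidth.
    smooth_hazard <- function(t, Y, lam, h) {
      sapply(t, function(s) sum(dnorm((s - Y) / h) * lam) / h)
    }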
Remark 2.1 (Initialization). To obtain the initial values for π^{(0)} and Z^{(0)}, we first set a large initial number of mixture components, say K = 10; the proposed modified EM algorithm then selects components in a backward way, merging smaller components into larger ones and thereby reducing the number of components. Z^{(0)} is initialized as an n × K matrix: for each row of Z^{(0)}, we randomly select one column from 1 to K and set the element of the selected column to 1 and the others to 0. The initial mixing probability π^{(0)} can then be calculated by averaging the rows of Z^{(0)} [e.g., π^{(0)} = (1/K, . . . , 1/K)^⊤].
2 The detailed derivation is relegated to the online Appendix.
Algorithm 1: Modified EM Algorithm

Set m = 0; initialize π^{(0)} and Z^{(0)}; estimate β^{(0)} by maximizing Eq. (2.12) and λ_{0k}^{(0)}(·) according to Eq. (2.13).
Repeat:
  E step: compute Z^{(m+1)} according to Eq. (2.6);
  M step: update π^{(m+1)} according to Eq. (2.10);
    for k = 1, 2, . . . , K:
      update β_k^{(m+1)} by maximizing Eq. (2.12);
      update λ_{0k}^{(m+1)} according to Eq. (2.13);
until convergence.
Remark 2.2 (Convergence of Algorithm 1). The M step in Algorithm 1 updates the parameters in two parts: one part updates π^{(m+1)} according to Eq. (2.10), which has a closed form; the other part updates β_k^{(m+1)} by maximizing Eq. (2.12). Note that Eq. (2.12) is a weighted log partial likelihood, which can be easily implemented using the coxph() function in R. Moreover, following Theorem 3 in Huang and Yao (2012), the modified EM algorithm possesses the following desirable ascent property:

lim inf_{n→∞} n^{−1} [ ℓ_n(θ^{(m+1)}) − ℓ_n(θ^{(m)}) ] ≥ 0.
The numerical experiments conducted in Section 4 also demonstrate that the proposed algorithm generally converges
well.
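Since Remark 2.2 notes that Eq. (2.12) is a weighted log partial likelihood, the β_k update can be sketched with coxph() as below; the data frame dat (columns Y, delta, X1, X2) and the weight vector w (the kth column of the posterior matrix) are illustrative placeholders.

    # Sketch of the beta_k update: Eq. (2.12) is a Cox partial likelihood with
    # case weights z_ik^(m), so it can be maximized with coxph().
    library(survival)
    fit_k <- coxph(Surv(Y, delta) ~ X1 + X2, data = dat, weights = w)
    beta_k <- coef(fit_k)   # updated beta_k^(m+1)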
2.3. Selection of tuning parameter
To implement the methods described in the previous sections, it is desirable to have an automatic method to select
the tuning parameter γ based on the survival data. In terms of selecting the tuning parameter γ , following the idea of Ma
and Huang (2017), a modified BIC with a grid search scheme can be used:
BIC(γ) = −2 ∑_{i=1}^{n} log( ∑_{k=1}^{K̂} π̂_k f_k(Y_i, δ_i | X_i; β̂_k) ) + C_n D_f log(n),
where D_f = K̂ − 1 + K̂p is the degrees of freedom of the proposed Cox PH mixture model, K̂ is the estimate of the number of components, and π̂_k, β̂_k are the estimates of π_k, β_k obtained by maximizing Eq. (2.7) for a given γ. C_n is a positive number that may depend on n. When C_n = 1, the modified BIC reduces to the traditional BIC (Schwarz, 1978). Ma and Huang (2017) used C_n = c log(log(n + p)) in their simulation study when the number p of predictors diverges with the sample size. In this paper, we adopt the same strategy and let C_n = c log(log(n + K)), where c is a positive constant. In our numerical study, we set the order γ = √(log(n)/n), which simplifies the search range of γ and is justified by the theoretical result in Theorem 3.1. Numerical results in Section 4 also demonstrate the feasibility of this approach.
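As a hedged sketch of this selection step, suppose fit_mixture() runs the modified EM algorithm for a given γ and modified_bic() evaluates BIC(γ); both names are hypothetical wrappers around the procedures described above.

    # Sketch: select gamma = c * sqrt(log(n)/n) over a grid of c values by the
    # modified BIC. `fit_mixture` and `modified_bic` are placeholder wrappers.
    c_grid <- seq(0.1, 2, by = 0.1)
    gamma_grid <- c_grid * sqrt(log(n) / n)
    bic_vals <- sapply(gamma_grid, function(g) modified_bic(fit_mixture(dat, g)))
    gamma_opt <- gamma_grid[which.min(bic_vals)]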
2.4. Model generalization
To make the proposed model (2.1) more flexible, as suggested by a reviewer, we consider a more general model that
allows the mixing probabilities π_k to vary with the covariates,

f(y, δ | x) = ∑_{k=1}^{K} π_k(x; ξ) f_k(y, δ | x). (2.14)
Note that (2.14) is a Mixture of Experts (MoE) model that was first proposed by Jacobs et al. (1991), where the gating
network is the conditional distribution of the latent variable given the covariates, i.e.,

π_k(x; ξ) = exp(x^⊤ ξ_k) / ∑_{l=1}^{K} exp(x^⊤ ξ_l), (2.15)

with parameters ξ = (ξ_1^⊤, . . . , ξ_K^⊤)^⊤. Although the varying mixing probability given by Eq. (2.15) is a straightforward extension of the constant mixing probability, the gating parameter ξ does not have an explicit solution based on the algorithm
proposed by Jordan and Jacobs (1994). To solve this problem, Xu et al. (1995) proposed a localized MoE model in which the parameters are estimated based on f(x, y, δ), i.e.,

f(x, y, δ) = ∑_{k=1}^{K} p(Z = k) p(x | Z = k) f_k(y, δ | x, Z = k), (2.16)

where Z is a latent discrete variable that follows a multinomial distribution with P(Z = k) = π_k, and p(x | Z = k) is the conditional distribution of x given Z = k. Without loss of generality and for ease of illustration, we assume that x | Z = k follows a multivariate Gaussian distribution with mean µ_k and covariance Σ_k. The mixing probability of the kth component can then be expressed as

π_k(x; µ, Σ) = p(Z = k) p(x | Z = k) / p(x) = π_k φ(x; µ_k, Σ_k) / ∑_{l=1}^{K} π_l φ(x; µ_l, Σ_l),
where φ(·) is the multivariate Gaussian density function. Now, the log-likelihood function of the localized MoE model for the sample (X_i, Y_i, δ_i), i = 1, . . . , n, can be written as

ℓ(θ) = ∑_{i=1}^{n} log[ ∑_{k=1}^{K} π_k φ(X_i; µ_k, Σ_k) f_k(Y_i, δ_i | X_i) ], (2.17)
where θ = (π, µ, Σ, β). To determine the number of mixture components and estimate the model parameters simultaneously, we consider the following penalized log-likelihood function,

ℓ^P(θ) = ℓ(θ) − nγ ∑_{k=1}^{K} [log(ϵ + π_k) − log(ϵ)], (2.18)

where the penalty is imposed to determine the number K of mixing components, similarly to Eq. (2.5). Note that when the covariates are high-dimensional, we could add another penalty function [see, e.g., Jiang et al. (2018)] to Eq. (2.18), such as the LASSO (Tibshirani, 1996) or SCAD penalty (Fan and Li, 2001). To save space, the algorithm for solving Eq. (2.18) is relegated to the online Appendix.
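For concreteness, a small R sketch of the localized-MoE gating probabilities is given below, using dmvnorm() from the mvtnorm package; prob, mu, and Sigma denote the current mixing weights, a list of component means, and a list of component covariances.

    # Sketch of the gating probabilities pi_k(x; mu, Sigma): the posterior
    # weight of component k given the covariate vector x.
    library(mvtnorm)
    gate_prob <- function(x, prob, mu, Sigma) {
      w <- mapply(function(p, m, S) p * dmvnorm(x, mean = m, sigma = S),
                  prob, mu, Sigma)
      w / sum(w)   # pi_k phi(x; mu_k, Sigma_k) / sum_l pi_l phi(x; mu_l, Sigma_l)
    }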
2.5. Prediction
The proposed survival mixture model quantifies the probability that a new customer belongs to each of the identified subgroups, thereby providing a more refined method for calibrating default risk than existing approaches that classify the customer into a high-, low-, or intermediate-risk group. Given a set of covariates X_i for a potential customer i, the estimated probability π̂_k(X_i) of belonging to each group can be calculated.

In practice, if financial institutions want to predict when default occurs for a potential customer, they can use the following survival function to predict the probability of no default yet by time t:

S(t | X_i) = π̂_1(X_i) S_1(t | X_i; β̂_1) + π̂_2(X_i) S_2(t | X_i; β̂_2) + · · · + π̂_K̂(X_i) S_K̂(t | X_i; β̂_K̂),

where K̂ and β̂ are the estimated number of mixing components and regression coefficients, and S_k(t | X_i; β̂_k) is the estimated group-specific survival function for the kth group. The smaller the predicted survival probability at time t (e.g., 12 months), the higher the risk of default. If S(t | X_i) < c, where c is a threshold that can be determined by the financial institution, the institution can decide to reject the loan application.
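A hedged sketch of this prediction rule follows; surv_k is a hypothetical list of estimated group-specific survival functions S_k(t | x), pi_hat the estimated membership probabilities for the new customer (constant under model (2.1), covariate-dependent under Section 2.4), and thresh plays the role of the threshold c above.

    # Sketch: mixture survival probability for a new customer and a simple
    # accept/reject rule. `pi_hat`, `surv_k`, and `thresh` are placeholders.
    predict_survival <- function(t, x, pi_hat, surv_k) {
      sum(mapply(function(p, S) p * S(t, x), pi_hat, surv_k))
    }
    loan_decision <- function(x, pi_hat, surv_k, thresh) {
      if (predict_survival(12, x, pi_hat, surv_k) < thresh) "reject" else "accept"
    }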
3. Asymptotic properties
In this section, we derive the theoretical properties based on the general model in Section 2.4. We first introduce some regularity conditions and then establish the consistency result for identifying the number of mixing components. We also study the asymptotic properties of the parameters θ̂ estimated by the modified EM algorithm. Denote the true parameter vector by θ_0; its components are denoted with a subscript 0, such as β_{0k}. Let Θ be a compact parameter space.
A1. The sample {(X_i, Y_i, δ_i), i = 1, . . . , n} is independent and identically distributed from the joint distribution f(x, y, δ). The support of x is a compact subset of R^p.
A2. The joint density f(x, y, δ) has a continuous first derivative and is positive on its support.
A3. Let ψ(θ) = log[ ∑_{k=1}^{K} π_k φ(x; µ_k, Σ_k) f_k(y, δ | x) ]. For each θ ∈ Θ, ψ(θ) has third partial derivatives with respect to θ, and there exist functions g_m(x, y, δ), m = 1, 2, 3, such that for θ in a neighborhood of θ_0,

|∂ψ(θ)/∂θ_i| ≤ g_1(x, y, δ), |∂²ψ(θ)/∂θ_i ∂θ_j| ≤ g_2(x, y, δ), |∂³ψ(θ)/∂θ_i ∂θ_j ∂θ_k| ≤ g_3(x, y, δ),

where E{g_m(x, y, δ)} < ∞ for m = 1, 2, 3.
A4. θ_0 is the identifiably unique maximizer of E{ℓ(θ)}. The second-derivative matrix E{ −∂²ψ(θ)/∂θ ∂θ^⊤ } is finite and positive definite at θ = θ_0.
A5. ∥µ_k∥ ≤ C_1 and ∥Σ_k∥ ≤ C_2 for k = 1, . . . , K, where C_1 and C_2 are sufficiently large constants, and min_{k,j} {λ_j(Σ_k), j = 1, . . . , p; k = 1, . . . , K} ≥ C_3, where λ_j(Σ_k) are the eigenvalues of Σ_k and C_3 is a positive constant.
Remark 3.1. The conditions above are needed to derive the consistency result for identifying the number of mixing components and the asymptotic properties of θ̂. Conditions (A1)–(A4) are mild and standard in the mixture-model literature [see, e.g., Huang and Yao (2012), Xu et al. (2018)]. Condition (A5) is similar to conditions (P1) and (P2) in Huang et al. (2017) and ensures the non-singularity of the covariance matrices Σ_k. This condition can be removed if we consider the mixture model in which the mixing proportions are independent of the covariates.
First, we assume that the true number of mixing components is K_0, where K_0 ≤ K, and that the mixing probabilities satisfy

π_l = 0 for l = 1, . . . , K − K_0, and π_l = π_{0k} > 0 for l = K − K_0 + 1, . . . , K, k = 1, . . . , K_0.

In the spirit of locally conic parametrization (Dacunha-Castelle and Gassiat, 1999), we define

π_l = λ_l η for l = 1, . . . , K − K_0, and π_l = π_{0k} + ρ_k η for l = K − K_0 + 1, . . . , K, k = 1, . . . , K_0.
The joint density function f(x, y, δ; η, ζ) can then be rewritten as

f(x, y, δ; η, ζ) = ∑_{l=1}^{K−K_0} λ_l η φ(x; µ_l, Σ_l) f_l(y, δ | x; β_l) + ∑_{k=1}^{K_0} (π_{0k} + ρ_k η) φ(x; µ_{0k} + ηδ_k, Σ_{0k} + ηδ_k) f_k(y, δ | x; β_{0k} + ηδ_k),

where

ζ = (λ_1, . . . , λ_{K−K_0}, µ_1, . . . , µ_{K−K_0}, Σ_1, . . . , Σ_{K−K_0}, β_1, . . . , β_{K−K_0}, ρ_1, . . . , ρ_{K_0}, δ_1, . . . , δ_{K_0})^⊤.
Remark 3.2. The key idea of the locally conic parametrization is to introduce a new parametrization of the model in which the parameter η captures the distance to the true model and the second parameter ζ is the direction of perturbation. Since η contains all the information about the model order and is identifiable, it can be consistently estimated.
Similar to Dacunha-Castelle and Gassiat (1999) and Huang et al. (2017), the following restrictions are imposed on ζ:

λ_l ≥ 0, µ_l, β_l ∈ R^p, and Σ_l ∈ R^{p×p} for l = 1, . . . , K − K_0; δ_k ∈ R^p, ρ_k ∈ R for k = 1, . . . , K_0;

∑_{l=1}^{K−K_0} λ_l + ∑_{k=1}^{K_0} ρ_k = 0, ∑_{l=1}^{K−K_0} λ_l² + ∑_{k=1}^{K_0} ρ_k² + ∑_{k=1}^{K_0} ∥δ_k∥² = 1.

Up to permutation, such a parametrization is locally conic and identifiable. After the reparametrization, the penalized likelihood function (2.18) can be rewritten as

ℓ^P(η, ζ) = ∑_{i=1}^{n} log[f(X_i, Y_i, δ_i; η, ζ)] − nγ ∑_{k=1}^{K} [log(ϵ + π_k) − log(ϵ)]. (3.1)
Theorem 3.1. Under conditions A1–A5 and the restrictions on ζ, if lim_{n→∞} √n γ = c and ϵ = o(1/(√n log n)), where c is a constant, then there exists a local maximizer (η, ζ) of Eq. (3.1) such that η = O_p(n^{−1/2}), and, for such a local maximizer, the number of mixture components satisfies K̂ → K_0 with probability tending to one.
Theorem 3.1 indicates that, by choosing an appropriate tuning parameter γ and a small constant ϵ, the number of mixing components can be selected correctly with probability tending to one.

Next, we establish the asymptotic normality of the estimator θ̂ in Theorem 3.2. As Khalili and Chen (2007) discussed, if the number of mixing components is estimated consistently in a finite mixture of regression models, then the asymptotic properties of the estimated parameters are the same as in the oracle case, in which the true number of mixing components is known a priori.
Table 1
Bias and standard deviation (in parentheses) of parameter estimators based on 100 simulations for various sample sizes and censoring rates.

Sample size   Censoring rate   Component   Mixing probability   β1              β2
n = 600       5%               1           −0.010(0.015)        −0.023(0.167)   −0.006(0.128)
n = 600       5%               2           0.010(0.015)         0.033(0.143)    0.008(0.130)
n = 600       25%              1           −0.010(0.014)        −0.003(0.147)   −0.001(0.111)
n = 600       25%              2           0.010(0.014)         0.023(0.126)    0.008(0.123)
n = 900       5%               1           −0.010(0.014)        −0.016(0.111)   −0.015(0.090)
n = 900       5%               2           0.010(0.014)         0.005(0.095)    0.015(0.092)
n = 900       25%              1           −0.009(0.014)        0.016(0.122)    0.009(0.106)
n = 900       25%              2           0.009(0.014)         0.007(0.102)    0.012(0.098)
Theorem 3.2. Under conditions A1–A5 and the conditions in Theorem 3.1, the estimator θ̂ is asymptotically normal:

√n (θ̂ − θ_0) →_D N(0, B^{−1} A B^{−1}),

where A = Var{ ∂ψ(θ)/∂θ |_{θ=θ_0} } and B = E{ −∂²ψ(θ)/∂θ ∂θ^⊤ |_{θ=θ_0} }.

Theorem 3.2 characterizes the √n convergence rate of the proposed estimator and its asymptotic normality.
4. Numerical studies
In this section, we conduct a set of Monte Carlo simulation studies and real data analysis to assess the finite-sample
performance of the proposed method.
4.1. Simulation 1
In this simulation, we generate T_i from the following group-specific linear transformation model:

H(T_i) = β_{k,1} X_{i,1} + β_{k,2} X_{i,2} + ε_i, i = 1, 2, . . . , n; k = 1, 2,

where H(t) = log(2(e^{4t} − 1)) and ε_i follows the standard extreme-value distribution. In this case, the linear transformation model is equivalent to the Cox proportional hazards model. We generate samples from a two-component Cox proportional hazards model with mixing weights π_1 = 1/3, π_2 = 2/3, and β_1 = (−3, −2)^⊤, β_2 = (1, 1)^⊤. The covariates X_i are generated from a multivariate normal distribution with mean zero and first-order autoregressive covariance structure Σ = (σ_{st}) with σ_{st} = 0.5^{|s−t|} for s, t = 1, 2. The censoring time is generated from a uniform distribution on [0, C], where C is chosen to achieve censoring proportions of 5% and 25%.
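A hedged R sketch of this data-generating process is given below; it assumes the cumulative baseline hazard Λ_0(t) = 2(e^{4t} − 1) implied by H, and Cmax is a placeholder tuned to hit the target censoring rate.

    # Sketch of the Simulation 1 design: two-component transformation model
    # H(T) = beta_k' X + eps with standard extreme-value errors, where
    # H(t) = log(2(exp(4t) - 1)). `Cmax` is tuned to the censoring target.
    library(MASS)
    set.seed(1)
    n <- 600; Cmax <- 5
    z <- sample(1:2, n, replace = TRUE, prob = c(1/3, 2/3))   # latent labels
    X <- mvrnorm(n, mu = c(0, 0),
                 Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))     # AR(1) covariates
    beta <- rbind(c(-3, -2), c(1, 1))
    eta <- rowSums(X * beta[z, ])                             # beta_{z_i}' X_i
    eps <- log(-log(runif(n)))                # standard extreme-value errors
    T0 <- log(exp(eta + eps) / 2 + 1) / 4     # T = H^{-1}(eta + eps)
    C <- runif(n, 0, Cmax)
    Y <- pmin(T0, C); delta <- as.numeric(T0 <= C)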
We run our proposed penalized likelihood method 100 times. The maximum initial number of components is set to 10, and the initial mixing proportions are set to π^{(0)} = (1/10, 1/10, . . . , 1/10)^⊤. In our numerical study, we set the tuning parameter γ = c√(log(n)/n), where c is selected by the proposed BIC. Fig. 2 shows the evolution of survival curves
across different number of groups for the modified EM algorithm, with the maximum number of components being 10.
We can see that, as the penalization proceeds, the final estimates of the survival curves follow closely the true survival
curves.
The left panel of Fig. 3 shows a histogram of the estimated component numbers given ten initial components. The
proposed method identifies the correct number of components with high probability. The right panel of Fig. 3 shows the
evolution of the penalized likelihood function for the simulated data from one run, showing how the proposed modified
EM algorithm converges numerically.
Moreover, we summarize in Table 1 the estimation of regression coefficients and mixing proportions when the number
of components is correctly identified. Table 1 shows that the modified EM algorithm accurately estimates the parameters
under different combinations of sample size and censoring proportions. To visualize the results in Table 1, Fig. 4 shows
boxplots of the estimated mixing proportions and Fig. 5 shows boxplots of estimated regression coefficients for the two
identified subgroups.
4.2. Simulation 2
In this simulation, we aim to examine the performance of our proposed Cox PH mixture model in Section 2.4 when
the mixing probability π varies with covariates. The data generating process is shown as follows:
• The latent variable Z follows a binomial distribution with the parameter π = (π1 , π2 ).
• Given the latent variable Z = k, the covariate variable x follows a bivariate gaussian distribution N µk , Σ k .
( )
Fig. 2. A typical run. (a) Survival curves for a simulated data set. (b) Randomly initialized survival curves for K = 10 components. (c) One intermediate set of survival curves for K = 4. (d) Final estimated survival curves for K = 2.
Fig. 3. Histograms of estimated numbers of components (left panel) and the penalized log-likelihood function for one typical run (right panel).
Fig. 4. Boxplot of mixing probability based on 100 simulations with censoring rate 25% when n = 900.
Fig. 5. Boxplots of regression coefficients for two groups based on 100 simulations.
Table 2
Bias and standard deviation (in parentheses) of parameter estimators based on 100 simulations.

Component   π               β1             β2              µ1              µ2              Σ1              Σ2
1           −0.003(0.002)   0.005(0.102)   0.011(0.066)    −0.005(0.039)   −0.010(0.058)   −0.011(0.133)   −0.001(0.012)
2           0.003(0.002)    0.011(0.053)   −0.002(0.113)   0.009(0.068)    0.000(0.020)    −0.034(0.140)   −0.001(0.014)
• Given x, the mixing probability of the kth component can be written as

P(Z = k | x) = π_k φ(x; µ_k, Σ_k) / ∑_{l=1}^{2} π_l φ(x; µ_l, Σ_l).

• Given x and Z = k, the duration time T is generated from the linear transformation model, with the same setting as in Simulation 1.
In particular, we generate 1000 samples from a two-component Cox PH mixture model with π_1 = π_2 = 0.5, µ_1 = (−1, 1)^⊤, µ_2 = (0, −√2)^⊤, Σ_1 = [0.65, 0.7794; 0.7794, 1.55], Σ_2 = diag{2, 0.2}, and group-specific regression coefficients β_1 = (−1, −1)^⊤, β_2 = (1, 1)^⊤.
We run our proposed penalized likelihood algorithm 100 times. The maximum initial number of components is set to 10; the initial values are π_k^{(0)} = 1/10, k = 1, . . . , 10, and µ_k, Σ_k are initialized by k-means clustering. When the number of components is correctly identified, we summarize the estimation results of the Cox PH mixture model in Table 2. For the covariance matrices Σ_k, we use the eigenvalues to evaluate accuracy. Table 2 shows that the proposed algorithm yields accurate estimates of the model parameters.
Table 3
Characteristics of 1000 borrowers.

ID   Variable description
1    Status of checking account
2    Duration in month
3    Credit history
4    Purpose
5    Credit amount
6    Savings account
7    Present employment
8    Installment rate
9    Personal status and gender
10   Other debtors or guarantors
11   Present residence
12   Property
13   Age in years
14   Other installment plans
15   Housing
16   Telephone
17   Foreign worker
18   Job
19   Number of existing credits at this bank
20   Number of people to provide maintenance
21   Default or no default
Table 4
Parameter estimates and standard errors (in parentheses) for the Cox PH mixture model, along with the one-class Cox PH model.
Variable Cox PH model Subgroup 1 Subgroup 2
Credit amount −0.56(0.09) −1.94(0.22) −0.29(0.08)
Present residence – −0.15(0.10) –
Installment rate – – −0.27(0.10)
Own check account 0.94(0.18) 1.48(0.10) 0.44(0.24)
Purpose – – 0.46(0.23)
Saving account less than 100 DM 0.47(0.15) 0.33(0.20) 0.60(0.22)
Other debtors – −0.45(0.35) –
Other installment plan −0.37(0.15) – −0.53(0.21)
Own house −0.32(0.15) −0.44(0.20) −0.23(0.20)
Telephone – −0.35(0.21) –
Mixing probability – 0.542 0.458
4.3. The German credit data
We now apply the proposed method to the German credit data set which is available at the UCI repository.3 The
original data set consists of 1000 records and 21 features. Of these 1000 observations, 300 debtors defaulted, and the remaining records (70% of the data set) are treated as right-censored cases. The time to default ranges from 4 to
72 months. Table 3 lists the characteristics of the borrowers, taken from the original data source. In the subsequent data
analysis, the continuous variables are scaled to mean zero and unit variance. The categorical variables are coded using
dummy variables.
The primary goal for banks is to understand to whom they are lending. Once banks know the attributes and default risks of their debtors, they can divide their borrowers into several risk groups and decide whether to lend. To calibrate default risk using our proposed model, we need to simultaneously determine the number of components and the group-specific significant factors for the Cox PH mixture model.
Step 1: Fitting the proposed Cox PH mixture model
We first randomly split the data into a training set and a testing set with proportions 0.8 and 0.2, respectively. Then, a penalized one-class Cox proportional hazards model was fitted to the survival data to select significant variables using the glmnet R package of Friedman et al. (2009). Note that the regression parameters estimated by the LASSO are biased; motivated by the post-LASSO approach proposed by Belloni and Chernozhukov (2013), we re-estimate the regression coefficients by refitting the Cox PH model with the selected variables. The results are presented in the second column of Table 4.
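A hedged sketch of this screening step follows, using the classical glmnet interface for Cox models; X, Y, and delta denote the design matrix, observed times, and default indicators of the training set.

    # Sketch of Step 1's variable screening: LASSO-penalized Cox fit with
    # glmnet, followed by a post-LASSO refit on the selected variables.
    library(glmnet); library(survival)
    y <- cbind(time = Y, status = delta)            # Cox response for glmnet
    cvfit <- cv.glmnet(X, y, family = "cox")
    sel <- which(as.vector(coef(cvfit, s = "lambda.min")) != 0)
    refit <- coxph(Surv(Y, delta) ~ X[, sel])       # post-LASSO refit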
Next, we apply our proposed penalized likelihood approach to simultaneously determine the number of mixture components and identify the group-specific significant factors. The initial number of mixture components is set to ten, and the modified EM algorithm identifies two groups with mixing probabilities 0.542 and 0.458. The corresponding estimated regression coefficients are presented in the last two columns of Table 4.
From Table 4, we can conclude that the credit amount, the status of the checking account, the status of the savings account, and housing are significant variables whether we fit the one-class model or the two-class mixture model. The regression coefficients of credit amount are negative, which implies that the larger the credit amount offered by the bank, the less likely the borrower is to default. A possible explanation is that banks tend to extend larger credit amounts to those with good credit. There is a higher risk of default if the savings account holds less than 100 DM or the borrower owns no house.
3 [Link]
Fig. 6. The status of checking account. (a) histogram of the status of checking account. (b) estimated survival curves for debtors owning a checking
account and for debtors with no checking account.
For the status of the checking account, we can see that the hazard ratios (e^β) of owning a checking account are both larger than 1, which says that owning a checking account correlates with a higher probability of default. To interpret this counter-intuitive finding, we first look at the original distribution of this variable. Fig. 6(a) shows the histogram of the status of the checking account,⁴ where 0 (1) denotes a good (default) record; the default frequency differs from one sub-category to another. Overall, A11 has the most bad credit records, perhaps because debtors whose checking account balance is below 0 DM tend to be riskier. Moreover, we group the debtors whose checking account belongs to A11–A13 into the category of owning a checking account and plot in Fig. 6(b) the survival curves both for those owning a checking account and for those owning no checking account. Fig. 6(b) implies that, if a borrower owns a checking account, then he or she is more likely to default.
From Table 4, we also observe that the significant factors are not exactly the same across the mixture components. For example, the parameter estimate of other debtors in the first group is −0.45, implying that for debtors who have co-applicants or guarantors, the hazard rate is 0.64 (e^{−0.45}) times that of those who have no co-applicants or guarantors, holding other effects unchanged. The parameter estimate of other installment plans in the second group is −0.53, implying that for borrowers who have no other installment plans, the hazard rate is 0.59 (e^{−0.53}) times that of those who have other installment plans (see Fig. 6).
Step 2: Model evaluation
We use the time-dependent area under the receiver operating characteristic (ROC) curve (Heagerty et al., 2000) for censored survival data to evaluate model performance. The area under the ROC curve can take values ranging from 0.5 to 1. The time-dependent ROC curve, denoted ROC(t), is defined as follows. Let T_i denote the default time; we use the counting process D_i(t) = 1 if T_i ≤ t and D_i(t) = 0 if T_i > t to denote the default status at any time t, with D_i(t) = 1 indicating that subject i has defaulted prior to time t. The ROC curves display the relationship between a covariate X_i and a binary default variable D_i by plotting estimates of the sensitivity, P(X > c | D = 1), against one minus the specificity, 1 − P(X ≤ c | D = 0), for all possible values c, i.e.,

sensitivity(c, t) = P{X > c | D(t) = 1},
specificity(c, t) = P{X ≤ c | D(t) = 0}.
Using these definitions, the corresponding ROC curve at any time t, ROC(t), can be defined. By Bayes' theorem, the sensitivity and the specificity can be rewritten as

P{X > c | D(t) = 1} = {1 − S(t | X > c)} P(X > c) / {1 − S(t)},
P{X ≤ c | D(t) = 0} = S(t | X ≤ c) P(X ≤ c) / S(t),

where S(t) = P(T > t) is the survival function and S(t | X > c) is the conditional survival function for the subset defined by X > c, which can be estimated by the Kaplan–Meier (KM) approach. The survivalROC package in R can be used to calculate the time-dependent ROC.
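A hedged sketch with the survivalROC package follows; risk_score is a hypothetical marker for each subject, e.g. one minus the predicted survival probability at the evaluation time.

    # Sketch: time-dependent ROC/AUC at t = 36 months, using the KM-based
    # estimator of Heagerty et al. (2000). `risk_score` is a placeholder.
    library(survivalROC)
    roc36 <- survivalROC(Stime = Y, status = delta, marker = risk_score,
                         predict.time = 36, method = "KM")
    roc36$AUC   # area under the time-dependent ROC curve at t = 36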
4 Note that the status of the checking account is a categorical variable: A11 indicates a checking account balance below 0 DM; A12 indicates 0 DM ≤ balance < 200 DM; A13 indicates a balance of at least 200 DM together with salary assignments for at least one year; and A14 represents a debtor with no checking account.
Fig. 7. ROC curves of training data for assessing the predictive ability of one-class model over different times.
To assess the performance of the proposed Cox PH mixture model, we compare the one-class Cox PH model with the two-class Cox PH model in terms of the AUC (area under the ROC curve). Figs. 7 and 8 show the ROC curves of the training data and the testing data, respectively, for assessing the predictive ability of the one-class and two-class models across different times. The two-class model yields improved prediction of survival patterns across different survival times compared with the one-class model. For example, in the training data at t = 36 months, the two-class model achieved an AUC of 0.87, which compares favorably with the 0.76 achieved by the one-class model; in the testing data at t = 28 months, the two-class model achieved an AUC of 0.938 versus 0.833 for the one-class model. The proposed modeling approach therefore outperforms the traditional one-class model for predicting the survival time of debtors.
5. Discussion
This paper develops survival analysis techniques for credit risk modeling in finance. To accommodate heterogeneous time-to-default data, we propose a Cox PH mixture model with an unknown number of mixing components. The proposed method is a natural generalization of the classical Cox proportional hazards model that allows group-specific covariate effects. Its advantage, compared with conventional methods, is that it simultaneously estimates the model parameters and the number of mixing components from the data. We also provide the identification condition for the proposed model, which depends on the regression coefficients rather than the baseline hazard functions. Finally, we establish the model-selection consistency and the asymptotic normality of the estimated model parameters.

Moreover, we extend the Cox PH mixture model to a more general case in which the mixing proportions vary with the covariates. We use the idea of the localized Mixture of Experts (MoE) to simultaneously determine the number of mixing components and estimate the model parameters. In practice, we are often faced with the variable selection problem in the presence of high-dimensional covariates. One way to solve this problem is to add another penalty to our proposed penalized likelihood function. We have implemented this approach in the numerical studies and found that significant variables can be identified and allowed to differ across the subgroups. Theoretical results in this direction will be pursued in our future work.
Fig. 8. ROC curves of testing data for assessing the predictive ability of one-class model over different times.
Acknowledgments
The authors would like to thank the editor, associate editor and two anonymous referees for many helpful comments
and suggestions, which have substantially improved the paper. Dr. Pei’s research was supported by the National Natural
Science Foundation of China (NSFC) (No. 11901351) and Shandong Social Science Planning Fund (No. 21DTJJ04). Dr. Peng’s
research was supported by The Hong Kong Research Grant Council [12303618, 12302022], the Initiation Grant for Faculty
Niche Research Areas RC-FNRA-IG/20-21/SCI/05 from Hong Kong Baptist University and the National Natural Science
Foundation of China (NSFC) (No. 11871409, No. 119710018). Dr. Xu’s research was supported by The Hong Kong Research
Grant Council (17308820) and the National Natural Science Foundation of China (NSFC) (No. 72033002).
Appendix A. Supplementary data
Supplementary material related to this article can be found online at [Link]
References
Bai, M., Zheng, Y., Shen, Y., 2022. Gradient boosting survival tree with applications in credit scoring. J. Oper. Res. Soc. 73 (1), 39–55.
Belloni, A., Chernozhukov, V., 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2), 521–547.
Bellotti, T., Crook, J., 2009. Credit scoring with macroeconomic variables using survival analysis. J. Oper. Res. Soc. 60 (12), 1699–1707.
Bordes, L., Chauveau, D., 2016. Stochastic EM-like algorithms for fitting finite mixture of lifetime regression models under right censoring. In: Joint
Statistical Meeting 2016. pp. 1735–1746.
Cao, R., Vilar, J.M., Devia, A., 2009. Modelling consumer credit risk via survival analysis. SORT 33 (1), 3–30.
Cox, D.R., 1972. Regression models and life-tables. J. R. Stat. Soc. Ser. B Stat. Methodol. 34 (2), 187–202.
Dacunha-Castelle, D., Gassiat, E., 1999. Testing the order of a model using locally conic parametrization: population mixtures and stationary ARMA processes. Ann. Statist. 27 (4), 1178–1209.
Dirick, L., Bellotti, T., Claeskens, G., Baesens, B., 2019. Macro-economic factors in credit risk calculations: including time-varying covariates in mixture
cure models. J. Bus. Econom. Statist. 37 (1), 40–53.
Dirick, L., Claeskens, G., Baesens, B., 2017. Time to default in credit scoring using survival analysis: a benchmark study. J. Oper. Res. Soc. 68 (6),
652–665.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 (456), 1348–1360.
Friedman, J., Hastie, T., Tibshirani, R., 2009. glmnet: Lasso and elastic-net regularized generalized linear models. R package version 1.4.
Heagerty, P.J., Lumley, T., Pepe, M.S., 2000. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56 (2),
337–344.
Huang, T., Peng, H., Zhang, K., 2017. Model selection for Gaussian mixture models. Statist. Sinica 27 (1), 147–169.
Huang, M., Yao, W., 2012. Mixture of regression models with varying mixing proportions: a semiparametric approach. J. Amer. Statist. Assoc. 107
(498), 711–724.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E., 1991. Adaptive mixtures of local experts. Neural Comput. 3 (1), 79–87.
Jiang, Y., Yu, C., Ji, Q., 2018. Model selection for the localized mixture of experts models. J. Appl. Stat. 45 (11), 1994–2006.
Johansen, S., 1983. An extension of Cox’s regression model. Internat. Statist. Rev. 51 (2), 165–174.
Jordan, M.I., Jacobs, R.A., 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 6 (2), 181–214.
Khalili, A., Chen, J., 2007. Variable selection in finite mixture of regression models. J. Amer. Statist. Assoc. 102 (479), 1025–1038.
Ma, S., Huang, J., 2017. A concave pairwise fusion approach to subgroup analysis. J. Amer. Statist. Assoc. 112 (517), 410–423.
McLachlan, G., McGiffin, D., 1994. On the role of finite mixture models in survival analysis. Stat. Methods Med. Res. 3 (3), 211–226.
Nagpal, C., et al., 2021. Deep Cox mixtures for survival regression. In: Machine Learning for Healthcare Conference. PMLR, pp. 674–708.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6 (2), 461–464.
Shen, J., He, X., 2015. Inference for subgroup analysis with a structured logistic-normal mixture model. J. Amer. Statist. Assoc. 110 (509), 303–312.
Shi, S., Tse, R., Luo, W., D'Addona, S., Pau, G., 2022. Machine learning-driven credit risk: a systemic review. Neural Comput. Appl. 34, 14327–14339.
Thomas, L.C., Edelman, D.B., Crook, J.N., 2002. Credit Scoring and Its Applications. SIAM.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1), 267–288.
Wu, R., Zheng, M., Yu, W., 2016. Subgroup analysis with time-to-event data under a logistic-cox mixture model. Scand. J. Stat. 43 (3), 863–878.
Xu, L., Jordan, M.I., Hinton, G.E., 1995. An alternative model for mixtures of experts. Adv. Neural Inf. Process. Syst. 633–640.
Xu, P., Peng, H., Huang, T., 2018. Unsupervised learning of mixture regression models for longitudinal data. Comput. Statist. Data Anal. 125, 44–56.
Yan, X., Yin, G., Zhao, X., 2021. Subgroup analysis in censored linear regression. Statist. Sinica 31 (2), 1027–1054.
You, N., He, S., Wang, X., Zhu, J., Zhang, H., 2018. Subtype classification and heterogeneous prognosis model construction in precision medicine.
Biometrics 74 (3), 814–822.