Dear Author

Here are the proofs of your article.

• You can submit your corrections online or by fax.


• For online submission please insert your corrections in the online correction form.
Always indicate the line number to which the correction refers.
• Please return your proof together with the permission to publish confirmation.
• For fax submission, please ensure that your corrections are clearly legible. Use a
fine black pen and write the correction in the margin, not too close to the edge of the
page.
• Remember to note the journal title, article number, and your name when sending
your response via e-mail, fax or regular mail.
• Check the metadata sheet to make sure that the header information, especially
author names and the corresponding affiliations are correctly shown.
• Check the questions that may have arisen during copy editing and insert your
answers/corrections.
• Check that the text is complete and that all figures, tables and their legends are
included. Also check the accuracy of special characters, equations, and electronic
supplementary material if applicable. If necessary refer to the Edited manuscript.
• The publication of inaccurate data such as dosages and units can have serious
consequences. Please take particular care that all such details are correct.
• Please do not make changes that involve only matters of style. We have generally
introduced forms that follow the journal’s style.
Substantial changes in content, e.g., new results, corrected values, title, and authorship, are not allowed without the approval of the responsible editor. In such a case, please contact the Editorial Office and return the editor's consent together with the proof.
• If we do not receive your corrections within 48 hours, we will send you a reminder.

Please note
Your article will be published Online First approximately one week after receipt of
your corrected proofs. This is the official first publication citable with the DOI.
Further changes are, therefore, not possible.
After online publication, subscribers (personal/institutional) to this journal will have
access to the complete article via the DOI using the URL:
[Link]
If you would like to know when your article has been published online, take advantage
of our free alert service. For registration and further information, go to:
[Link]
Due to the electronic nature of the procedure, the manuscript and the original figures
will only be returned to you on special request. When you return your corrections,
please inform us if you would like to have these documents returned.
The printed version will follow in a forthcoming issue.
AUTHOR'S PROOF

Metadata of the article that will be visualized in OnlineFirst

Article Title: Improving covariance-regularized discriminant analysis for EHR-based predictive analytics of diseases
Article Sub-Title: —
Article Copyright Year: Springer Science+Business Media, LLC, part of Springer Nature 2020 (this will be the copyright line in the final PDF)
Journal Name: Applied Intelligence

Corresponding Author:
  Family Name: Sun
  Given Name: Zeyi
  Organization: Mininglamp Technology
  Division: Mininglamp Academy of Sciences
  Address: Beijing 100084, China
  E-mail: sunzeyi@[Link]

Author:
  Family Name: Yang
  Given Name: Sijia
  Organization: Beijing University of Posts and Telecommunications
  Division: School of Cyberspace Security, State Key Laboratory of Networking and Switching
  Address: Haidian, Beijing, China
  E-mail: ysjhhh@[Link]

Author:
  Family Name: Xiong
  Given Name: Haoyi
  Organization: Missouri University of Science and Technology
  Division: Department of Computer Science
  Address: Rolla 65409, MO, USA
  E-mail: [Link]@[Link]

Author:
  Family Name: Xu
  Given Name: Kaibo
  Organization: Mininglamp Technology
  Division: Mininglamp Academy of Sciences
  Address: Beijing 100084, China
  E-mail: xukaibo@[Link]

Author:
  Family Name: Wang
  Given Name: Licheng
  Organization: Beijing University of Posts and Telecommunications
  Division: School of Cyberspace Security, State Key Laboratory of Networking and Switching
  Address: Haidian, Beijing, China
  E-mail: wanglc@[Link]

Author:
  Family Name: Bian
  Given Name: Jiang
  Organization: University of Central Florida
  Division: Department of Electrical and Computer Engineering
  Address: Orlando 32816, FL, USA
  E-mail: bj1119@[Link]

Schedule:
  Received: —
  Revised: —
  Accepted: —
Abstract: Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, degrades significantly on High Dimension Low Sample Size (HDLSS) data because of the ill-posed inverse problem. Existing approaches for HDLSS data classification typically assume that the data in question follow a Gaussian distribution and address the HDLSS classification problem with regularization. However, these assumptions are too strict to hold in many emerging real-life applications, such as enabling personalized predictive analysis using Electronic Health Records (EHRs) data collected from an extremely limited number of patients who have been diagnosed with or without the target disease. In this paper, we revisit the problem of predictive analysis of disease using personal EHR data and an LDA classifier. To fill the gap, we first study an analytical model that characterizes the accuracy of LDA for classifying data with an arbitrary distribution. The model gives a theoretical upper bound on the LDA error rate that is controlled by two factors: (1) the statistical convergence rate of (inverse) covariance matrix estimators and (2) the divergence of the training/testing datasets from the fitted distributions. The error rate can thus be lowered by balancing the two factors for better classification performance. We further propose a novel LDA classifier, De-Sparse, that leverages De-sparsified Graphical Lasso to improve the estimation of LDA and outperforms state-of-the-art LDA approaches developed for HDLSS data. These advances and their effectiveness are further demonstrated by both theoretical analysis and extensive experiments on EHR datasets [Link]

Keywords (separated by ' - '): Linear discriminant analysis - De-sparsified graphical lasso - Electronic health records - High dimension low sample size

Footnote information: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
AUTHOR'S PROOF JrnlID 10489 ArtID 1810 Proof#1 - 07/07/2020

Applied Intelligence
[Link]

Improving covariance-regularized discriminant analysis for EHR-based predictive analytics of diseases

Sijia Yang1 · Haoyi Xiong2 · Kaibo Xu3 · Licheng Wang1 · Jiang Bian4 · Zeyi Sun3

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, degrades significantly on High Dimension Low Sample Size (HDLSS) data because of the ill-posed inverse problem. Existing approaches for HDLSS data classification typically assume that the data in question follow a Gaussian distribution and address the HDLSS classification problem with regularization. However, these assumptions are too strict to hold in many emerging real-life applications, such as enabling personalized predictive analysis using Electronic Health Records (EHRs) data collected from an extremely limited number of patients who have been diagnosed with or without the target disease. In this paper, we revisit the problem of predictive analysis of disease using personal EHR data and an LDA classifier. To fill the gap, we first study an analytical model that characterizes the accuracy of LDA for classifying data with an arbitrary distribution. The model gives a theoretical upper bound on the LDA error rate that is controlled by two factors: (1) the statistical convergence rate of (inverse) covariance matrix estimators and (2) the divergence of the training/testing datasets from the fitted distributions. The error rate can thus be lowered by balancing the two factors for better classification performance. We further propose a novel LDA classifier, De-Sparse, that leverages De-sparsified Graphical Lasso to improve the estimation of LDA and outperforms state-of-the-art LDA approaches developed for HDLSS data. These advances and their effectiveness are further demonstrated by both theoretical analysis and extensive experiments on EHR datasets [Link]

Keywords Linear discriminant analysis · De-sparsified graphical lasso · Electronic health records · High dimension low sample size


1 Introduction

Linear Discriminant Analysis (LDA) [1] is a well-known technique for feature extraction and dimension reduction. It has been widely used in many applications [2, 3], such as face recognition and image retrieval. Typically, LDA finds the projection directions such that, for the projected data, the between-class variance is maximized relative to the within-class variance, thus achieving maximum discrimination. An intrinsic limitation of classical LDA is that its objective function requires the nonsingularity of one of the scatter matrices. For many applications, such as microarray data analysis, all scatter matrices can be singular or ill-posed, since the data often have high dimension but low sample size (HDLSS) [4].

Recently, many efforts have been brought to bear on such HDLSS problems. For example, Krzanowski et al. proposed a pseudo-inverse LDA to approximate the inverse covariance matrix when the sample covariance matrix is singular. However, the accuracy of pseudo-inverse LDA is usually low and not well guaranteed [5]. Another technique to alleviate this problem is a two-stage algorithm, i.e., PCA+LDA [6, 7]. More popularly, regularized LDA approaches have been proposed to solve the problem and improve the performance [8]. For example, researchers proposed a series of algorithms to regularize the covariance matrix estimation [2, 5, 9]. The regularized linear discriminant hyperplane was studied in [4, 10, 11]. All regularized LDA approaches aim to improve LDA by regularizing the estimation of key parameters used in LDA, such as the covariance matrix and/or the linear coefficients for discrimination.

✉ Zeyi Sun
sunzeyi@[Link]

Extended author information available on the last page of the article.
Sijia Yang et al.

One representative regularized LDA approach is Covariance Regularized Discriminant Analysis (CRDA), proposed in [9] based on the sparse inverse covariance estimation leveraging Graphical Lasso [12]. CRDA was originally proposed to estimate the inverse covariance matrix via a shrunken estimator, so as to achieve "superior prediction". Intuitively, by replacing the sample covariance matrix used in LDA with the regularized estimate, the HDLSS problem can be well addressed, since regularized estimators usually outperform the sample covariance matrix estimator [13]. To better elucidate the performance of LDA classifiers with uncertain covariance matrix estimates for Gaussian data classification, [14] studied a model of the error rate by matching the estimated vs. true covariance matrices, and the estimated vs. true means. While it is reasonable to assume that the estimated mean can easily converge to the population/true mean, the population/true covariance matrix is usually unknown and can be very different from the estimated one [13]. For example, the largest eigenvalue of the sample covariance matrix, which represents the principal component of the data distribution, is not consistent with the population one, and the eigenvectors of the sample covariance matrix can be almost orthogonal to the truth under HDLSS [15, 16]. Further, the data for classification are usually Non-Gaussian. Thus, it is highly desirable to develop a new analytical model that characterizes the error rate for data with arbitrary distribution (both Gaussian and Non-Gaussian). Two "known factors" of covariance matrix estimation are useful for developing such analytical models: one is the convergence rate, and the other is the sparsity/density of (inverse) covariance matrix estimators [13]. The sparsity/density is already known once the (inverse) covariance matrix is estimated. The convergence rates reflect the maximal error of estimation, and for some estimators they are well bounded under certain assumptions, such as the spectral-norm convergence rate of Graphical Lasso [17].

Among the wide range of HDLSS data classification tasks, in this work we focus on the problem of using LDA to classify EHRs [19] for personalized predictive analytics of a target disease. EHRs play a critical role in modern health information management and service innovations. A patient's EHR contains his/her histories of medical visits, medications, diagnoses, treatment plans, allergies, and so on, as shown in Fig. 1. At each visit, a diagnosis record is updated indicating the disease state, i.e., a set of codes referring to the diseases diagnosed at the time of the visit. One significant feature of EHRs is their interchangeability, as a standard protocol for medical/health data generation, storage, and communication. The health information is built and managed by authorized institutions in a unified digital format (e.g., the ICD-9/10 and CPT-9/10 codes used in EHR standards) such that researchers and scientists can share and analyze the EHR data to enable innovative health services, such as providing computer-assisted diagnosis and offering medication advice. Among these services, predictive analytics of diseases (namely, early detection of diseases) using patients' past longitudinal health information in the EHR system has recently attracted significant attention from research communities.

There has been a series of works [19–24] that attempt to predict the future diseases of patients through data mining techniques applied to EHR data. Prior literature usually first selects important features, such as diagnosis frequencies [19], pairwise diagnosis transitions [22], and graphs of diagnosis sequences [24], to represent the EHR data of the patients. Then, a wide range of supervised learning algorithms is adopted to build predictive models for early disease detection on top of the well-represented EHR data.

In this paper, we first propose a novel analytical model for the LDA error rate, based on the statistical convergence of (inverse) covariance matrix estimators and the divergence from the Gaussian distributions. Guided by the proposed analytical model, we propose a novel LDA classifier leveraging (inverse) covariance matrix estimators with a faster convergence rate. We apply our model to a large-scale EHR dataset for the predictive analytics of diseases and demonstrate the advantage of the proposed algorithms over other state-of-the-art methods. Specifically, this paper makes the following contributions.

1. We study the problem of high-dimensional linear classification using LDA models and propose a novel analytical model, derived from the existing LDA models on Gaussian data [14, 25], to understand the LDA error rate for both Gaussian and Non-Gaussian data, with respect to the statistical error of covariance matrix estimation and the divergence between the fitted Gaussian distribution and the data.
2. On top of the analytical model, we propose De-Sparse, which extends the well-known baseline approach, Covariance Regularized Discriminant Analysis (CRDA) [9, 26], using De-sparsified Graphical Lasso [27]. Theoretical analysis based on the proposed analytical model shows that De-Sparse can bound the maximal error rate under mild conditions. Compared to CRDA, De-Sparse achieves a lower error rate, due to the faster convergence rate of De-sparsified Graphical Lasso.
3. To show the practical contribution of the proposed method, we evaluate De-Sparse extensively through experiments with large-scale, real-world EHR datasets [28]. In the experiments, we use De-Sparse to predict the risk of mental health disorders in college students from ten U.S. universities, using their EHR data from
[Figure 1 near here: a patient's sequence of m medical-visit records, each listing the diagnosis codes (disease states) assigned at that visit.]

Fig. 1 Medicare visits and Electronic Health Records (EHRs). The EHRs of a patient consist of the records of diagnoses and treatments. In this example, m medical visits have taken place in total. At every medical visit, the patient receives a set of ICD/CPT codes [18] referring to the diseases/treatments that have been diagnosed/carried out. One can enable the early diagnosis/detection of diseases by classifying the EHR data with big data and machine learning techniques.
primary care visits. We compared De-Sparse with seven baseline algorithms, including other regularized LDA models and downstream classifiers. The evaluation results show that De-Sparse outperforms all baselines and further validate our theoretical analysis.

The paper is organized as follows. In Section 2, we review the problem of high-dimensional linear classification using LDA models and summarize the existing work on EHR-based predictive analytics of diseases. In Section 3, we first introduce the existing covariance-regularized discriminant analysis (CRDA) based on Graphical Lasso, and then present de-sparsified covariance-regularized LDA algorithms, based on novel de-sparsified inverse covariance matrix estimators, to classify EHR samples for predictive analytics. In Section 4, we validate the proposed algorithms on real-world datasets through extensive experiments. Finally, we conclude the paper in Section 5.

2 Preliminaries

2.1 LDA for binary classification

To use Fisher's Linear Discriminant Analysis (FDA), given N labeled data pairs (x_1, l_1), (x_2, l_2), ..., (x_N, l_N), where each x_i (1 ≤ i ≤ N) is a d-dimensional vector, we first estimate the sample covariance matrix (a symmetric d × d matrix) using the maximum likelihood estimator:

\bar{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{\mu})(x_i - \bar{\mu})^\top, \qquad (1)

where μ̄ refers to the d-dimensional mean vector of all N training samples (x_1, l_1), ..., (x_N, l_N). Then, μ̄_+ and μ̄_− are estimated as the mean vectors of the positive training samples and the negative training samples among the N training samples, respectively.

Corollary 1 (Fisher's Discriminant Analysis for Binary Classification [1]) Given all estimated parameters Σ̄, μ̄_+, and μ̄_−, the FDA model classifies a new data vector x by the rule

f_{\bar{\Sigma}}(x) = \mathrm{sign}\left(\log\frac{x^\top \bar{\Sigma}^{-1}\bar{\mu}_+ - \frac{1}{2}\bar{\mu}_+^\top \bar{\Sigma}^{-1}\bar{\mu}_+ + \log \pi_+}{x^\top \bar{\Sigma}^{-1}\bar{\mu}_- - \frac{1}{2}\bar{\mu}_-^\top \bar{\Sigma}^{-1}\bar{\mu}_- + \log \pi_-}\right), \qquad (2)

where sign(·) : ℝ → {±1} refers to the sign function, and π_+ and π_− refer to the (foreknown) frequencies of positive samples and negative samples in the whole population.

To present LDA with other covariance matrix estimators based on the LDA paradigm in (2), we use the following notation.

Notation In the rest of this paper, we denote by f_Σ̃(x) an LDA classifier with a specific covariance matrix estimator Σ̃, using the sample-estimated mean vectors μ̄_− and μ̄_+. When Σ̃ = Σ̄, the classifier f_Σ̄(x) becomes the traditional Fisher's Linear Discriminant Analysis. When Σ̃ = Θ̂^{-1} and Θ̂ is the Graphical Lasso estimator [12], then f_{Θ̂^{-1}}(x) refers to the covariance-regularized LDA [9, 26].

Apparently, the performance of LDA depends on (1) whether the realistic training/testing datasets follow Gaussian distributions and (2) how the mean vectors and inverse covariance matrices are estimated from the datasets.
2.2 Electronic health records and predictive analytics of diseases

Prior to learning a predictive analytic model for certain diseases, one needs to model the EHR data with a suitable data representation. The most simple yet effective way to represent EHR data is to use the diagnosis-frequency vector [29–31], which is similar to the Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) approaches that deal with traditional NLP data [32–34]. Given each patient's EHR data (shown in Fig. 1), this representation method first retrieves the diagnosis codes [35] recorded during each visit. Inspired by Natural Language Processing (NLP) and text mining practices [36], researchers have also proposed using deep-learning-based NLP approaches to embed EHR records for data representation learning [37–42]. For example, [38] studied "Patient2Vec", which embeds patients' past EHR records into vectors while preserving structural information for personalized predictive analysis. Bai et al. [40] focus on the interpolation and interpretability of EHR representation learning, where the authors balance the performance of predictive analysis against the understanding of the longitudinal disease progression of each individual patient, both using the EHR data with the learned representation. Comprehensive surveys can be found in [18, 43, 44].

In our work, we follow the line of research that uses the diagnosis-frequency vector of each patient for EHR-based predictive analysis [29–31], as the diagnosis frequencies within a certain duration can well characterize the health status of a patient, while the coefficients of LDA can represent the significance of every diagnosis code. The frequency of each diagnosis appearing in all past visits (of the last two years) is counted here, followed by a further transformation of the frequency of each diagnosis into a vector of frequencies. For example, a diagnosis-frequency vector can be illustrated as (1, 0, ..., 3), where 0 means the second diagnosis does not appear in any past visit. Note that the dimension d ≥ 15,000 when using the original ICD-9 codes, and d = 295 even when using clustered ICD-9 codes [45], while the number of training samples N in our experiments is significantly smaller than d.

2.3 Discussion on preliminaries

In our work, we revisit linear discriminant analysis as a classifier and learner for High-Dimensional and Low Sample Size (HDLSS) settings. Indeed, many efforts have been made in the literature on HDLSS data classification. For example, in addition to LDA-type methods, a number of feature extraction or variable selection methods have been studied. Lin et al. [46] proposed a feature selection algorithm to classify high-dimensional gene expression data by incorporating neighborhood entropy-based uncertainty measures. Over the rough set, the same group of authors [47] adopted a joint feature selection approach that incorporates neighborhood entropy and Fisher scores for tumor classification. Further, automatic feature weighting paradigms have been proposed to select features for gene expression data classification [48]. These studies demonstrate that feature selection algorithms can significantly improve the accuracy of HDLSS data classification while avoiding the full set of features. The over-reduction problem of LDA has been studied in [49]. In addition to EHR data, similar regularized projection methods have been used for the early diagnosis of diseases from biomedical health data [50, 51].

In terms of methodology, the work closest to this study comprises covariance-regularized linear discriminant analysis (CRDA) [9], Graphical Lasso [17], and the de-sparsified Graphical Lasso [27]. CRDA regularizes the estimation of (inverse) covariance matrices inside the estimation of LDA, improving the performance of LDA for both prediction and inference. The authors of [26] were the first to bring CRDA to EHR classification and early detection of diseases. We included the algorithms of [26] for comparison and found that De-Sparse outperformed CRDA with higher accuracy and F1-score. Compared to the Graphical Lasso estimator [17], which has been frequently used to enhance inverse covariance matrix estimation, our work follows the ideas of the de-biased estimator [52] and uses the de-sparsified Graphical Lasso estimator [27] to improve LDA for EHR classification. We provide a comprehensive discussion of the performance comparison between Graphical Lasso and its de-sparsified counterpart from the perspective of predictive analytics based on EHR data and LDA.

3 De-Sparse: De-sparsified covariance-regularized discriminant analysis

In this section, we first introduce the baseline algorithm based on Covariance-Regularized Discriminant Analysis (CRDA), which is derived from [26]. Then, we present the proposed algorithm De-Sparse, an extended Covariance-Regularized Discriminant Analysis via De-sparsified Graphical Lasso [27]. Finally, using our proposed analytical model of the LDA error rate, we compare the two methods and demonstrate the advantages of De-Sparse.

3.1 CRDA: The baseline covariance-regularized discriminant analysis via the Graphical Lasso inverse covariance matrix estimator

Compared to the sample LDA introduced in Section 2, CRDA [9, 26] was proposed to use an ℓ1-penalized inverse
AUTHOR'S PROOF JrnlID 10489 ArtID 1810 Proof#1 - 07/07/2020

Improving covariance-regularized discriminant analysis...

285 covariance matrix estimator to replace the inverse of To the end, the classification rule of CRDA is written as 327
286 sample covariance matrix. Given the labeled data pairs for follows 328
287 training (x1 , l1 ), (x2 , l2 ) . . . (xN , lN ), the algorithm first  
xΘ μ̄+ − 1 μ̄ 
2 + Θ μ̄+ + log π+
288 estimates the sample covariance matrix Σ̄ and the sample CRDA(x) = sign log ,
289 mean vectors μ̄+ , μ̄− using the algorithms introduced in xΘ μ̄− − 1 μ̄ 
− Θ μ̄− + log π−
2
290 Section 2.1. With the sample covariance matrix Σ̄, this (5)
method estimates a sparse inverse covariance matrix Θ 
291
which can be viewed as an LDA classifier using Θ −1 329
292 using the Graphical Lasso estimator [12] as follows.
as the covariance matrix. Apparently, the accuracy of 330
CRDA depends on how the covariance matrices and the 331
293 Corollary 2 (Graphical Lasso Estimator [12]) Given the
mean vectors are estimated. We are going to interpret the 332
294 sample estimation of the covariance matrix Σ̄, the Graph-
performance of CRDA in the Section 3.3. 333
295 ical Lasso estimator provides an 1 -regularized sparse
296 positive-definite approximation to the inverse covariance
 as follows. 3.2 De-Sparse : The improved algorithm 334
297 matrix (denoted as Θ)
of covariance-regularized LDA via de-sparsified 335
⎛ ⎞ graphical lasso 336

 = argmin⎝tr(Σ̄Θ) − log det(Θ) + λ
Θ |Θj k |⎠, (3)
Θ>0 As shown in (3), the estimator of sparse inverse covariance 337

F
j =k
matrix induced 1 -penalization and might hurt the estima- 338

O
where Θ ≻ 0 refers to the constraint of symmetric positive definiteness (SPD), the term tr(Σ̄Θ) − log det(Θ) refers to the negative log-likelihood of Θ over the sample estimate Σ̄, the term Σ_{j≠k} |Θ_{jk}| refers to the sum of the absolute values of the off-diagonal elements of Θ (i.e., the ℓ1-norm of Θ with the diagonal elements excluded), and λ refers to the tuning parameter that trades off sparsity against the fit to the samples. Please refer to [12] for the implementation of the algorithms.

Corollary 3 (Statistical Convergence of Graphical Lasso [17]) Suppose the random vector X has d dimensions and zero mean (i.e., X ∈ R^d and E(X) = 0), the population covariance matrix is Σ = E(XX^⊤), and the inverse of the population covariance matrix is Θ = Σ^{−1}. Given N samples x_1, x_2, x_3, ..., x_N randomly and independently drawn from X, the sample estimate of the covariance matrix is Σ̄ = (1/N) Σ_{i=1}^{N} x_i x_i^⊤.

With an increasing number of samples (N) and a growing number of data dimensions (d), the Graphical Lasso estimate Θ̂ based on the sample covariance matrix converges to the population estimate Θ in Frobenius norm under mild sparsity conditions, as follows [17]:

    ‖Θ̂ − Θ‖_F = O(√((d + s) log d / N)),    (4)

where s = max_{1≤i≤d} ‖Θ_i‖_0 refers to the maximal degree of the graph in Θ, ‖·‖_0 refers to the ℓ0-norm of the input vector, and Θ_i refers to the i-th column vector (1 ≤ i ≤ d) of the matrix Θ.

…tion due to the over-penalization or over-sparsification. To address this issue, we proposed a de-sparsified Graphical Lasso estimator [27] to replace the vanilla Graphical Lasso.

Corollary 4 (De-sparsified Graphical Lasso [27]) Given the Graphical Lasso estimator Θ̂ and the sample estimate Σ̄, we consider the inverse of the Graphical Lasso, Θ̂^{−1}, as an approximation to the covariance matrix. In this way, the bias of Θ̂^{−1} for covariance estimation, caused by the sparsity regularizer of the Graphical Lasso, can be written as follows:

    Ẑ = Σ̄ − Θ̂^{−1}.    (6)

Using the Kronecker product, the authors of [27] consider the potential bias term of Θ̂ against the inverse of the population covariance matrix as follows:

    Bias(Σ̄, Θ̂) = Θ̂ Ẑ Θ̂ = Θ̂ Σ̄ Θ̂ − Θ̂.    (7)

The de-sparsified Graphical Lasso estimator T̂ de-biases the Graphical Lasso estimator Θ̂ by removing the potential bias term caused by the sparsity regularizer, as follows:

    T̂ = Θ̂ − Bias(Σ̄, Θ̂) = 2Θ̂ − Θ̂ Σ̄ Θ̂.    (8)

On top of the Graphical Lasso, the de-sparsified Graphical Lasso estimator can efficiently approximate the inverse covariance matrix from the data, with a faster convergence rate under mild conditions.
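A minimal NumPy sketch of the de-biasing step in (6)–(8). The ridge-regularized inverse below is only a hypothetical stand-in for an actual Graphical Lasso fit (e.g., from the glasso package), since (8) applies to any estimator Θ̂ of the precision matrix:

```python
import numpy as np

def desparsify(theta_hat, sigma_bar):
    # Bias(Sigma_bar, Theta_hat) = Theta Z Theta = Theta Sigma Theta - Theta  (Eq. 7)
    # T = Theta - Bias = 2*Theta - Theta @ Sigma_bar @ Theta                  (Eq. 8)
    return 2.0 * theta_hat - theta_hat @ sigma_bar @ theta_hat

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))        # N = 500 zero-mean samples, d = 5
sigma_bar = X.T @ X / len(X)             # sample covariance (zero-mean model)

# Hypothetical stand-in for the Graphical Lasso estimate Theta_hat:
theta_hat = np.linalg.inv(sigma_bar + 0.1 * np.eye(5))
T = desparsify(theta_hat, sigma_bar)     # de-biased inverse-covariance estimate

# Sanity check: if Theta_hat were exactly Sigma_bar^{-1}, the bias term (6)
# vanishes and de-sparsifying changes nothing.
theta_exact = np.linalg.inv(sigma_bar)
assert np.allclose(desparsify(theta_exact, sigma_bar), theta_exact)
```

Note that T̂ stays symmetric whenever Θ̂ is, since Θ̂Σ̄Θ̂ is a symmetric sandwich product.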
AUTHOR'S PROOF JrnlID 10489 ArtID 1810 Proof#1 - 07/07/2020

Sijia Yang et al.

Corollary 5 (Statistical Convergence of De-sparsified Graphical Lasso [27]) Suppose the random vector X has d dimensions and zero mean (i.e., X ∈ R^d and E(X) = 0), the population covariance matrix is Σ = E(XX^⊤), and the inverse of the population covariance matrix is Θ = Σ^{−1}. Given N samples x_1, x_2, x_3, ..., x_N randomly and independently drawn from X, the sample estimate of the covariance matrix is Σ̄ = (1/N) Σ_{i=1}^{N} x_i x_i^⊤. The Graphical Lasso estimator and the de-sparsified estimator are denoted as Θ̂ and T̂, respectively. With an increasing number of samples (N) and a growing number of data dimensions (d), the de-sparsified Graphical Lasso estimator T̂ converges to the population estimate Θ in ℓ∞-norm under mild sparsity conditions, as follows [27]:

    ‖T̂ − Θ‖_∞ = O(√(log d / N)).    (9)

Note that the above convergence rate of the de-sparsified Graphical Lasso was obtained under sparsity assumptions similar to those of [17], while the ℓ2-norm and ℓ∞-norm convergence rates of the Graphical Lasso are not yet known.

Based on the notations above, we denote the De-sparsified Covariance-Regularized Discriminant Analysis (namely De-Sparse) as Desparse(x), using the de-sparsified Graphical Lasso T̂ and the mean vectors μ̄+, μ̄−:

    Desparse(x) = sign( log( (x^⊤ T̂ μ̄+ − ½ μ̄+^⊤ T̂ μ̄+ + log π+) / (x^⊤ T̂ μ̄− − ½ μ̄−^⊤ T̂ μ̄− + log π−) ) ).    (10)

With the de-sparsified inverse covariance matrix estimator T̂ enjoying better statistical properties, De-Sparse is expected to outperform CRDA with better classification accuracy. A detailed comparison is discussed in the following sections.

3.3 Performance analysis of LDA, CRDA, and De-Sparse

In this section, we first review previous studies on LDA error rate estimation for Gaussian data [14, 25]; we then generalize the LDA error rate from Gaussian data to non-Gaussian data. Finally, we provide a discussion of the classification accuracy comparison among vanilla LDA, CRDA, and De-Sparse.

3.3.1 LDA error rate for Gaussian data via random matrix theory

We first assume that the data for binary classification follow two (unknown) Gaussian distributions with the same covariance matrix Σ but two different means μ+ and μ−, i.e., N(μ+, Σ) for positive samples and N(μ−, Σ) for negative samples, respectively. Given the LDA classifier f_Σ̂(x) based on the sample-estimated mean vectors μ̄−, μ̄+ and a specific covariance matrix Σ̂, the expected error rate of linear discriminant analysis (i.e., the probability that l ≠ f_Σ̂(x)) on the data of N(μ+, Σ) and N(μ−, Σ) is modeled as follows.

Corollary 6 (RMT-based LDA Error Rate Estimation [14]) According to random matrix theory, [14] models the expected classification error rate of LDA (using the estimated parameters μ̄+, μ̄−, and Σ̂) on the Gaussian distributions N(μ+, Σ) and N(μ−, Σ) as ε(μ̄+, μ̄−, Σ̂, μ+, μ−, Σ), as follows:

    ε(μ̄+, μ̄−, Σ̂, μ+, μ−, Σ)
        = π+ · Φ( −(μ+ − (μ̄+ + μ̄−)/2)^⊤ Σ̂^{−1} (μ̄+ − μ̄−) / √((μ̄+ − μ̄−)^⊤ Σ̂^{−1} Σ Σ̂^{−1} (μ̄+ − μ̄−)) )
        + π− · Φ( (μ− − (μ̄+ + μ̄−)/2)^⊤ Σ̂^{−1} (μ̄+ − μ̄−) / √((μ̄+ − μ̄−)^⊤ Σ̂^{−1} Σ Σ̂^{−1} (μ̄+ − μ̄−)) ),    (11)

where Φ refers to the CDF of the standard normal distribution.

According to Corollary 6 and [14], we can conclude that the expected error rate is sensitive to the estimated parameters μ̄+, μ̄−, and Σ̂, while the true parameters μ+, μ−, and Σ are assumed unknown. Compared to the estimation of the (inverse) covariance matrix, the error of sample mean vector estimation is relatively small [53]. Thus, we adopt the settings of the studies [2, 5, 25] as follows.

Assumption 1 In this paper, we make no assumptions on the mean vectors μ+, μ−, μ and always use the sample means μ̄+, μ̄−, μ̄ to estimate μ+, μ−, μ. Even under HDLSS settings, with a certain number of samples, it is reasonable to assume that the sample estimates of the mean vectors μ̄+ and μ̄− are close to the population mean vectors, i.e., |μ+ − μ̄+| → 0, |μ− − μ̄−| → 0, and |μ − μ̄| → 0.

Lemma 1 Thus, based on Theorem 1 and the sample mean relaxation (Assumption 1), the expected error rate of f_Σ̂(x) can be reduced to

    ε(Σ̂, Σ) = Φ( −(μ̄+ − μ̄−)^⊤ Σ̂^{−1} (μ̄+ − μ̄−) / (2 √((μ̄+ − μ̄−)^⊤ Σ̂^{−1} Σ Σ̂^{−1} (μ̄+ − μ̄−))) ).    (12)

In this way, to improve the LDA classifier with the sample mean vectors, we need an estimator Σ̂ that minimizes or lowers the expected error rate ε(Σ̂, Σ).
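As a quick numeric illustration of (12) (our own sketch, not the authors' code), the following evaluates ε(Σ̂, Σ) on a made-up two-dimensional problem and checks that plugging in the oracle covariance recovers the optimal rate of (13), while a mis-specified estimator (here the identity) can only raise the error:

```python
import math
import numpy as np

def std_normal_cdf(x):
    # Phi, the standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lda_error_rate(sigma_hat, sigma, mu_p, mu_n):
    """Expected LDA error rate of Eq. (12), given the sample means mu_p, mu_n."""
    delta = mu_p - mu_n
    w = np.linalg.solve(sigma_hat, delta)        # Sigma_hat^{-1} (mu+ - mu-)
    return std_normal_cdf(-(delta @ w) / (2.0 * math.sqrt(w @ sigma @ w)))

sigma = np.array([[1.0, 0.3], [0.3, 1.0]])       # toy population covariance
mu_p, mu_n = np.array([1.0, 0.0]), np.array([-1.0, 0.0])

# With the oracle covariance, Eq. (12) collapses to the optimal rate of Eq. (13).
opt = lda_error_rate(sigma, sigma, mu_p, mu_n)
delta = mu_p - mu_n
assert math.isclose(
    opt, std_normal_cdf(-math.sqrt(delta @ np.linalg.solve(sigma, delta)) / 2))

# A mis-specified estimator can only increase the expected error rate.
assert lda_error_rate(np.eye(2), sigma, mu_p, mu_n) >= opt
```

The second assertion follows from the Cauchy–Schwarz inequality applied to the argument of Φ in (12).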
Improving covariance-regularized discriminant analysis...

Lemma 2 Furthermore, when the estimated covariance matrix Σ̂ is set to the oracle one Σ (i.e., the LDA is perfectly fitted to the data), the expected error rate reaches the optimal error rate

    ε(Σ, Σ) = Φ( −√((μ̄+ − μ̄−)^⊤ Σ^{−1} (μ̄+ − μ̄−)) / 2 ).    (13)

The above result suggests that when the covariance matrix Σ̂ is perfectly estimated, i.e., Σ̂ → Σ and Σ̂^{−1}Σ → I, the LDA classifier approaches its optimal error rate.

The estimate in (12) reduces the estimation of the LDA classification error rate to the divergence between the population covariance matrix Σ and the estimated one Σ̂. On the other hand, (13) models the generalization error of a model with "perfectly-fitted" covariances [14].

3.3.2 LDA error rate for non-Gaussian data via Kullback–Leibler divergence

We deliver the analysis of the LDA classification error rate by incorporating additional assumptions as follows.

Assumption 2 Suppose every positive sample x+ and every negative sample x− are realized from the random variables X+ and X−, respectively. We denote μ+ = E[X+] and μ− = E[X−] as the expectations of X+ and X−, respectively. In this way, we define Σ+* and Σ−* as the oracle covariance matrices of the two classes, such that

    Σ+* = E[(X+ − μ+)(X+ − μ+)^⊤] and
    Σ−* = E[(X− − μ−)(X− − μ−)^⊤].

We further denote the oracle between-class covariance matrix Σ* = (Σ+* + Σ−*)/2. To simplify our analysis, we further assume Σ ≈ Σ+* ≈ Σ−*.

Lemma 3 Considering the divergence of the Gaussian distributions N(μ+, Σ+*) and N(μ−, Σ−*) from the real data, we can bound the expected error rate of the LDA classifier f_Σ̂(x) (i.e., the LDA classifier based on the parameters μ̄+, μ̄−, Σ̂) as

    err(μ̄+, μ̄−, Σ̂) ≤ ε(Σ̂, Σ*) + Σ_{l∈{−1,+1}} π_l √( D_KL(X_l ‖ N(μ_l, Σ_l*)) / 2 ),    (14)

where D_KL(X_l ‖ N(μ_l, Σ_l*)) refers to the Kullback–Leibler divergence between the distribution of the data X_l and the Gaussian distribution N(μ̄_l, Σ_l*).

Proof To prove Lemma 3, we first define the error function 1[l ≠ f_Σ̂(x)], with 1[l ≠ f_Σ̂(x)] = 1 when l ≠ f_Σ̂(x) and 1[l ≠ f_Σ̂(x)] = 0 when l = f_Σ̂(x). Then the error rate of f_Σ̂(x), for any data distribution with density function P(x, l) and l ∈ {+1, −1}, can be written as

    err(μ̄+, μ̄−, Σ̂) = ∫ 1[l ≠ f_Σ̂(x)] dP(x, l).

Considering the error rate on the Gaussian distribution ε(Σ̂, Σ*), the Gaussian conditional probability P_{Σ_l*}(x|l), and the decomposition P(x, l) = P(x|l) · π_l, we obtain

    err(μ̄+, μ̄−, Σ̂) ≤ ε(Σ̂, Σ*) + Σ_{l∈{+1,−1}} π_l ∫_x |P(x|l) − P_{Σ_l*}(x|l)| dx.

Applying Pinsker's inequality to bound the divergence,

    err(μ̄+, μ̄−, Σ̂) ≤ ε(Σ̂, Σ*) + Σ_{l∈{−1,+1}} π_l √( D_KL(X_l ‖ N(μ_l, Σ_l*)) / 2 ).

Lemma 3 suggests that we can consider any distribution as the combination of its nearest Gaussian distribution (i.e., N(μ̄, Σ*)) and other non-Gaussian components [54]. Given an LDA classifier f_Σ̂, the error rate is upper-bounded by two factors: (1) the error rate of f_Σ̂ on the nearest Gaussian distribution of the data, i.e., ε(Σ̂, Σ*), and (2) the divergence between the data distribution and the Gaussian distribution. Considering Lemma 1, we can further conclude that two factors, namely the divergence of the given data from the Gaussian distribution, D_KL(P_l ‖ N(μ̄_l, Σ*)), and the statistical convergence of Σ̂^{−1} to Σ*^{−1}, affect the classification accuracy of LDA.

3.3.3 Performance comparisons

We compare the convergence rates of T̂ and Θ̂ to the inverse population covariance matrix Θ* = Σ*^{−1}, so as to understand the accuracy of De-Sparse and CRDA. Since the divergence of the datasets from the nearest Gaussian distributions is the same for both algorithms, we mainly compare the Gaussian error terms of De-Sparse and CRDA, i.e., ε(T̂^{−1}, Σ*) vs. ε(Θ̂^{−1}, Σ*). More specifically, [14] demonstrated the connections between the Gaussian-data error terms and the spectral properties of T̂ and Θ̂. Considering Lemma 2, we hope to understand (1) how close the matrices T̂Σ* and Θ̂Σ* come to the identity matrix I and (2) how the spectra of these matrices behave [55], such that

    ‖T̂Σ* − I‖_2 = ‖(T̂ − Θ*)Σ*‖_2 ≤ λ_max(Σ*) ‖T̂ − Θ*‖_2, and
    ‖Θ̂Σ* − I‖_2 = ‖(Θ̂ − Θ*)Σ*‖_2 ≤ λ_max(Σ*) ‖Θ̂ − Θ*‖_2,    (15)

where λ_max(·) refers to the largest eigenvalue of the input matrix.
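The identity and the sub-multiplicativity bound in (15) are easy to verify numerically; the sketch below uses a synthetic SPD Σ* and a randomly perturbed Θ* as a stand-in for T̂ (all toy values are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
sigma_star = A @ A.T + np.eye(4)                 # synthetic SPD Sigma*
theta_star = np.linalg.inv(sigma_star)           # Theta* = Sigma*^{-1}
t_hat = theta_star + 0.05 * rng.standard_normal((4, 4))  # perturbed stand-in for T

spec = lambda M: np.linalg.norm(M, 2)            # spectral (2-)norm
lam_max = np.linalg.eigvalsh(sigma_star)[-1]     # largest eigenvalue of Sigma*

lhs = spec(t_hat @ sigma_star - np.eye(4))       # ||T Sigma* - I||_2
mid = spec((t_hat - theta_star) @ sigma_star)    # ||(T - Theta*) Sigma*||_2
assert np.isclose(lhs, mid)                      # the equality in (15)
assert mid <= lam_max * spec(t_hat - theta_star) + 1e-12  # the inequality in (15)
```

The equality holds because T̂Σ* − I = (T̂ − Θ*)Σ*, and the inequality is the standard sub-multiplicativity of the spectral norm, with ‖Σ*‖₂ = λ_max(Σ*) for an SPD matrix.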

Obviously, the terms λ_max(Σ*), ‖T̂ − Θ*‖_2, and ‖Θ̂ − Θ*‖_2 are non-negative. When ‖T̂ − Θ*‖_2 → 0 and ‖Θ̂ − Θ*‖_2 → 0, the optimal error rates are achieved asymptotically. In this way, we ask whether T̂ converges to Θ* faster than Θ̂ in the spectral-norm distance.

Considering Corollaries 3 and 5, a sharp spectral-norm statistical convergence rate of the Graphical Lasso is not known in any of the previous studies [13, 17, 27, 56], while the spectral-norm statistical convergence rate of the de-sparsified Graphical Lasso can easily be derived from the ℓ∞-norm rate. In this way, we compare CRDA and De-Sparse through the ℓ2-norm statistical convergence rates of their inverse covariance matrix estimators. Thus, we can derive the ℓ2-norm convergence rate as

    ‖T̂ − Θ*‖_2 = O(√(d log d / N)).    (16)

On the other side, [17] demonstrated that the ℓ2-norm convergence rate of the Graphical Lasso estimator Θ̂ is

    ‖Θ̂ − Θ*‖_2 = O(√((d + s) log d / N)),

where s has been defined in Corollary 3. We conclude that the convergence rate of T̂ is faster than that of Θ̂. In this way, we consider that De-Sparse would outperform CRDA, as it adopts a better inverse covariance matrix estimator.

4 Experiments

In this section, we first introduce the design of the experiments that evaluate the superiority of the proposed De-Sparse framework. Then we present the experimental results, including the performance comparison among the De-Sparse framework, existing LDA baselines, and other predictive models, followed by a comparison of the inverse covariance matrix estimates that supports our theoretical analysis of De-Sparse.

4.1 Experiment setups

In this study, to evaluate De-Sparse, we use predictive analytics of diseases based on Electronic Health Records (EHR) data.

– Predictive Analytics of Diseases - Given N training samples (i.e., the EHR data of each patient) along with the corresponding labels, i.e., (x_1, l_1), (x_2, l_2), ..., (x_N, l_N), where l_i ∈ {−1, +1} refers to whether patient i is diagnosed with the target disease (i.e., a positive or a negative sample), the predictive analytics task is to determine whether a new patient will develop the target disease, via classification of the vector x as +1 (diagnosed as positive) or −1 (diagnosed as negative).
– Performance Metrics - To demonstrate the effectiveness of the predictive analytics of diseases, we compared all these methods with other competitors using Accuracy and F1-score. Specifically, Accuracy characterizes the proportion of patients who are accurately classified by the algorithms, while F1-score measures both the correctness and the completeness of the prediction. Of course, we also include other metrics, such as sensitivity and specificity, to evaluate the performance of predictive analytics with respect to medical concerns.
– Data for Evaluation - In the experiments, we use the de-identified EHR data of 200,000 students from ten U.S. universities [28]. Among all recorded diseases, we choose mental health disorders, including anxiety disorders, mood disorders, depression disorders, and other related disorders, as one targeted disease for early detection [57]. We represent each patient by his/her diagnosis-frequency vector based on the clustered codeset (d = 295).

Note that, to prepare the training and testing sets, we use the complete EHR data of the patients who have not been diagnosed with any of the mental health disorders (negative samples). For patients who have been diagnosed with any mental health disorder (positive samples), we collect their EHR data from the first visit up to the last visit that was 90 days before the diagnosis of the mental health disorder. Thus, we can simulate the early detection of diseases 90 days in advance.

4.2 Design of experiment

To understand the performance impact of De-Sparse beyond classic LDA, we first propose four LDA baseline approaches to compare against De-Sparse; then, three discriminative learning models are used for the comparison:

– LDA Derivatives: LDA, Shrinkage, DIAG, and Ye-LDA - The first three algorithms are all based on the common implementation of generalized Fisher's discriminant analysis listed in (2). Specifically, LDA uses the sample covariance estimate and inverts the covariance matrix using the pseudo-inverse [58] when the matrix inverse is not available; Shrinkage is based on LDA, using a sparse estimate of the sample covariance as follows,

    Σ_β = β Σ̄ + (1 − β) diag(Σ̄) and Θ_β = Σ_β^{−1},    (17)

where diag(Σ̄) refers to the diagonal matrix of the sample estimate Σ̄. DIAG is a special Shrinkage approach with β = 0.0. Ye-LDA is derived from [7, 58].
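The Shrinkage estimator of (17) is a one-liner; the sketch below (our own illustration, not the JSAT implementation) checks that β = 0 reduces to the DIAG baseline and β = 1 recovers the plain sample covariance:

```python
import numpy as np

def shrinkage_covariance(sigma_bar, beta):
    # Sigma_beta = beta * Sigma_bar + (1 - beta) * diag(Sigma_bar)   (Eq. 17)
    return beta * sigma_bar + (1.0 - beta) * np.diag(np.diag(sigma_bar))

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
sigma_bar = np.cov(X, rowvar=False)

# beta = 0.0 gives the DIAG baseline; beta = 1.0 gives back the sample covariance.
assert np.allclose(shrinkage_covariance(sigma_bar, 0.0), np.diag(np.diag(sigma_bar)))
assert np.allclose(shrinkage_covariance(sigma_bar, 1.0), sigma_bar)

# Theta_beta = Sigma_beta^{-1}; for beta < 1 the diagonal part keeps it invertible.
theta_beta = np.linalg.inv(shrinkage_covariance(sigma_bar, 0.25))
```

Invertibility for β < 1 follows because βΣ̄ is positive semi-definite and (1 − β)diag(Σ̄) is positive definite whenever all sample variances are positive.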

In our research, we focus on studying the improvement of LDA classification brought by (inverse) covariance matrix regularization; thus, we do not compare our method to linear-coefficient-regularized LDA classifiers [4, 10, 11] or to heuristic LDA derivations [6].
– Downstream Classifiers: Support Vector Machine (SVM), Logistic Regression (Logit. Reg.), and AdaBoost - Inspired by previous studies [21, 59] on EHR data mining, we use a linear binary SVM classifier with fine-tuned parameters as well as the Logistic Regression classifier. Further, we compare our algorithm to AdaBoost, where AdaBoost-10 refers to the AdaBoost classifier using 10 Logistic Regression instances and AdaBoost-50 leverages 50 instances.

With the seven baseline algorithms, we perform experiments with training sets of varying sizes and cross-validation. To train the classifiers, we randomly selected 50, 100, 150, 200, and 250 positive patients, and randomly selected the same number of negative patients. We then test the classifiers using a testing set with 1000 randomly selected positive patients and the same number of negative patients. Note that there is no overlap between a training set and its paired testing set. All algorithms used in our work are implemented with JSAT¹ and glasso in R².

4.3 Overall comparison

We include the comparison results of the De-Sparse evaluation in Tables 1, 2, 3, 4, and 5 for models learned from 50 ∼ 250 × 2 labeled samples, respectively. All experiments are done with cross-validation using random sampling without replacement, repeated 10 times. Specifically, we compare the performance under various experimental settings, such as varying parameters for model training and the number of days in advance for early detection (e.g., 30, 60, and 90 days). We carry out the experiments with varying Days in Advance settings so as to evaluate the performance of the algorithms for predictive analytics. As was addressed, we actually need to use the past EHR data (before the diagnoses of mental disorders) as the features for prediction. More specifically, for the positive samples in both the training and testing datasets, we backtracked their EHR data from their prediction dates. For every positive sample, the prediction date is set to 30, 60, or 90 days before the medicare visit at which the patient received his/her first diagnosis of "anxiety disorders, mood disorders, depression disorders, and other related disorders". In this way, we carry out the experiments in three categories according to the varying days in advance.

4.3.1 Comparisons on accuracy and F1-score

As can be seen from the results in Tables 1, 2, 3, 4, and 5, De-Sparse clearly outperforms the baseline algorithms in terms of overall accuracy and F1-score. Specifically, De-Sparse achieves an 18.6%–21.3% increase in accuracy and a 22.9%–32% increase in F1-score over LDA, and a 17.9% increase in accuracy and a 31.5%–40.6% increase in F1-score over DIAG. Compared to Shrinkage and CRDA, the accuracy and F1-score of De-Sparse in most parameter settings are 0.3%–18.9% and 0.14%–71.8% higher, respectively. Compared to SVM, Logistic Regression, and AdaBoost, De-Sparse achieves 2.3%–19.4% higher accuracy and 7.5%–43.5% higher F1-score. In this case, we can conclude that the classic LDA model cannot perform as well as many other predictive models, such as SVM and AdaBoost; however, De-Sparse significantly outperforms these methods in all settings. Thus, we can conclude that De-Sparse overall outperforms the baseline algorithms in all experimental settings. Note that, although De-Sparse outperforms CRDA only marginally, De-Sparse enjoys a tighter upper bound on the error rate.

4.3.2 Trade-off between sensitivity and specificity

We also intend to compare De-Sparse with the baseline methods with respect to medical needs. Specifically, in addition to accuracy and F1-score, we focus on two more metrics [60]:

– Sensitivity - In medical diagnosis, sensitivity measures the ability of the prediction algorithms to correctly identify the patients with the disease (true positive rate). More specifically, we estimate sensitivity as

    Sensitivity = #(Patients with the disease ∩ Patients predicted as positive) / #(Patients with the disease).    (18)

– Specificity - In contrast, the specificity metric characterizes the ability of the algorithms to correctly identify those without the disease (true negative rate). More specifically, we estimate specificity as

    Specificity = #(Patients without the disease ∩ Patients predicted as negative) / #(Patients without the disease).    (19)

Please also see Tables 1, 2, 3, 4, and 5. In terms of specificity, the baseline algorithms outperform De-Sparse in most of the cases. However, in terms of the sensitivity–specificity trade-off, De-Sparse on average gains 19.5% higher sensitivity while sacrificing 8.2% specificity, when

¹ [Link]
² [Link]
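Eqs. (18) and (19) reduce to simple counts over the label vectors; a self-contained sketch with made-up labels (not the EHR data):

```python
def sensitivity_specificity(y_true, y_pred):
    """Eq. (18)/(19): true-positive and true-negative rates over +1/-1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == +1 and p == +1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    pos = sum(1 for t in y_true if t == +1)
    neg = len(y_true) - pos
    return tp / pos, tn / neg

y_true = [+1, +1, +1, +1, -1, -1, -1, -1]   # 4 diseased, 4 healthy (toy data)
y_pred = [+1, +1, +1, -1, -1, -1, +1, +1]
sens, spec = sensitivity_specificity(y_true, y_pred)
assert (sens, spec) == (0.75, 0.5)
```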

Table 1 Performance comparison with training set: 50×2, testing set: 2000×2

Accuracy F1-Score Sensitivity Specificity

Days in Advance: 30
AdaBoost (×10) 0.637 ± 0.028 0.571 ± 0.057 0.491 ± 0.085 0.783 ± 0.053
AdaBoost (×50) 0.640 ± 0.024 0.570 ± 0.061 0.487 ± 0.093 0.792 ± 0.053
CRDA (λ = 1.0) 0.662 ± 0.017 0.692 ± 0.028 0.762 ± 0.069 0.563 ± 0.058
CRDA (λ = 10.0) 0.670 ± 0.017 0.713 ± 0.010 0.819 ± 0.023 0.520 ± 0.047
CRDA (λ = 100.0) 0.664 ± 0.020 0.713 ± 0.008 0.834 ± 0.033 0.494 ± 0.068
LDA 0.555 ± 0.026 0.565 ± 0.033 0.579 ± 0.048 0.531 ± 0.040
Logistic Regression 0.615 ± 0.055 0.469 ± 0.206 0.395 ± 0.200 0.835 ± 0.094
De-Sparse (λ = 1.0) 0.658 ± 0.019 0.677 ± 0.034 0.723 ± 0.073 0.592 ± 0.050
De-Sparse (λ = 10.0) 0.672 ± 0.015 0.713 ± 0.010 0.813 ± 0.025 0.532 ± 0.042
De-Sparse (λ = 100.0) 0.668 ± 0.018 0.714 ± 0.008 0.830 ± 0.026 0.506 ± 0.056
SVM 0.611 ± 0.026 0.619 ± 0.034 0.632 ± 0.050 0.590 ± 0.029
DIAG 0.568 ± 0.014 0.515 ± 0.026 0.460 ± 0.042 0.676 ± 0.046
Shrinkage (β = 0.25) 0.574 ± 0.014 0.538 ± 0.025 0.499 ± 0.041 0.649 ± 0.045
Shrinkage (β = 0.5) 0.560 ± 0.033 0.438 ± 0.220 0.413 ± 0.210 0.708 ± 0.152

Shrinkage (β = 0.75) 0.560 ± 0.025 0.480 ± 0.163 0.448 ± 0.158 0.672 ± 0.118

Days in Advance: 60
AdaBoost (×10) 0.646 ± 0.021 0.596 ± 0.054 0.531 ± 0.095 0.762 ± 0.057

AdaBoost (×50) 0.639 ± 0.027 0.569 ± 0.083 0.491 ± 0.111 0.788 ± 0.060
CRDA (λ = 1.0) 0.654 ± 0.016 0.690 ± 0.016 0.774 ± 0.067 0.535 ± 0.088
CRDA (λ = 10.0) 0.653 ± 0.019 0.706 ± 0.010 0.833 ± 0.053 0.474 ± 0.083
CRDA (λ = 100.0) 0.643 ± 0.024 0.701 ± 0.028 0.844 ± 0.098 0.443 ± 0.124
LDA 0.556 ± 0.028 0.550 ± 0.042 0.547 ± 0.072 0.565 ± 0.065


Logistic Regression 0.631 ± 0.031 0.535 ± 0.108 0.447 ± 0.132 0.814 ± 0.073
De-Sparse (λ = 1.0) 0.655 ± 0.012 0.675 ± 0.023 0.723 ± 0.070 0.587 ± 0.074
De-Sparse (λ = 10.0) 0.661 ± 0.016 0.708 ± 0.009 0.823 ± 0.051 0.499 ± 0.077
De-Sparse (λ = 100.0) 0.649 ± 0.021 0.705 ± 0.020 0.844 ± 0.082 0.454 ± 0.110

SVM 0.627 ± 0.019 0.625 ± 0.027 0.625 ± 0.053 0.629 ± 0.056



DIAG 0.565 ± 0.011 0.514 ± 0.046 0.468 ± 0.076 0.662 ± 0.072


Shrinkage (β = 0.25) 0.568 ± 0.012 0.530 ± 0.040 0.492 ± 0.069 0.644 ± 0.063
Shrinkage (β = 0.5) 0.567 ± 0.013 0.528 ± 0.038 0.489 ± 0.067 0.646 ± 0.059
Shrinkage (β = 0.75) 0.561 ± 0.025 0.477 ± 0.164 0.444 ± 0.163 0.677 ± 0.120

Days in Advance: 90

AdaBoost (×10) 0.627 ± 0.034 0.572 ± 0.063 0.507 ± 0.091 0.747 ± 0.054
AdaBoost (×50) 0.632 ± 0.035 0.575 ± 0.054 0.504 ± 0.077 0.759 ± 0.058
CRDA (λ = 1.0) 0.641 ± 0.018 0.663 ± 0.041 0.716 ± 0.106 0.566 ± 0.091
CRDA (λ = 10.0) 0.651 ± 0.018 0.693 ± 0.034 0.797 ± 0.093 0.505 ± 0.096
CRDA (λ = 100.0) 0.634 ± 0.040 0.675 ± 0.101 0.808 ± 0.188 0.459 ± 0.173
LDA 0.546 ± 0.025 0.532 ± 0.038 0.518 ± 0.058 0.574 ± 0.046
Logistic Regression 0.597 ± 0.058 0.423 ± 0.217 0.351 ± 0.207 0.843 ± 0.096
De-Sparse (λ = 1.0) 0.642 ± 0.022 0.663 ± 0.035 0.710 ± 0.078 0.574 ± 0.060
De-Sparse (λ = 10.0) 0.658 ± 0.016 0.696 ± 0.022 0.787 ± 0.073 0.528 ± 0.084
De-Sparse (λ = 100.0) 0.641 ± 0.031 0.683 ± 0.081 0.808 ± 0.164 0.475 ± 0.148
SVM 0.597 ± 0.034 0.600 ± 0.036 0.606 ± 0.047 0.587 ± 0.046
DIAG 0.568 ± 0.023 0.514 ± 0.048 0.464 ± 0.074 0.672 ± 0.066
Shrinkage (β = 0.25) 0.569 ± 0.020 0.530 ± 0.041 0.490 ± 0.065 0.648 ± 0.054
Shrinkage (β = 0.5) 0.565 ± 0.021 0.519 ± 0.041 0.473 ± 0.059 0.657 ± 0.044
Shrinkage (β = 0.75) 0.559 ± 0.019 0.511 ± 0.040 0.465 ± 0.061 0.653 ± 0.050

Table 2 Performance comparison with training set: 100×2, testing set: 2000×2

Accuracy F1-Score Sensitivity Specificity

Days in Advance: 30
AdaBoost (×10) 0.632 ± 0.029 0.541 ± 0.095 0.452 ± 0.117 0.812 ± 0.065
AdaBoost (×50) 0.631 ± 0.032 0.538 ± 0.099 0.447 ± 0.120 0.814 ± 0.062
CRDA (λ = 1.0) 0.674 ± 0.012 0.708 ± 0.019 0.792 ± 0.043 0.556 ± 0.029
CRDA (λ = 10.0) 0.675 ± 0.006 0.722 ± 0.008 0.844 ± 0.022 0.507 ± 0.017
CRDA (λ = 100.0) 0.664 ± 0.010 0.718 ± 0.004 0.858 ± 0.031 0.469 ± 0.048
LDA 0.594 ± 0.016 0.592 ± 0.019 0.591 ± 0.027 0.597 ± 0.018
Logistic Regression 0.593 ± 0.054 0.394 ± 0.200 0.305 ± 0.180 0.881 ± 0.075
De-Sparse (λ = 1.0) 0.674 ± 0.018 0.700 ± 0.025 0.765 ± 0.050 0.582 ± 0.026
De-Sparse (λ = 10.0) 0.681 ± 0.006 0.724 ± 0.006 0.838 ± 0.018 0.524 ± 0.020
De-Sparse (λ = 100.0) 0.668 ± 0.009 0.720 ± 0.006 0.854 ± 0.028 0.481 ± 0.041
SVM 0.636 ± 0.016 0.642 ± 0.024 0.655 ± 0.044 0.618 ± 0.025
DIAG 0.594 ± 0.019 0.562 ± 0.034 0.524 ± 0.050 0.663 ± 0.033
Shrinkage (β = 0.25) 0.600 ± 0.020 0.582 ± 0.031 0.559 ± 0.045 0.641 ± 0.022
Shrinkage (β = 0.5) 0.581 ± 0.044 0.467 ± 0.235 0.449 ± 0.228 0.714 ± 0.144

Shrinkage (β = 0.75) 0.599 ± 0.014 0.582 ± 0.020 0.559 ± 0.029 0.639 ± 0.022

Days in Advance: 60
AdaBoost (×10) 0.633 ± 0.024 0.537 ± 0.076 0.439 ± 0.110 0.827 ± 0.067

AdaBoost (×50) 0.623 ± 0.024 0.507 ± 0.065 0.396 ± 0.089 0.850 ± 0.052
CRDA (λ = 1.0) 0.676 ± 0.016 0.711 ± 0.015 0.797 ± 0.041 0.555 ± 0.052
CRDA (λ = 10.0) 0.672 ± 0.019 0.719 ± 0.015 0.837 ± 0.025 0.508 ± 0.039
CRDA (λ = 100.0) 0.668 ± 0.017 0.716 ± 0.013 0.838 ± 0.038 0.498 ± 0.054
LDA 0.603 ± 0.024 0.599 ± 0.026 0.595 ± 0.033 0.610 ± 0.032


Logistic Regression 0.613 ± 0.042 0.462 ± 0.164 0.362 ± 0.147 0.863 ± 0.069
De-Sparse (λ = 1.0) 0.679 ± 0.011 0.707 ± 0.014 0.776 ± 0.041 0.582 ± 0.043
De-Sparse (λ = 10.0) 0.676 ± 0.016 0.720 ± 0.012 0.834 ± 0.026 0.518 ± 0.039
De-Sparse (λ = 100.0) 0.671 ± 0.017 0.718 ± 0.012 0.838 ± 0.029 0.504 ± 0.045

SVM 0.644 ± 0.016 0.645 ± 0.020 0.650 ± 0.038 0.637 ± 0.037



DIAG 0.596 ± 0.015 0.562 ± 0.033 0.522 ± 0.058 0.670 ± 0.054


Shrinkage (β = 0.25) 0.600 ± 0.016 0.580 ± 0.024 0.554 ± 0.040 0.645 ± 0.038
Shrinkage (β = 0.5) 0.596 ± 0.035 0.532 ± 0.178 0.513 ± 0.174 0.680 ± 0.113
Shrinkage (β = 0.75) 0.596 ± 0.039 0.532 ± 0.179 0.513 ± 0.175 0.678 ± 0.115

Days in Advance: 90

AdaBoost (×10) 0.626 ± 0.022 0.519 ± 0.061 0.412 ± 0.093 0.840 ± 0.058
AdaBoost (×50) 0.631 ± 0.017 0.523 ± 0.056 0.413 ± 0.087 0.849 ± 0.053
CRDA (λ = 1.0) 0.674 ± 0.013 0.709 ± 0.020 0.796 ± 0.052 0.552 ± 0.047
CRDA (λ = 10.0) 0.674 ± 0.010 0.721 ± 0.006 0.845 ± 0.021 0.502 ± 0.034
CRDA (λ = 100.0) 0.666 ± 0.015 0.719 ± 0.006 0.856 ± 0.025 0.477 ± 0.052
LDA 0.605 ± 0.017 0.607 ± 0.026 0.612 ± 0.045 0.598 ± 0.028
Logistic Regression 0.611 ± 0.036 0.453 ± 0.130 0.345 ± 0.136 0.876 ± 0.067
De-Sparse (λ = 1.0) 0.675 ± 0.013 0.700 ± 0.026 0.764 ± 0.061 0.587 ± 0.045
De-Sparse (λ = 10.0) 0.682 ± 0.007 0.725 ± 0.007 0.840 ± 0.025 0.523 ± 0.030
De-Sparse (λ = 100.0) 0.669 ± 0.013 0.721 ± 0.006 0.853 ± 0.023 0.486 ± 0.046
SVM 0.632 ± 0.017 0.638 ± 0.023 0.649 ± 0.039 0.616 ± 0.026
DIAG 0.597 ± 0.015 0.574 ± 0.039 0.549 ± 0.072 0.644 ± 0.063
Shrinkage (β = 0.25) 0.593 ± 0.034 0.531 ± 0.179 0.517 ± 0.182 0.668 ± 0.120
Shrinkage (β = 0.5) 0.602 ± 0.015 0.589 ± 0.028 0.575 ± 0.053 0.628 ± 0.043
Shrinkage (β = 0.75) 0.599 ± 0.015 0.586 ± 0.025 0.570 ± 0.045 0.629 ± 0.037

Table 3 Performance comparison with training set: 150×2, testing set: 2000×2

Accuracy F1-Score Sensitivity Specificity

Days in Advance: 30
AdaBoost (×10) 0.615 ± 0.010 0.484 ± 0.033 0.363 ± 0.039 0.867 ± 0.024
AdaBoost (×50) 0.615 ± 0.007 0.482 ± 0.025 0.359 ± 0.032 0.871 ± 0.023
CRDA (λ = 1.0) 0.682 ± 0.008 0.723 ± 0.008 0.829 ± 0.021 0.534 ± 0.019
CRDA (λ = 10.0) 0.671 ± 0.013 0.721 ± 0.008 0.851 ± 0.016 0.490 ± 0.035
CRDA (λ = 100.0) 0.662 ± 0.014 0.718 ± 0.007 0.861 ± 0.020 0.464 ± 0.044
LDA 0.613 ± 0.012 0.611 ± 0.018 0.610 ± 0.038 0.615 ± 0.037
Logistic Regression 0.581 ± 0.045 0.352 ± 0.189 0.255 ± 0.142 0.908 ± 0.053
De-Sparse (λ = 1.0) 0.681 ± 0.009 0.712 ± 0.012 0.790 ± 0.028 0.572 ± 0.020
De-Sparse (λ = 10.0) 0.681 ± 0.007 0.727 ± 0.006 0.849 ± 0.013 0.512 ± 0.019
De-Sparse (λ = 100.0) 0.667 ± 0.013 0.720 ± 0.007 0.857 ± 0.020 0.478 ± 0.041
SVM 0.650 ± 0.012 0.660 ± 0.014 0.680 ± 0.024 0.620 ± 0.023
DIAG 0.619 ± 0.014 0.610 ± 0.031 0.600 ± 0.056 0.637 ± 0.037
Shrinkage (β = 0.25) 0.599 ± 0.051 0.500 ± 0.251 0.503 ± 0.256 0.696 ± 0.156
Shrinkage (β = 0.5) 0.611 ± 0.039 0.562 ± 0.189 0.566 ± 0.195 0.656 ± 0.121

Shrinkage (β = 0.75) 0.615 ± 0.009 0.611 ± 0.024 0.608 ± 0.051 0.623 ± 0.045

Days in Advance: 60
AdaBoost (×10) 0.625 ± 0.039 0.512 ± 0.131 0.424 ± 0.156 0.826 ± 0.081

AdaBoost (×50) 0.637 ± 0.024 0.554 ± 0.072 0.466 ± 0.113 0.809 ± 0.068
CRDA (λ = 1.0) 0.677 ± 0.017 0.717 ± 0.015 0.818 ± 0.028 0.536 ± 0.032
CRDA (λ = 10.0) 0.671 ± 0.012 0.721 ± 0.008 0.848 ± 0.022 0.494 ± 0.038
CRDA (λ = 100.0) 0.662 ± 0.014 0.718 ± 0.006 0.861 ± 0.031 0.463 ± 0.055
LDA 0.623 ± 0.014 0.621 ± 0.023 0.619 ± 0.040 0.627 ± 0.023


Logistic Regression 0.600 ± 0.054 0.412 ± 0.217 0.331 ± 0.195 0.869 ± 0.090
De-Sparse (λ = 1.0) 0.681 ± 0.016 0.711 ± 0.016 0.787 ± 0.033 0.574 ± 0.036
De-Sparse (λ = 10.0) 0.678 ± 0.011 0.724 ± 0.009 0.843 ± 0.017 0.513 ± 0.023
De-Sparse (λ = 100.0) 0.667 ± 0.014 0.720 ± 0.007 0.856 ± 0.028 0.477 ± 0.050

SVM 0.649 ± 0.017 0.654 ± 0.025 0.665 ± 0.042 0.633 ± 0.024



DIAG 0.615 ± 0.018 0.597 ± 0.032 0.574 ± 0.054 0.656 ± 0.045


Shrinkage (β = 0.25) 0.618 ± 0.018 0.605 ± 0.031 0.587 ± 0.051 0.649 ± 0.039
Shrinkage (β = 0.5) 0.608 ± 0.039 0.548 ± 0.184 0.533 ± 0.181 0.683 ± 0.110
Shrinkage (β = 0.75) 0.618 ± 0.015 0.602 ± 0.027 0.581 ± 0.045 0.655 ± 0.033

Days in Advance: 90

AdaBoost (×10) 0.630 ± 0.023 0.531 ± 0.075 0.436 ± 0.123 0.824 ± 0.082
AdaBoost (×50) 0.630 ± 0.023 0.534 ± 0.078 0.441 ± 0.126 0.820 ± 0.083
CRDA (λ = 1.0) 0.674 ± 0.012 0.708 ± 0.017 0.794 ± 0.045 0.553 ± 0.039
CRDA (λ = 10.0) 0.671 ± 0.011 0.720 ± 0.007 0.845 ± 0.021 0.498 ± 0.035
CRDA (λ = 100.0) 0.663 ± 0.013 0.718 ± 0.004 0.857 ± 0.025 0.470 ± 0.050
LDA 0.611 ± 0.020 0.610 ± 0.025 0.608 ± 0.039 0.614 ± 0.024
Logistic Regression 0.614 ± 0.045 0.463 ± 0.174 0.374 ± 0.180 0.853 ± 0.098
De-Sparse (λ = 1.0) 0.672 ± 0.018 0.693 ± 0.030 0.745 ± 0.065 0.600 ± 0.042
De-Sparse (λ = 10.0) 0.678 ± 0.010 0.722 ± 0.009 0.836 ± 0.026 0.521 ± 0.033
De-Sparse (λ = 100.0) 0.668 ± 0.010 0.720 ± 0.005 0.851 ± 0.022 0.485 ± 0.039
SVM 0.639 ± 0.015 0.645 ± 0.020 0.657 ± 0.035 0.622 ± 0.026
DIAG 0.610 ± 0.012 0.602 ± 0.022 0.590 ± 0.042 0.631 ± 0.031
Shrinkage (β = 0.25) 0.613 ± 0.011 0.608 ± 0.019 0.601 ± 0.036 0.626 ± 0.027
Shrinkage (β = 0.5) 0.602 ± 0.036 0.547 ± 0.183 0.540 ± 0.183 0.665 ± 0.114
Shrinkage (β = 0.75) 0.601 ± 0.036 0.545 ± 0.183 0.536 ± 0.182 0.665 ± 0.113

Table 4 Performance comparison with training set: 200×2, testing set: 2000×2

Accuracy F1-Score Sensitivity Specificity

Days in Advance: 30
AdaBoost (×10) 0.618 ± 0.026 0.485 ± 0.082 0.373 ± 0.115 0.863 ± 0.064
AdaBoost (×50) 0.618 ± 0.022 0.491 ± 0.064 0.377 ± 0.092 0.859 ± 0.052
CRDA (λ = 1.0) 0.688 ± 0.006 0.725 ± 0.007 0.824 ± 0.017 0.553 ± 0.016
CRDA (λ = 10.0) 0.680 ± 0.005 0.725 ± 0.005 0.847 ± 0.013 0.513 ± 0.013
CRDA (λ = 100.0) 0.669 ± 0.011 0.721 ± 0.003 0.855 ± 0.026 0.483 ± 0.047
LDA 0.637 ± 0.006 0.644 ± 0.010 0.655 ± 0.021 0.620 ± 0.020
Logistic Regression 0.598 ± 0.046 0.411 ± 0.175 0.313 ± 0.159 0.883 ± 0.070
De-Sparse (λ = 1.0) 0.686 ± 0.007 0.717 ± 0.007 0.794 ± 0.017 0.578 ± 0.019
De-Sparse (λ = 10.0) 0.684 ± 0.006 0.729 ± 0.005 0.850 ± 0.007 0.519 ± 0.010
De-Sparse (λ = 100.0) 0.673 ± 0.009 0.723 ± 0.004 0.852 ± 0.024 0.494 ± 0.038
SVM 0.660 ± 0.012 0.671 ± 0.012 0.693 ± 0.014 0.626 ± 0.015
DIAG 0.623 ± 0.013 0.603 ± 0.024 0.575 ± 0.041 0.671 ± 0.029
Shrinkage (β = 0.25) 0.628 ± 0.013 0.621 ± 0.023 0.610 ± 0.039 0.646 ± 0.024
Shrinkage (β = 0.5) 0.619 ± 0.042 0.565 ± 0.190 0.560 ± 0.190 0.678 ± 0.110
Shrinkage (β = 0.75) 0.633 ± 0.012 0.629 ± 0.019 0.624 ± 0.034 0.642 ± 0.022

Days in Advance: 60

AdaBoost (×10) 0.605 ± 0.023 0.445 ± 0.085 0.325 ± 0.074 0.885 ± 0.033
AdaBoost (×50) 0.616 ± 0.010 0.479 ± 0.038 0.356 ± 0.048 0.876 ± 0.032
CRDA (λ = 1.0) 0.684 ± 0.006 0.721 ± 0.006 0.818 ± 0.019 0.549 ± 0.023
CRDA (λ = 10.0) 0.674 ± 0.008 0.722 ± 0.006 0.844 ± 0.019 0.505 ± 0.026
CRDA (λ = 100.0) 0.673 ± 0.010 0.721 ± 0.006 0.845 ± 0.021 0.502 ± 0.035
LDA 0.626 ± 0.009 0.622 ± 0.013 0.616 ± 0.028 0.635 ± 0.031
Logistic Regression 0.589 ± 0.038 0.380 ± 0.151 0.270 ± 0.113 0.908 ± 0.038
De-Sparse (λ = 1.0) 0.684 ± 0.010 0.710 ± 0.012 0.773 ± 0.027 0.595 ± 0.023
De-Sparse (λ = 10.0) 0.682 ± 0.006 0.726 ± 0.007 0.844 ± 0.017 0.520 ± 0.014
De-Sparse (λ = 100.0) 0.675 ± 0.008 0.722 ± 0.006 0.843 ± 0.022 0.508 ± 0.031
SVM 0.651 ± 0.006 0.659 ± 0.010 0.675 ± 0.026 0.626 ± 0.028
DIAG 0.627 ± 0.012 0.615 ± 0.023 0.597 ± 0.045 0.657 ± 0.039
Shrinkage (β = 0.25) 0.618 ± 0.041 0.562 ± 0.188 0.553 ± 0.187 0.683 ± 0.111
Shrinkage (β = 0.5) 0.620 ± 0.040 0.565 ± 0.189 0.557 ± 0.187 0.683 ± 0.110
Shrinkage (β = 0.75) 0.616 ± 0.039 0.557 ± 0.186 0.544 ± 0.183 0.688 ± 0.109

Days in Advance: 90
AdaBoost (×10) 0.626 ± 0.033 0.507 ± 0.107 0.411 ± 0.153 0.840 ± 0.088
AdaBoost (×50) 0.632 ± 0.028 0.533 ± 0.092 0.441 ± 0.135 0.823 ± 0.080
CRDA (λ = 1.0) 0.682 ± 0.008 0.722 ± 0.008 0.825 ± 0.017 0.540 ± 0.020
CRDA (λ = 10.0) 0.664 ± 0.012 0.718 ± 0.006 0.856 ± 0.025 0.472 ± 0.044
CRDA (λ = 100.0) 0.656 ± 0.016 0.715 ± 0.005 0.865 ± 0.029 0.447 ± 0.058
LDA 0.631 ± 0.014 0.630 ± 0.018 0.631 ± 0.034 0.630 ± 0.032
Logistic Regression 0.605 ± 0.060 0.424 ± 0.232 0.353 ± 0.222 0.857 ± 0.107
De-Sparse (λ = 1.0) 0.684 ± 0.010 0.714 ± 0.014 0.789 ± 0.031 0.579 ± 0.020
De-Sparse (λ = 10.0) 0.676 ± 0.008 0.724 ± 0.004 0.852 ± 0.019 0.500 ± 0.030
De-Sparse (λ = 100.0) 0.658 ± 0.015 0.716 ± 0.005 0.863 ± 0.029 0.452 ± 0.057
SVM 0.657 ± 0.009 0.669 ± 0.015 0.693 ± 0.031 0.621 ± 0.024
DIAG 0.625 ± 0.013 0.614 ± 0.029 0.601 ± 0.055 0.648 ± 0.045
Shrinkage (β = 0.25) 0.627 ± 0.014 0.617 ± 0.030 0.604 ± 0.056 0.651 ± 0.043
Shrinkage (β = 0.5) 0.626 ± 0.013 0.616 ± 0.027 0.603 ± 0.051 0.650 ± 0.042
Shrinkage (β = 0.75) 0.626 ± 0.014 0.617 ± 0.023 0.604 ± 0.045 0.649 ± 0.043
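All LDA variants compared in Tables 3–5 (plain LDA, DIAG, Shrinkage, CRDA, and De-Sparse) apply the same linear discriminant rule and differ only in the precision-matrix estimate they plug in. The following minimal sketch illustrates that shared rule with the plain sample-based plug-in on synthetic two-class data; it is our own illustration, not the paper's pipeline, and all variable names are ours.

```python
import numpy as np

def lda_fit(X0, X1, theta_hat):
    """Return a scoring function d(x); d(x) > 0 predicts class 1 (equal priors).

    theta_hat is the plug-in precision (inverse covariance) estimate; swapping
    it out is the only difference between the LDA variants in Tables 3-5.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    w = theta_hat @ (mu1 - mu0)          # discriminant direction
    b = -0.5 * w @ (mu1 + mu0)           # midpoint threshold
    return lambda x: x @ w + b

rng = np.random.default_rng(1)
p, n = 5, 200
X0 = rng.normal(0.0, 1.0, size=(n, p))               # class 0
X1 = rng.normal(0.0, 1.0, size=(n, p)) + 0.8         # class 1, shifted mean

# Pooled sample covariance and its inverse (the plain-LDA plug-in).
pooled = np.vstack([X0 - X0.mean(0), X1 - X1.mean(0)])
theta_bar = np.linalg.inv(np.cov(pooled, rowvar=False))

score = lda_fit(X0, X1, theta_bar)
print("sensitivity:", (score(X1) > 0).mean())
print("false-alarm rate:", (score(X0) > 0).mean())
```

A regularized variant is obtained by replacing `theta_bar` with, e.g., a graphical-lasso precision estimate, leaving `lda_fit` unchanged.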

Sijia Yang et al.

Table 5 Performance comparison with training set: 250×2, testing set: 2000×2

Method Accuracy F1-Score Sensitivity Specificity

Days in Advance: 30
AdaBoost (×10) 0.620 ± 0.037 0.484 ± 0.110 0.380 ± 0.147 0.860 ± 0.076
AdaBoost (×50) 0.625 ± 0.033 0.499 ± 0.097 0.394 ± 0.138 0.856 ± 0.074
CRDA (λ = 1.0) 0.689 ± 0.010 0.726 ± 0.009 0.824 ± 0.021 0.553 ± 0.025
CRDA (λ = 10.0) 0.677 ± 0.012 0.722 ± 0.009 0.840 ± 0.020 0.513 ± 0.029
CRDA (λ = 100.0) 0.666 ± 0.014 0.719 ± 0.007 0.853 ± 0.027 0.479 ± 0.050
LDA 0.644 ± 0.009 0.645 ± 0.012 0.648 ± 0.023 0.640 ± 0.020
Logistic Regression 0.605 ± 0.057 0.424 ± 0.204 0.339 ± 0.200 0.870 ± 0.089
De-Sparse (λ = 1.0) 0.690 ± 0.007 0.719 ± 0.007 0.791 ± 0.022 0.589 ± 0.027
De-Sparse (λ = 10.0) 0.684 ± 0.009 0.726 ± 0.008 0.837 ± 0.015 0.531 ± 0.012
De-Sparse (λ = 100.0) 0.671 ± 0.012 0.721 ± 0.008 0.848 ± 0.026 0.494 ± 0.039
SVM 0.663 ± 0.013 0.673 ± 0.015 0.694 ± 0.024 0.632 ± 0.023
DIAG 0.633 ± 0.011 0.619 ± 0.028 0.599 ± 0.055 0.668 ± 0.046
Shrinkage (β = 0.25) 0.625 ± 0.044 0.569 ± 0.192 0.562 ± 0.193 0.689 ± 0.108
Shrinkage (β = 0.5) 0.626 ± 0.044 0.569 ± 0.192 0.561 ± 0.192 0.691 ± 0.106
Shrinkage (β = 0.75) 0.639 ± 0.011 0.633 ± 0.022 0.624 ± 0.039 0.653 ± 0.025

Days in Advance: 60

AdaBoost (×10) 0.635 ± 0.026 0.539 ± 0.087 0.449 ± 0.141 0.820 ± 0.091
AdaBoost (×50) 0.634 ± 0.027 0.536 ± 0.089 0.445 ± 0.144 0.823 ± 0.091
CRDA (λ = 1.0) 0.692 ± 0.006 0.729 ± 0.006 0.827 ± 0.014 0.557 ± 0.015
CRDA (λ = 10.0) 0.682 ± 0.008 0.730 ± 0.004 0.860 ± 0.019 0.504 ± 0.031
CRDA (λ = 100.0) 0.674 ± 0.014 0.726 ± 0.005 0.864 ± 0.025 0.483 ± 0.051
LDA 0.642 ± 0.011 0.643 ± 0.015 0.645 ± 0.025 0.638 ± 0.017
Logistic Regression 0.623 ± 0.048 0.489 ± 0.184 0.411 ± 0.195 0.835 ± 0.105
De-Sparse (λ = 1.0) 0.691 ± 0.008 0.717 ± 0.009 0.781 ± 0.019 0.601 ± 0.017
De-Sparse (λ = 10.0) 0.689 ± 0.004 0.733 ± 0.004 0.854 ± 0.013 0.524 ± 0.017
De-Sparse (λ = 100.0) 0.676 ± 0.013 0.727 ± 0.005 0.863 ± 0.024 0.488 ± 0.048
SVM 0.662 ± 0.008 0.668 ± 0.012 0.681 ± 0.023 0.642 ± 0.017
DIAG 0.634 ± 0.012 0.613 ± 0.026 0.582 ± 0.053 0.687 ± 0.049
Shrinkage (β = 0.25) 0.627 ± 0.044 0.565 ± 0.189 0.545 ± 0.185 0.709 ± 0.101
Shrinkage (β = 0.5) 0.642 ± 0.010 0.634 ± 0.015 0.620 ± 0.028 0.663 ± 0.028
Shrinkage (β = 0.75) 0.641 ± 0.010 0.636 ± 0.012 0.627 ± 0.021 0.655 ± 0.022

Days in Advance: 90
AdaBoost (×10) 0.633 ± 0.027 0.536 ± 0.089 0.447 ± 0.140 0.818 ± 0.086
AdaBoost (×50) 0.631 ± 0.026 0.535 ± 0.087 0.445 ± 0.137 0.818 ± 0.085
CRDA (λ = 1.0) 0.686 ± 0.006 0.721 ± 0.009 0.813 ± 0.029 0.558 ± 0.026
CRDA (λ = 10.0) 0.675 ± 0.007 0.720 ± 0.006 0.838 ± 0.021 0.512 ± 0.028
CRDA (λ = 100.0) 0.671 ± 0.009 0.719 ± 0.004 0.844 ± 0.028 0.497 ± 0.043
LDA 0.648 ± 0.009 0.648 ± 0.018 0.651 ± 0.037 0.644 ± 0.025
Logistic Regression 0.628 ± 0.028 0.520 ± 0.095 0.427 ± 0.146 0.828 ± 0.090
De-Sparse (λ = 1.0) 0.687 ± 0.009 0.713 ± 0.014 0.778 ± 0.033 0.597 ± 0.022
De-Sparse (λ = 10.0) 0.683 ± 0.006 0.725 ± 0.008 0.839 ± 0.021 0.527 ± 0.018
De-Sparse (λ = 100.0) 0.673 ± 0.008 0.720 ± 0.005 0.841 ± 0.024 0.505 ± 0.037
SVM 0.666 ± 0.009 0.672 ± 0.014 0.687 ± 0.030 0.644 ± 0.023
DIAG 0.635 ± 0.015 0.621 ± 0.030 0.601 ± 0.053 0.668 ± 0.032
Shrinkage (β = 0.25) 0.638 ± 0.012 0.631 ± 0.027 0.621 ± 0.051 0.656 ± 0.032
Shrinkage (β = 0.5) 0.642 ± 0.011 0.635 ± 0.026 0.626 ± 0.050 0.657 ± 0.032
Shrinkage (β = 0.75) 0.641 ± 0.010 0.635 ± 0.024 0.628 ± 0.046 0.655 ± 0.030
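The four columns reported in Tables 3–5 follow the standard confusion-matrix definitions of accuracy, F1-score, sensitivity, and specificity [60]. As a minimal sketch (our own helper function, not code from the paper), treating label 1 as the disease class:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, F1-score, sensitivity, specificity for binary labels (1 = disease)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false alarms
    fn = np.sum((y_true == 1) & (y_pred == 0))   # missed cases
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)                 # recall on the disease class
    specificity = tn / (tn + fp)                 # recall on the healthy class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f1, sensitivity, specificity
```

Raising sensitivity at the cost of specificity, as CRDA and De-Sparse do relative to plain LDA, corresponds to trading false alarms (fp) for fewer missed cases (fn).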
compared to typical LDA. On the opposite side of the trade-off, when compared to CRDA (based on graphical lasso), De-Sparse on average gains 2.3% higher specificity while sacrificing 1.4% sensitivity.

4.3.3 Discussion on the performance comparison

We consider testing accuracy and F1-score as the two primary metrics for the evaluation, as these two metrics well characterize the performance of classifiers. We thus conclude that De-Sparse overall outperforms the baseline algorithms, including LDA, SVM, Logistic Regression, and the other classifiers, in all experimental settings. In terms of the trade-off between sensitivity and specificity, we argue that De-Sparse still outperforms the original LDA classifier and the CRDA classifiers, considering the requirements of predictive analytics of diseases and early diagnosis. While the original LDA classifier well balances sensitivity and specificity, both CRDA and De-Sparse deliver slightly higher sensitivity than the original LDA while having lower specificity. In this way, CRDA and De-Sparse can discover more patients who potentially have the diseases, but also slightly raise the frequency of false alarms. We believe that, compared to the marginal increase in false alarms, the improvement in sensitivity should be appreciated in medical contexts. Comparing De-Sparse to CRDA, the de-sparsified Graphical Lasso helps De-Sparse achieve higher overall accuracy and F1-score with a more balanced pair of sensitivity and specificity.

4.4 Empirical convergence of parameter estimation

We hypothesize that De-Sparse improves LDA because the de-sparsified Graphical Lasso used in De-Sparse approaches the inverse of the population covariance matrix with a tighter error bound than the inverse of the sample covariance matrix used in simple LDA models, when the training sample size is limited. To verify this hypothesis, we compare the inverse covariance matrix estimators used in De-Sparse, CRDA, and the other LDA baselines, using the EHR data. Specifically, we (1) learned a "ground truth" covariance matrix Σ_GT (and its inverse Θ_GT = Σ_GT⁻¹) using diagnosis-frequency vectors of 10,000 patients (w/ and w/o the target disease, balanced) randomly retrieved from all patients of 22 U.S. university healthcare systems, (2) randomly selected another 50 to 250 samples (w/ and w/o the target disease, balanced) to train LDA, De-Sparse, CRDA and Shrinkage, (3) estimated the error between the inverse covariance matrix (denoted as Θ̂) learned in each classifier and the inverse of the "ground truth" covariance matrix Σ_GT, all in ℓ2-norm, and (4) further estimated the error reduction of Θ̂ relative to the inverse of the sample covariance estimate (i.e., Θ̄ = Σ̄⁻¹) as

R(Θ̂) = ‖Θ̄ − Θ_GT‖₂² − ‖Θ̂ − Θ_GT‖₂²,  (20)

where Θ̂ = Θ̃ for CRDA, Θ̂ = T for De-Sparse, and Θ̂ = Θ_β for Shrinkage LDA. We repeated the above steps (1)–(4) 100 times, and illustrate the average error reduction R(Θ̂) in Fig. 2a, with varying parameters and settings.

Fig. 2 Comparison on ℓ2-norm estimator error reduction of inverse covariance matrices estimation (the higher the better)

Figure 2a demonstrates that the estimators used in De-Sparse (T) and CRDA (Θ̃) outperform the sample estimation in all settings, while the DIAG and Shrinkage estimators (i.e., Θ_β with β = 0.0, 0.25, 0.5, and 0.75) may cause even higher estimation error (with negative error reduction) when the number of samples increases. Figure 2b illustrates the trend of error reduction with CRDA and De-Sparse. Though the difference between these two
algorithms is not visible at such a scale, we can observe that these two algorithms achieve the maximal error reduction when the number of samples is 150 in our experiments, while the error reduction is low when the number of samples is relatively small (50) or large (250). When the sample size is small, both the sample-based estimation (Θ̄) and the regularized estimations (T and Θ̃) work poorly, though T and Θ̃ still outperform Θ̄. With increasing sample size, the advantage of CRDA and De-Sparse becomes more and more significant. However, when the sample size is large, both the sample-based estimation and the regularized estimations converge well, and thus the error reduction becomes marginal.

5 Conclusion

In this paper, we study the long-standing problem of covariance-regularized discriminant analysis for classification under high-dimensional low-sample-size (HDLSS) settings. More specifically, we address applications to the predictive analytics of diseases using Electronic Health Records (EHRs) data and the common diagnosis-frequency data representation. To understand the performance of LDA, we extend the existing theory [14, 25] and propose a novel analytical model characterizing the error rate of LDA classification under the uncertainty of parameter estimation. Based on the analytical model, we propose De-Sparse, a novel LDA classifier using the de-sparsified Graphical Lasso. Our analysis shows that the proposed algorithm can outperform the existing covariance-regularized discriminant analysis (CRDA) based on the common Graphical Lasso. The experimental results on real-world Electronic Health Record (EHR) datasets show that De-Sparse outperforms all baseline algorithms. We interpret the comparison of results and demonstrate the advantage of the proposed methods in medical care settings. Further, the empirical studies on estimator comparison validate our analysis.

References

1. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, Hoboken
2. Peck R, Ness JV (1982) The use of shrinkage estimators in linear discriminant analysis. IEEE Trans Pattern Anal Mach Intell 5:530–537
3. Xiong H, Cheng W, Bian J, Hu W, Sun Z, Guo Z (2018) DBSDA: Lowering the bound of misclassification rate for sparse linear discriminant analysis via model debiasing. IEEE Trans Neural Netw Learn Syst 30(3):707–717
4. Buhlmann P, Van De Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin
5. Krzanowski WJ, Jonathan P, McCarthy WV, Thomas MR (1995) Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Appl Stat, pp 101–115
6. Belhumeur PN, Hespanha JP, Kriegman DJ (1996) Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. In: ECCV (1), vol 1064. Springer, pp 45–58
7. Ye J, Janardan R, Li Q (2004) Two-dimensional linear discriminant analysis. In: NIPS, Cambridge, MA, USA, pp 1569–1576
8. Tikhonov AN (1943) On the stability of inverse problems. In: Dokl. Akad. Nauk SSSR, vol 39, pp 195–198
9. Witten DM, Tibshirani R (2009) Covariance-regularized regression and classification for high dimensional problems. J Royal Stat Soc: Series B (Statistical Methodology) 71(3):615–636
10. Clemmensen L, Hastie T, Witten D, Ersbøll B (2011) Sparse discriminant analysis. Technometrics 53(4)
11. Shao J, Wang Y, Deng X, Wang S, et al. (2011) Sparse linear discriminant analysis by thresholding for high dimensional data. Ann Stat 39(2):1241–1265
12. Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3):432–441
13. Cai TT, Ren Z, Zhou HH, et al. (2016) Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics 10(1):1–59
14. Zollanvari A, Dougherty ER (2013) Random matrix theory in pattern classification: An application to error estimation. In: 2013 Asilomar Conference on Signals, Systems and Computers
15. Marčenko VA, Pastur LA (1967) Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik 1(4):457
16. Johnstone IM (2001) On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pp 295–327
17. Rothman AJ, Bickel PJ, Levina E, Zhu J, et al. (2008) Sparse permutation invariant covariance estimation. Electron J Stat 2:494–515
18. Yadav P, Steinbach M, Kumar V, Simon G (2018) Mining electronic health records (EHRs): a survey. ACM Computing Surveys (CSUR) 50(6):1–40
19. Wang F, Sun J (2015) PSF: A unified patient similarity evaluation framework through metric learning with weak supervision. IEEE J Biomed Health Informatics 19(3):1053–1060
20. Sun J, Wang F, Hu J, Edabollahi S (2012) Supervised patient similarity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter 14(1):16–24
21. Ng K, Sun J, Hu J, Wang F (2015) Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summit on Clinical Research Informatics (CRI)
22. Zhang J, Xiong H, Huang Y, Wu H, Leach K, Barnes L (2015) MSEQ: Early detection of anxiety and depression via temporal orders of diagnoses in electronic health data. In: 2015 International Conference on Big Data (Workshop), IEEE
23. Jensen S, SPSS UK (2001) Mining medical data for predictive and sequential patterns: PKDD 2001. In: Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases
24. Liu C, Wang F, Hu J, Xiong H (2015) Temporal phenotyping from longitudinal electronic health records: A graph based framework. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15. ACM, New York, pp 705–714
25. Lachenbruch PA, Mickey RM (1968) Estimation of error rates in discriminant analysis. Technometrics 10(1):1–11
26. Bian J, Barnes L, Chen G, Xiong H (2017) Early detection of diseases using electronic health records data and covariance-regularized linear discriminant analysis. In: IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), 2017
27. Jankova J, van de Geer S, et al. (2015) Confidence intervals for high-dimensional inverse covariance estimation. Electronic J Stat 9(1):1205–1229
28. Turner JC, Keller A (2015) College Health Surveillance Network: Epidemiology and health care utilization of college students at U.S. 4-year universities. Journal of American College Health, pp 530–538
29. Van Vleck TT, Elhadad N (2010) Corpus-based problem selection for EHR note summarization. In: AMIA Annual Symposium Proceedings, vol 2010, p 817. American Medical Informatics Association
30. Yu S, Berry D, Bisbal J (2011) Performance analysis and assessment of a TF-IDF based archetype-SNOMED-CT binding algorithm. In: 2011 24th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, pp 1–6
31. Shen F, Sohn S, Rastegar-Mojarad M, Liu S, Pankratz JJ, Hatton MA, Sowada N, Shrestha OK, Shurson SL, Liu H (2017) Populating physician biographical pages based on EMR data. AMIA Summits on Translational Science Proceedings 2017:522
32. Luhn HP (1957) A statistical approach to mechanized encoding and searching of literary information. IBM J Res Dev 1(4):309–317
33. Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation
34. Aizawa A (2003) An information-theoretic perspective of tf–idf measures. Information Processing & Management 39(1):45–65
35. Dubberke ER, Reske KA, McDonald LC, Fraser VJ (2006) ICD-9 codes and surveillance for Clostridium difficile-associated disease. Emerging Infectious Diseases 12(10):1576
36. Kowsari K, Meimandi KJ, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: A survey. Information 10(4):150
37. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J (2016) Multi-layer representation learning for medical concepts. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 1495–1504
38. Zhang J, Kowsari K, Harrison JH, Lobo JM, Barnes LE (2018) Patient2vec: A personalized interpretable deep representation of the longitudinal electronic health record. IEEE Access 6:65333–65346
39. Choi E, Bahadori MT, Le S, Stewart WF, Sun J (2017) GRAM: graph-based attention model for healthcare representation learning. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 787–795
40. Bai T, Zhang S, Egleston BL, Vucetic S (2018) Interpretable representation learning for healthcare via capturing disease progression through time. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 43–51
41. Ma T, Xiao C, Wang F (2018) Health-ATM: A deep architecture for multifaceted patient health record representation and risk prediction. In: Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, pp 261–269
42. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, et al. (2018) Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1(1):18
43. Shickel B, Tighe PJ, Bihorac A, Rashidi P (2017) Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Informatics 22(5):1589–1604
44. Solares JRA, Raimondi FED, Zhu Y, Rahimian F, Canoy D, Tran J, Gomes ACP, Payberah AH, Zottoli M, Nazarzadeh M, et al. (2020) Deep learning for electronic health records: A comparative review of multiple deep neural architectures. J Biomed Inform 101:103337
45. HCUP (2014) Appendix A - Clinical Classification Software - Diagnoses
46. Sun L, Zhang X, Qian Y, Xu J, Zhang S (2019) Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inform Sci 502:18–41
47. Sun L, Zhang X, Qian Y, Xu J, Zhang S, Tian Y (2019) Joint neighborhood entropy-based gene selection method with fisher score for tumor classification. Appl Intell 49(4):1245–1259
48. Chen L, Wang S (2012) Automated feature weighting in naive bayes for high-dimensional data classification. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp 1243–1252
49. Wan H, Wang H, Guo G, Wei X (2017) Separability-oriented subclass discriminant analysis. IEEE Trans Pattern Anal Mach Intell 40(2):409–422
50. Yang X, Jiang X, Tian C, Wang P, Zhou F, Fujita H (2020) Inverse projection group sparse representation for tumor classification: A low rank variation dictionary approach. Knowl.-Based Syst 196(21):105768
51. Xiao Q, Dai J, Luo J, Fujita H (2019) Multi-view manifold regularized learning-based method for prioritizing candidate disease miRNAs. Knowl.-Based Syst 175:118–129
52. Marozzi M (2015) Multivariate multidistance tests for high-dimensional low sample size case-control studies. Stat Med 34(9):1511–1526
53. Field C (1982) Small sample asymptotic expansions for multivariate m-estimates. Ann Stat, pp 672–689
54. Blanchard G, Kawanabe M, Sugiyama M, Spokoiny V, Müller K-R (2006) In search of non-gaussian components of a high-dimensional distribution. J Mach Learn Res 7(Feb):247–282
55. Zollanvari A, Braga-Neto UM, Dougherty ER (2011) Analytic study of performance of error estimators for linear discriminant analysis. IEEE Trans Signal Process 59(9):4238–4255
56. Banerjee O, El Ghaoui L, d'Aspremont A (2008) Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J Mach Learn Res 9(Mar):485–516
57. Kendler KS, Hettema JM, Butera F, Gardner CO, Prescott CA (2003) Life event dimensions of loss, humiliation, entrapment, and danger in the prediction of onsets of major depression and generalized anxiety. Arch Gen Psychiatry 60(8):789–796
58. Ye J, Janardan R, Park CH, Park H (2004) An optimization criterion for generalized discriminant analysis on undersampled problems. IEEE Trans Pattern Anal Mach Intell 26(8):982–994
59. Huang SH, LePendu P, Iyer SV, Tai-Seale M, Carrell D, Shah NH (2014) Toward personalizing treatment for depression: predicting diagnosis and severity. J Am Med Inform Assoc 21(6):1069–1075
60. Altman DG, Bland JM (1994) Diagnostic tests 1: Sensitivity and specificity. Br Med J 308(6943):1552

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Affiliations

Sijia Yang1 · Haoyi Xiong2 · Kaibo Xu3 · Licheng Wang1 · Jiang Bian4 · Zeyi Sun3

Sijia Yang
ysjhhh@[Link]

Haoyi Xiong
[Link]@[Link]

Kaibo Xu
xukaibo@[Link]

Licheng Wang
wanglc@[Link]

Jiang Bian
bj1119@[Link]

1 School of Cyberspace Security, State Key Laboratory of Networking and Switching, Beijing University of Posts and Telecommunications, Haidian, Beijing, China
2 Department of Computer Science, Missouri University of Science and Technology, Rolla, MO 65409, USA
3 Mininglamp Academy of Sciences, Mininglamp Technology, Beijing, 100084, China
4 Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, USA

AUTHOR QUERIES

AUTHOR PLEASE ANSWER ALL QUERIES:

Q1. Please check captured keywords if correct.
Q2. Please check affiliations if captured and presented correctly.
Q3. Please check Tables if presented correctly.
Q4. Please provide significance of bold entries found in Tables 1, 2, 3, 4, 5; otherwise, remove emphasis.
Q5. Author biographies and photographs are desired. Please provide if necessary.
