Applied Intelligence
Abstract

Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. The performance of classical LDA, however, degrades significantly on High Dimension Low Sample Size (HDLSS) data, due to the ill-posed inverse problem. Existing approaches for HDLSS data classification typically assume the data in question follow a Gaussian distribution and address the HDLSS classification problem with regularization. However, these assumptions are too strict to hold in many emerging real-life applications, such as enabling personalized predictive analysis using Electronic Health Records (EHRs) data collected from an extremely limited number of patients who have been diagnosed with or without the target disease for prediction. In this paper, we revisit the problem of predictive analysis of disease using personal EHR data and an LDA classifier. To fill the gap, we first study an analytical model that characterizes the accuracy of LDA for classifying data with arbitrary distribution. The model gives a theoretical upper bound of the LDA error rate that is controlled by two factors: (1) the statistical convergence rate of (inverse) covariance matrix estimators and (2) the divergence of the training/testing datasets from the fitted distributions. Hence, we can lower the error rate by balancing the two factors for better classification performance. Hereby, we further propose a novel LDA classifier De-Sparse that leverages De-sparsified Graphical Lasso to improve the estimation of LDA, and which outperforms state-of-the-art LDA approaches developed for HDLSS data. Such advances and effectiveness are further demonstrated by both theoretical analysis and extensive experiments on EHR datasets [Link]

Keywords Linear discriminant analysis · De-sparsified graphical lasso · Electronic health records
1 Introduction

Linear Discriminant Analysis (LDA) is a well-known technique for feature extraction and dimension reduction. It has been widely used in many applications [2, 3], such as face recognition, image retrieval, etc. Typically, LDA finds the projection directions such that, for the projected data, the between-class variance is maximized relative to the within-class variance, thus achieving maximum discrimination. An intrinsic limitation of classical LDA is that its objective function requires the nonsingularity of one of the scatter matrices. For many applications, such as microarray data analysis, all scatter matrices can be singular or ill-posed, since the data often have high dimension but low sample size. Many techniques have been proposed to handle such HDLSS problems. For example, Krzanowski et al. proposed a pseudo-inverse LDA to approximate the inverse covariance matrix when the sample covariance matrix is singular. However, the accuracy of pseudo-inverse LDA is usually low and not well guaranteed [5]. Another technique to alleviate this problem is a two-stage algorithm, i.e., PCA+LDA [6, 7]. More popularly, regularized LDA approaches have been proposed to solve the problem and improve the performance [8]. For example, researchers proposed a series of algorithms to regularize the covariance matrix estimation [2, 5, 9]. The regularized linear discriminant hyperplane was studied in [4, 10, 11]. All regularized LDA approaches intend to improve LDA through regularizing the estimation of key parameters used in LDA, such as the covariance matrix and/or the linear coefficients for discrimination.

Zeyi Sun
sunzeyi@[Link]

Extended author information available on the last page of the article.
AUTHOR'S PROOF JrnlID 10489 ArtID 1810 Proof#1 - 07/07/2020
One representative regularized LDA approach is Covariance Regularized Discriminant Analysis (CRDA), proposed in [9] based on the sparse inverse covariance estimation leveraging Graphical Lasso [12]. CRDA was originally proposed to estimate the inverse covariance matrix via a shrunken estimator, so as to achieve "superior prediction". Intuitively, by replacing the sample covariance matrix used in LDA with the regularized estimate, the HDLSS problem can be well addressed, since regularized estimators usually outperform the sample covariance matrix estimator [13]. To better elucidate the performance of LDA classifiers with uncertain covariance matrix estimates for Gaussian data classification, [14] studied a model of the error rate by matching the estimated vs. true covariance matrices, and the estimated vs. true means. While it is reasonable to assume that the estimated mean can easily converge to the population/true mean, the population/true covariance matrix is usually unknown and can be very different from the estimated one [13]. For example, the largest eigenvalue of the sample covariance matrix, which represents the principal component of the data distribution, is not consistent with the population one, and the eigenvectors of the sample covariance matrix can be almost orthogonal to the truth under HDLSS [15, 16]. Further, the data for classification are usually non-Gaussian. Thus, it is highly desirable to develop a new analytical model to characterize the error rate for data with arbitrary distribution (both Gaussian and non-Gaussian). Two "known factors" of covariance matrix estimators can serve such a model: the statistical convergence rates and the sparsity/density of (inverse) covariance matrix estimators [13]. The sparsity/density is already known once the estimator is specified, while the statistical convergence rates reflect the maximal error of estimation, and for some estimators, they are well bounded under certain assumptions, such as the spectral-norm convergence rate of Graphical Lasso [17].

Among a wide range of HDLSS data classification tasks, in this work, we focus on the problem of using LDA to classify EHR [19] for personalized predictive analytics of a target disease. EHRs play a critical role in modern health information management and service innovations. A patient's EHR contains his/her histories of medical visits, medications, diagnoses, treatment plans, allergies and so on, as shown in Fig. 1. Per each visit, a diagnosis record would be updated indicating the disease state, i.e., a set of codes referring to the diseases diagnosed at the time of visit. One significant feature is the interchangeability of EHR, as a standard protocol for medical/health data generation, storage and communication. The health information is built and managed by authorized institutions in a unified digital format (e.g., ICD-9/10, CPT-9/10 used in EHR standards) such that researchers and scientists can share and analyze the EHR data to enable innovative health services, such as providing computer-assisted diagnosis and offering medication advice. Among these services, predictive analytics of diseases (or namely early detection of diseases) using patients' past longitudinal health information in the EHR system has recently attracted significant attention from research communities.

There has been a series of works [19–24] that attempt to predict future diseases of patients through data mining techniques using EHR data. Prior literature usually first selected important features, such as diagnosis frequencies [19], pairwise diagnosis transitions [22], and graphs of diagnosis sequences [24], to represent the EHR data of the patients. Then, a wide range of supervised learning algorithms were adopted to build predictive models for early disease detection on top of the well-represented EHR data.

In this paper, we first propose a novel analytical model for the LDA error rate, based on the statistical convergence of (inverse) covariance matrix estimators and the divergence to the fitted Gaussian distributions. Guided by the proposed analytical model, we propose a novel LDA classifier leveraging (inverse) covariance matrix estimators with a faster convergence rate. We apply our model to a large-scale EHR dataset for the predictive analytics of diseases and demonstrate the advantage of the proposed algorithms over other state-of-the-art methods. Specifically, in this paper, we make the following contributions.

1. We revisited the problem of HDLSS data classification using LDA models and proposed a novel analytical model that upper-bounds the LDA error rate for both Gaussian and non-Gaussian data, with respect to the statistical error of covariance matrix estimation and the divergence between the fitted Gaussian distribution and the data.
2. On top of the analytical model, we proposed De-Sparse, which extends the well-known baseline approach, Covariance Regularized Discriminant Analysis (CRDA) [9, 26], using De-sparsified Graphical Lasso [27]. Theoretical analysis based on the proposed analytical model shows that De-Sparse can bound the maximal error rate under mild conditions. Compared to CRDA, De-Sparse could achieve a lower error rate, due to the faster convergence rate of De-sparsified Graphical Lasso.
3. To show the practical contribution of the proposed method, we evaluated De-Sparse extensively through experiments with large-scale, real-world EHR datasets [28]. In the experiments, we used De-Sparse to predict the risk of mental health disorders in college students from ten U.S. universities, using their EHR data from
Fig. 1 Medicare visits and Electronic Health Records (EHRs). EHRs of a patient consist of the records of diagnoses and treatments. In this example, m medical visits in total have taken place. For every medical visit, the patient would receive a set of ICD/CPT codes [18] referring to the diseases/treatments that have been diagnosed/carried out. One can enable the early diagnosis/detection of diseases through classifying the EHR data with big data and machine learning techniques.
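As a concrete illustration of how visit histories like those in Fig. 1 become machine-learnable inputs (the diagnosis-frequency representation discussed in Section 2.2), the sketch below maps a toy patient's visits to a fixed-length count vector. The codes and function name are made up for illustration; this is not the authors' pipeline.

```python
from collections import Counter

def diagnosis_frequency_vector(visits, codebook):
    """Count how often each diagnosis code appears across all of a patient's visits.

    `visits`: list of visits, each a list of ICD/CPT-style codes.
    `codebook`: fixed ordering of codes, defining the vector dimension d."""
    counts = Counter(code for visit in visits for code in visit)
    return [counts[c] for c in codebook]

# Toy patient mirroring Fig. 1: each visit yields a set of diagnosis codes.
visits = [["D1", "D2"], ["D2", "D3"], ["D2", "D5"]]
codebook = ["D1", "D2", "D3", "D4", "D5"]
vector = diagnosis_frequency_vector(visits, codebook)
```

With a clustered ICD-9 codebook, the same mapping yields the d = 295 dimensional vectors used later in the experiments.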
primary care visits. We compared De-Sparse with seven baseline algorithms, including other regularized LDA models and downstream classifiers. The evaluation results show that De-Sparse outperforms all baselines and further validates our theoretical analysis.

The paper is organized as follows. In Section 2, we review the problem of high-dimensional linear classification using LDA and the preliminaries of EHR-based predictive analytics. In Section 3, we first introduce covariance regularized discriminant analysis (CRDA) based on Graphical Lasso, then present the de-sparsified covariance regularized LDA algorithm and compare the two methods analytically. In Section 4, we evaluate the proposed algorithms on real-world datasets through extensive experiments. Finally, we conclude the paper in Section 5.

2 Preliminaries

2.1 Fisher's linear discriminant analysis

To use Fisher's Linear Discriminant Analysis (FDA), given N labeled data pairs (x1, l1), (x2, l2), ..., (xN, lN), where each xi (1 ≤ i ≤ N) refers to a d-dimensional vector, we first estimate the sample covariance matrix (a symmetric d × d matrix) using the maximum likelihood estimator:

Σ̄ = (1/N) ∑_{i=1}^{N} (xi − μ̄)(xi − μ̄)ᵀ,  (1)

where μ̄ refers to the d-dimensional mean vector of all N training samples (x1, l1), ..., (xN, lN). Then, μ̄+ and μ̄− are estimated as the mean vectors of the positive training samples and the negative training samples among the N training samples, respectively.

Corollary 1 (Fisher's Discriminant Analysis for Binary Classification [1]) Given all estimated parameters Σ̄, μ̄+, and μ̄−, the FDA model classifies a new data vector x by the rule

fΣ̄(x) = sign( (xᵀ Σ̄⁻¹ μ̄+ − ½ μ̄+ᵀ Σ̄⁻¹ μ̄+ + log π+) − (xᵀ Σ̄⁻¹ μ̄− − ½ μ̄−ᵀ Σ̄⁻¹ μ̄− + log π−) ),  (2)

where π+ and π− refer to the (foreknown) frequencies of positive samples and negative samples in the whole population.

To simplify the presentation based on the LDA paradigm in (2), we use the following notations. We denote fΣ̂(x) as an LDA classifier with a specific covariance matrix estimator Σ̂, using the sample estimated mean vectors μ̄− and μ̄+. When Σ̂ = Σ̄, the classifier fΣ̄(x) becomes the traditional Fisher's Linear Discriminant Analysis. When Σ̂ = Θ̂⁻¹, where Θ̂ is the Graphical Lasso estimator [12], fΘ̂⁻¹(x) refers to the covariance regularized LDA [9, 26].

Apparently, the performance of LDA depends on (1) whether the realistic training/testing datasets follow Gaussian distributions and (2) how the mean vectors and inverse covariance matrices are estimated from the datasets.
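The estimation step (1) and the classification rule (2) can be sketched in a few lines of NumPy. This is an illustrative implementation of the paradigm above (function names are ours), and the optional `precision` argument lets any inverse covariance estimator Σ̂⁻¹ be plugged in, matching the notation fΣ̂(x).

```python
import numpy as np

def fit_fda(X_pos, X_neg):
    """Estimate the FDA parameters of Eqs. (1)-(2): pooled sample
    covariance, class mean vectors, and log class frequencies."""
    X = np.vstack([X_pos, X_neg])
    N = X.shape[0]
    mu = X.mean(axis=0)
    # Eq. (1): maximum-likelihood sample covariance around the overall mean
    Sigma = (X - mu).T @ (X - mu) / N
    return {
        "Sigma": Sigma,
        "mu_pos": X_pos.mean(axis=0),
        "mu_neg": X_neg.mean(axis=0),
        "log_pi_pos": np.log(len(X_pos) / N),
        "log_pi_neg": np.log(len(X_neg) / N),
    }

def fda_predict(model, x, precision=None):
    """Eq. (2): sign of the difference of the two class discriminant scores.

    `precision` substitutes an alternative inverse covariance estimate
    (e.g., a Graphical Lasso estimate) for the inverse of Eq. (1)."""
    P = np.linalg.inv(model["Sigma"]) if precision is None else precision

    def score(mu, log_pi):
        return x @ P @ mu - 0.5 * mu @ P @ mu + log_pi

    d = score(model["mu_pos"], model["log_pi_pos"]) - \
        score(model["mu_neg"], model["log_pi_neg"])
    return 1 if d >= 0 else -1
```

On two well-separated Gaussian classes, the fitted rule labels points near each class mean with the corresponding sign.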
2.2 Electronic health records and predictive analytics of disease

Prior to learning a predictive analytic model for certain diseases, one needs to model the EHR data with a suitable data representation. The most simple yet effective way to represent EHR data is to use the diagnosis-frequency vector [29–31], which is similar to the Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) approach that deals with traditional NLP data [32–34]. Given each patient's EHR data (shown in Fig. 1), this representation method first retrieves the diagnosis codes [35] recorded during each visit. Inspired by Natural Language Processing (NLP) and text mining practices [36], researchers also proposed using deep learning based NLP approaches to embed EHR records for data representation learning [37–42]. For example, [38] studied "Patient2Vec", which embeds patients' past EHR records into vectors while preserving structural information for personalized predictive analysis. Bai et al. [40] focused on the interpolation and interpretability of EHR representation learning, where the authors balanced the performance of predictive analysis against the understanding of the longitudinal disease progress of each individual patient, both using the EHR data with the learned representation. Comprehensive surveys can be found in [18, 43, 44].

In our work, we follow the line of research that uses the diagnosis-frequency vector of each patient for EHR-based predictive analysis [29–31], as the diagnosis frequency in a certain duration can well characterize the health status of patients, while the coefficients of LDA can represent the significance of every diagnosis code. The frequency of each diagnosis appearing in all past visits (of the last two years) is counted, where each dimension of the vector corresponds to one diagnosis code that might exist in all past visits. Note that the dimension d ≥ 15,000 when using original ICD-9 codes, and d = 295 even when using clustered ICD-9 codes [45], while the number of samples for training N in our experiments is significantly smaller than d.

2.3 Discussion on preliminaries

In our work, we revisited linear discriminant analysis as a classifier and learner for High-Dimensional and Low Sample Size (HDLSS) settings. Indeed, many efforts have been made in the literature for HDLSS data classification. For example, in addition to LDA-type methods, a number of feature extraction or variable selection methods have been studied. Lin et al. [46] proposed a feature selection algorithm to classify high-dimensional gene expression data by incorporating neighborhood entropy-based uncertainty measures. Over the rough set, the same group of authors [47] adopted a joint feature selection approach that incorporates neighborhood entropy and the Fisher scores for tumor classification. Further, an automatic feature weighting paradigm has been proposed to select features for gene expression data classification [48]. These studies demonstrate that feature selection algorithms can significantly improve the accuracy of HDLSS data classification while avoiding the full set of features. The over-reduction problem of LDA has been studied in [49]. In addition to EHR data, similar regularized projection methods have been used for early diagnosis of diseases on biomedical health data [50, 51].

In terms of methodologies, the works closest to this study are covariance-regularized linear discriminant analysis (CRDA) [9], Graphical Lasso [17], and the de-sparsified Graphical Lasso [27]. CRDA regularizes the estimation of (inverse) covariance matrices inside the estimation of LDA, improving the performance of LDA for both prediction and inference. The authors of [26] were the first to bring CRDA to EHR classification and early detection of diseases. We included the algorithms in [26] for comparison and found that De-Sparse outperformed CRDA with higher accuracy and F1-score. Compared to the Graphical Lasso estimator [17] that has been frequently used to enhance inverse covariance matrix estimation, our work follows the ideas of the de-biased estimator [52] and uses the de-sparsified Graphical Lasso estimator [27] to improve LDA for EHR classification. We provide a comprehensive discussion on the performance comparison between Graphical Lasso and the de-sparsified one from the perspective of predictive analytics based on EHR data and LDA.

3 Algorithms

In this section, we first introduce the baseline algorithm based on Covariance-Regularized Discriminant Analysis (CRDA) that is derived from [26]. Then, we present the proposed algorithm De-Sparse, an extended Covariance Regularized Discriminant Analysis via De-sparsified Graphical Lasso [27]. Finally, using our proposed analytical model of the LDA error rate, we compare the two methods and demonstrate the advantages of De-Sparse.

3.1 CRDA: The baseline of covariance-regularized discriminant analysis via graphical lasso inverse covariance matrix estimator

Compared to the sample LDA introduced in Section 2, CRDA [9, 26] was proposed to use the ℓ₁-penalized inverse
covariance matrix estimator to replace the inverse of the sample covariance matrix. Given the labeled data pairs for training (x1, l1), (x2, l2), ..., (xN, lN), the algorithm first estimates the sample covariance matrix Σ̄ and the sample mean vectors μ̄+, μ̄− using the algorithms introduced in Section 2.1. With the sample covariance matrix Σ̄, this method estimates a sparse inverse covariance matrix Θ̂ using the Graphical Lasso estimator [12] as follows.

Corollary 2 (Graphical Lasso Estimator [12]) Given the sample estimation of the covariance matrix Σ̄, the Graphical Lasso estimator provides an ℓ₁-regularized sparse positive-definite approximation to the inverse covariance matrix (denoted as Θ̂) as follows:

Θ̂ = argmin_{Θ≻0} ( tr(Σ̄Θ) − log det(Θ) + λ ∑_{j≠k} |Θ_jk| ),  (3)

where Θ≻0 refers to the constraint of symmetric positive definiteness (SPD), the term tr(Σ̄Θ) − log det(Θ) refers to the negative log-likelihood of the optimization objective Θ over the sample estimate Σ̄, the term ∑_{j≠k} |Θ_jk| refers to the sum of absolute values of the off-diagonal elements of the matrix Θ (which is the same as the ℓ₁-norm of Θ with the diagonal elements excluded), and λ refers to a tuning parameter that trades off between sparsity and the fit to the samples. Please refer to [12] for the implementation of the algorithm.

Corollary 3 (Statistical Convergence of Graphical Lasso [17]) Suppose the random vector X has d dimensions and zero mean (i.e., X ∈ R^d and E(X) = 0), where the population covariance matrix is Σ = E(XXᵀ) and the inverse of the population covariance matrix is Θ = Σ⁻¹. Given N samples x1, x2, ..., xN randomly and independently drawn from X, the sample estimate of the covariance matrix here should be Σ̄ = (1/N) ∑_{i=1}^{N} xi xiᵀ. With an increasing number of samples (N) and a growing number of dimensions of the data (d), the Graphical Lasso estimate Θ̂ based on the sample covariance matrix converges to the population estimate Θ in Frobenius norm under mild sparsity conditions, as follows [17]:

‖Θ̂ − Θ‖_F = O( √( (d + s) log d / N ) ),  (4)

where s = max_{1≤i≤d} ‖Θ_i‖₀ refers to the maximal degree of the graph in Θ, ‖·‖₀ refers to the ℓ₀-norm of the input vector, and Θ_i refers to the i-th column vector (1 ≤ i ≤ d) of the matrix Θ.

To this end, the classification rule of CRDA is written as follows:

CRDA(x) = sign( (xᵀ Θ̂ μ̄+ − ½ μ̄+ᵀ Θ̂ μ̄+ + log π+) − (xᵀ Θ̂ μ̄− − ½ μ̄−ᵀ Θ̂ μ̄− + log π−) ),  (5)

which can be viewed as an LDA classifier using Θ̂⁻¹ as the covariance matrix. Apparently, the accuracy of CRDA depends on how the covariance matrices and the mean vectors are estimated. We interpret the performance of CRDA in Section 3.3.

3.2 De-Sparse: The improved algorithm of covariance-regularized LDA via de-sparsified graphical lasso

As shown in (3), the estimator of the sparse inverse covariance matrix induces ℓ₁-penalization and might hurt the estimation due to over-penalization or over-sparsification. To address this issue, we propose to use the de-sparsified Graphical Lasso estimator [27] in place of the vanilla Graphical Lasso.

Corollary 4 (De-sparsified Graphical Lasso [27]) Given the Graphical Lasso estimator Θ̂ and the sample estimation Σ̄, we consider the inverse of the Graphical Lasso estimate Θ̂⁻¹ as an approximation to the covariance matrix. In this way, the bias of Θ̂⁻¹ for covariance estimation, caused by the sparsity regularizer of Graphical Lasso, can be written as follows:

Ẑ = Σ̄ − Θ̂⁻¹,  (6)
Bias(Σ̄, Θ̂) = Θ̂ Ẑ Θ̂ = Θ̂ Σ̄ Θ̂ − Θ̂.  (7)

The de-sparsified Graphical Lasso estimator T̂ is then defined by removing from the Graphical Lasso estimator Θ̂ the potential bias term caused by the sparsity regularizer, as follows:

T̂ = Θ̂ − Bias(Σ̄, Θ̂) = 2Θ̂ − Θ̂ Σ̄ Θ̂.  (8)

On top of the Graphical Lasso, the de-sparsified Graphical Lasso estimator can efficiently approximate the inverse covariance matrix from the data with a faster convergence rate under mild conditions.

Corollary 5 (Statistical Convergence of De-sparsified Graphical Lasso [27]) Suppose the random vector X has d dimensions and zero mean (i.e., X ∈ R^d and E(X) = 0), where the population covariance matrix is Σ = E(XXᵀ) and the inverse of the population covariance matrix is Θ = Σ⁻¹. Given N samples x1, x2, ..., xN randomly and independently drawn from X, the sample estimate of the covariance matrix here should be Σ̄ = (1/N) ∑_{i=1}^{N} xi xiᵀ.
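To make the two estimators concrete, the sketch below computes a Graphical Lasso estimate Θ̂ of Eq. (3) (here via scikit-learn's GraphicalLasso, one off-the-shelf implementation; the paper's experiments may use a different one) and then removes its ℓ₁-induced bias as in Eq. (8) to obtain T̂. The function name is ours, for illustration only.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def desparsified_glasso(X, alpha=0.2):
    """Return (Theta_hat, T_hat): the Graphical Lasso precision estimate of
    Eq. (3) and its de-sparsified version of Eq. (8)."""
    N = X.shape[0]
    mu = X.mean(axis=0)
    Sigma_bar = (X - mu).T @ (X - mu) / N          # sample covariance, Eq. (1)
    # l1-regularized sparse precision estimate, Eq. (3)
    Theta = GraphicalLasso(alpha=alpha).fit(X).precision_
    # Eq. (8): T = 2*Theta - Theta @ Sigma_bar @ Theta removes the bias term
    T = 2 * Theta - Theta @ Sigma_bar @ Theta
    return Theta, T
```

Plugging Θ̂ into rule (5) yields CRDA, while plugging T̂ in instead yields the De-Sparse classifier described below.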
The Graphical Lasso estimator and the de-sparsified estimator are denoted as Θ̂ and T̂, respectively. With an increasing number of samples (N) and a growing number of dimensions of the data (d), the de-sparsified Graphical Lasso estimator T̂ converges to the population estimate Θ in ℓ∞-norm under mild sparsity conditions, as follows [27]:

‖T̂ − Θ‖∞ = O( √( log d / N ) ).  (9)

Note that the above convergence rate of the de-sparsified Graphical Lasso was obtained under sparsity assumptions similar to [17], while the ℓ₂-norm and ℓ∞-norm convergence rates of Graphical Lasso are not known yet.

Based on the notations above, we denote the De-sparsified Covariance Regularized Discriminant Analysis (namely De-Sparse) as Desparse(x), using the de-sparsified Graphical Lasso estimator T̂ and the mean vectors μ̄+, μ̄−, i.e., Desparse(x) is obtained by replacing Θ̂ in (5) with T̂. With T̂ enjoying better statistical properties, De-Sparse is expected to outperform CRDA with better classification accuracy. A detailed comparison is discussed in the following subsections.

3.3 Performance analysis of LDA, CRDA, and De-Sparse

In this section, we first review the previous studies on LDA error rate estimation for Gaussian data [14, 25], then we generalize the LDA error rate from Gaussian data to non-Gaussian data. Finally, we provide a discussion on the classification accuracy comparison among vanilla LDA, CRDA, and De-Sparse.

3.3.1 LDA error rate for Gaussian data via random matrix theory

We first assume the data for binary classification follow two (unknown) Gaussian distributions with the same covariance matrix Σ but two different means μ+ and μ−, i.e., N(μ+, Σ) for positive samples and N(μ−, Σ) for negative samples, respectively. Given the LDA classifier fΣ̂(x) based on the sample estimated mean vectors μ̄−, μ̄+ and a specific covariance matrix estimate Σ̂, the expected error rate of linear discriminant analysis (i.e., the probability of l ≠ fΣ̂(x)) on the data of N(μ+, Σ) and N(μ−, Σ) is modeled as follows.

Corollary 6 (RMT-based LDA Error Rate Estimation [14]) According to random matrix theory, [14] models the expectation of the classification error rate of LDA (using estimated parameters μ̄+, μ̄−, and Σ̂) on the Gaussian distributions N(μ+, Σ) and N(μ−, Σ) as ε(μ̄+, μ̄−, Σ̂, μ+, μ−, Σ), as follows:

ε(μ̄+, μ̄−, Σ̂, μ+, μ−, Σ)
  = π+ · Φ( − (μ+ − (μ̄+ + μ̄−)/2)ᵀ Σ̂⁻¹ (μ̄+ − μ̄−) / √( (μ̄+ − μ̄−)ᵀ Σ̂⁻¹ Σ Σ̂⁻¹ (μ̄+ − μ̄−) ) )
  + π− · Φ( (μ− − (μ̄+ + μ̄−)/2)ᵀ Σ̂⁻¹ (μ̄+ − μ̄−) / √( (μ̄+ − μ̄−)ᵀ Σ̂⁻¹ Σ Σ̂⁻¹ (μ̄+ − μ̄−) ) ),  (11)

where Φ refers to the CDF of the standard normal distribution.

To simplify the analysis, we adopt the settings in the studies [2, 5, 25] as follows.

Assumption 1 In this paper, we make no assumptions on the mean vectors μ+, μ−, μ and always use the sample estimates μ̄+, μ̄−, and μ̄ in their place. Under HDLSS settings, with a certain number of samples, it is reasonable to assume that the sample estimates of the mean vectors μ̄+ and μ̄− should be close to the population mean vectors, i.e., |μ+ − μ̄+| → 0, |μ− − μ̄−| → 0, and |μ − μ̄| → 0.

Lemma 1 Based on Corollary 6 and the sample mean relaxation (Assumption 1), the expected error rate of fΣ̂(x) can be reduced to

ε(Σ̂, Σ) = Φ( − (μ̄+ − μ̄−)ᵀ Σ̂⁻¹ (μ̄+ − μ̄−) / ( 2 √( (μ̄+ − μ̄−)ᵀ Σ̂⁻¹ Σ Σ̂⁻¹ (μ̄+ − μ̄−) ) ) ).  (12)

In this way, to improve the LDA classifier with the sample mean vectors, one needs an estimator Σ̂ that minimizes or lowers the expected error rate ε(Σ̂, Σ).

Lemma 2 Furthermore, when the estimated covariance matrix Σ̂ is set to the oracle one Σ (the LDA is perfectly fitted with the data), the expected error rate reaches the optimal error rate

ε(Σ, Σ) = Φ( − √( (μ̄+ − μ̄−)ᵀ Σ⁻¹ (μ̄+ − μ̄−) ) / 2 ).  (13)
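Eq. (12) and its special case (13) are easy to evaluate numerically. The helper below, written purely for illustration, computes ε(Σ̂, Σ) for given mean estimates and covariances, with Φ implemented via the error function.

```python
import numpy as np
from math import erf, sqrt

def std_normal_cdf(z):
    """Phi: CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def lda_error_rate(mu_pos, mu_neg, Sigma_hat_inv, Sigma):
    """Expected LDA error rate of Eq. (12), given a precision estimate
    Sigma_hat_inv and the population covariance Sigma."""
    diff = mu_pos - mu_neg
    num = diff @ Sigma_hat_inv @ diff
    den = 2.0 * np.sqrt(diff @ Sigma_hat_inv @ Sigma @ Sigma_hat_inv @ diff)
    return std_normal_cdf(-num / den)
```

With Σ̂ = Σ = I and μ̄± = ±e₁, the quadratic form equals 4, so the rate reduces to Φ(−1) ≈ 0.159, matching Eq. (13).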
The above result suggests that when the covariance matrix Σ is perfectly estimated, i.e., Σ̂ → Σ and Σ̂⁻¹Σ → I, the LDA classifier approaches its optimal error rate.

The estimate in (12) reduces the estimation of the LDA classification error rate to the divergence between the population covariance matrix Σ and the estimated one Σ̂. On the other hand, (13) models the generalization error of a model with "perfectly-fitted" covariances [14].

3.3.2 LDA error rate for non-Gaussian data via Kullback-Leibler divergence

We deliver the analysis of the LDA classification error rate by incorporating additional assumptions as follows.

Assumption 2 Suppose every positive sample x+ and negative sample x− is realized from random variables X+ and X− respectively. We denote μ+ = E[X+] and μ− = E[X−] as the expectations of X+ and X− respectively. In this way, we define Σ+* and Σ−* as the oracle covariance matrices for the two classes respectively, such that

Σ+* = E[(X+ − μ+)(X+ − μ+)ᵀ] and Σ−* = E[(X− − μ−)(X− − μ−)ᵀ].

We further assume Σ* ≈ Σ+* ≈ Σ−*.

Lemma 3 Given an LDA classifier fΣ̂ with estimated parameters μ̄+, μ̄−, and Σ̂, its error rate err(μ̄+, μ̄−, Σ̂) on the data distribution can be upper bounded as:

err(μ̄+, μ̄−, Σ̂) ≤ ε(Σ̂, Σ*) + ∑_{l∈{−1,+1}} π_l √( D_KL(X_l ‖ N(μ_l, Σ_l*)) / 2 ),  (14)

where D_KL(X_l ‖ N(μ_l, Σ_l*)) refers to the Kullback–Leibler divergence between the distribution of the data X_l and the Gaussian distribution N(μ̄_l, Σ_l*).

Proof To prove Lemma 3, we first define the error function 1[l ≠ fΣ̂(x)] = 1 when l ≠ fΣ̂(x), and 1[l ≠ fΣ̂(x)] = 0 when l = fΣ̂(x). Then, the error rate of fΣ̂(x), for any data distribution with density function P(x, l) and l ∈ {+1, −1}, can be written as:

err(μ̄+, μ̄−, Σ̂) = ∫ 1[l ≠ fΣ̂(x)] dP(x, l).

Consider the error rate on the Gaussian distribution ε(Σ̂, Σ*), the conditional probability based on the Gaussian distribution P_{Σ*}^l(x|l), and the decomposition P(x, l) = P(x|l) · π_l. Since the indicator function is bounded by 1, switching the data distribution to its Gaussian counterpart changes the error rate by at most the total variation distance:

err(μ̄+, μ̄−, Σ̂) ≤ ε(Σ̂, Σ*) + ∑_{l∈{+1,−1}} π_l ∫_x | P(x|l) − P_{Σ*}^l(x|l) | dx.

Consider Pinsker's inequality to bound the total variation by the divergence:

err(μ̄+, μ̄−, Σ̂) ≤ ε(Σ̂, Σ*) + ∑_{l∈{−1,+1}} π_l √( D_KL(X_l ‖ N(μ_l, Σ_l*)) / 2 ),

which yields (14).

Lemma 3 suggests that we can consider any distribution as the combination of its nearest Gaussian distribution (i.e., N(μ̄, Σ*)) and other non-Gaussian components [54]. Given an LDA classifier fΣ̂, the error rate is upper bounded by two factors: (1) the error rate of fΣ̂ on the nearest Gaussian distribution of the data, i.e., ε(Σ̂, Σ*), and (2) the divergence between the data distribution and the Gaussian distribution. Considering Lemma 1, we can further conclude that two factors, the divergence of the given data to the Gaussian distribution D_KL(P_l ‖ N(μ̄_l, Σ*)) and the statistical convergence of Σ̂⁻¹ to Σ*⁻¹, affect the classification accuracy of LDA.

3.3.3 Performance comparisons

In this subsection, we use the proposed analytical model to understand the accuracy of De-Sparse and CRDA. Both De-Sparse and CRDA are LDA classifiers with estimated parameters μ̄+, μ̄−, and Σ̂; specifically, [14] demonstrated the connections between the Gaussian data error terms and the spectral properties of T̂ and Θ̂. Considering Lemma 2, we hope to understand (1) how close the matrices T̂Σ* and Θ̂Σ* approach the identity matrix I and (2) how the spectra of these matrices behave [55], such that

‖T̂Σ* − I‖₂ = ‖(T̂ − Θ*)Σ*‖₂ ≤ λ_max(Σ*) ‖T̂ − Θ*‖₂, and
‖Θ̂Σ* − I‖₂ = ‖(Θ̂ − Θ*)Σ*‖₂ ≤ λ_max(Σ*) ‖Θ̂ − Θ*‖₂,  (15)

where λ_max(·) refers to the largest eigenvalue of the input matrix. Obviously, the terms λ_max(Σ*), ‖T̂ − Θ*‖₂ and ‖Θ̂ − Θ*‖₂ are non-negative.
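The inequalities in (15) follow from submultiplicativity of the spectral norm together with Θ*Σ* = I. A quick numerical sanity check on a synthetic SPD Σ* and an arbitrarily perturbed estimator (all matrices made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.normal(size=(d, d))
Sigma_star = A @ A.T + d * np.eye(d)       # synthetic SPD "oracle" covariance
Theta_star = np.linalg.inv(Sigma_star)     # oracle precision: Theta* @ Sigma* = I
Theta_hat = Theta_star + 0.01 * rng.normal(size=(d, d))  # any perturbed estimator

# Left side of (15): distance of Theta_hat @ Sigma* from the identity matrix.
lhs = np.linalg.norm(Theta_hat @ Sigma_star - np.eye(d), 2)
# Right side: largest eigenvalue of Sigma* times the spectral-norm estimation error.
rhs = np.linalg.eigvalsh(Sigma_star).max() * np.linalg.norm(Theta_hat - Theta_star, 2)
```

Since Σ* is SPD, its spectral norm equals λ_max(Σ*), so lhs ≤ rhs holds for any estimator, not only Θ̂ or T̂.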
‖Θ̂ − Θ∗‖₂ are non-negative. When ‖T − Θ∗‖₂ → 0 and ‖Θ̂ − Θ∗‖₂ → 0, the optimal error rates would be achieved asymptotically. In this way, we wonder whether T would converge to Θ∗ faster than Θ̂ in the spectral-norm distance.

Considering Corollaries 3 and 5, the sharp spectral-norm statistical convergence rate of Graphical Lasso is not known in any of the previous studies [13, 17, 27, 56], while the spectral-norm statistical convergence rate of de-sparsified Graphical Lasso can be easily derived from the ℓ∞-norm rate. In this way, we compare CRDA and De-Sparse through the ℓ2-norm statistical convergence rates of their inverse covariance matrix estimators. Thus, we can derive the ℓ2-norm convergence rate as:

‖T − Θ∗‖₂ = O(√(d log d / N)).    (16)

On the other side, [17] demonstrated that the ℓ2-norm convergence rate for the Graphical Lasso estimator Θ̂ is

‖Θ̂ − Θ∗‖₂ = O(√((d + s) log d / N)),

where s has been defined in Corollary 3. We conclude that the convergence rate of T is faster than that of Θ̂. In this way, we consider that De-Sparse would outperform CRDA, as it adopts a better inverse covariance matrix estimator.

4 Experiments

In this section, we first present the setups of our experiments to evaluate the superiority of the proposed De-Sparse framework. Then, we present the experimental results, including the performance comparison between the De-Sparse framework, existing LDA baselines, and other predictive models, followed by a comparison between inverse covariance matrix estimators to support our theoretical analysis of De-Sparse.

4.1 Experiment setups

In this study, to evaluate De-Sparse, we use predictive analytics of disease based on Electronic Health Records (EHR) data.

– Predictive Analytics of Diseases - Given N training samples (i.e., the EHR data of each patient) along with corresponding labels, i.e., (x1, l1), (x2, l2), ..., (xN, lN), where li ∈ {−1, +1} refers to whether patient i is diagnosed with the target disease or not (i.e., positive sample or negative sample), the predictive analytics task is to determine whether a new patient would develop the target disease via classification of the vector x to +1 (diagnosed as the positive result) or −1 (diagnosed as the negative result).
– Performance Metrics - To demonstrate the effectiveness of predictive analytics of diseases, we compared all these methods with other competitors using Accuracy and F1-score. Specifically, Accuracy characterizes the proportion of patients who are accurately classified by the algorithms. F1-score measures both the correctness and the completeness of the prediction. Of course, we also include other metrics, such as sensitivity and specificity, to evaluate the performance of predictive analytics with respect to the medical concerns.
– Data for Evaluation - In the experiments, we use the de-identified EHR data of 200,000 students from ten U.S. universities [28]. Among all diseases recorded, we choose mental health disorders, including anxiety disorders, mood disorders, depression disorders, and other related disorders, as one targeted disease for early detection [57]. We represent each patient using his/her diagnosis-frequency vector based on the clustered codeset (d = 295).

Note that, to prepare the training and testing sets, we use the complete EHR data of the patients who haven't been diagnosed with any of the mental health disorders (negative samples). For patients having been diagnosed with any mental health disorders (positive samples), we collect their EHR data from the first visit to the last visit that was 90 days before the diagnosis of mental health disorders. Thus, we can simulate the early detection of diseases 90 days in advance.

4.2 Design of experiment

To understand the performance impact of De-Sparse beyond classic LDA, we first propose four LDA baseline approaches to compare against De-Sparse; then, three discriminative learning models are used for the comparison:

– LDA Derivatives: LDA, Shrinkage, DIAG and Ye-LDA – The first three algorithms are all based on the common implementation of the generalized Fisher's discriminant analysis listed in (2). Specifically, LDA uses the sample covariance estimation and inverts the covariance matrix using the pseudo-inverse [58] when the matrix inverse is not available; Shrinkage is based on LDA, using a sparse estimation of the sample covariance as follows,

Σβ = β · Σ̄ + (1 − β) · diag(Σ̄) and Θβ = Σβ⁻¹,    (17)

where diag(Σ̄) refers to the diagonal matrix of the sample estimation Σ̄. DIAG is a special Shrinkage approach with β = 0.0. Ye-LDA is derived from [7,
58]. In our research, we focus on studying the improvement of LDA classification caused by (inverse) covariance matrix regularization; thus, we don't compare our method to linear-coefficient-regularized LDA classifiers [4, 10, 11] or heuristic LDA derivations [6].
– Downstream Classifiers: Support Vector Machine (SVM), Logistic Regression (Logit. Reg.) and AdaBoost – Inspired by the previous studies [21, 59] in EHR data mining, we use a linear binary SVM classifier with fine-tuned parameters as well as the Logistic Regression classifier. Further, we compare our algorithm to AdaBoost, where AdaBoost-10 refers to the AdaBoost classifier using 10 Logistic Regression instances and AdaBoost-50 leverages 50 instances.

With the seven baseline algorithms, we perform experiments with training sets of varying sizes and cross-validation. To train the classifiers, we randomly selected 50, 100, 150, 200, and 250 positive patients, and randomly selected the same number of negative patients. Then, we test the classifiers using a testing set with 1000 randomly selected positive patients and the same number of negative patients. Note that there is no overlap between the training set and the paired testing set. All algorithms used in our work are implemented with JSAT and glasso in R.

All the experiments are done with cross-validation using random sampling without replacement and repeated 10 times. Specifically, we compare the performance using various experimental settings, such as the varying parameters for model training and the varying prediction horizons (i.e., 30 days, 60 days and 90 days). We carry out the experiments with varying Days in Advance settings, so as to evaluate the performance of the algorithms for predictive analytics. As was addressed, we actually need to use the past EHR data (before the diagnoses of mental disorders) as the features for prediction. More specifically, for positive samples in both the training and testing datasets, we backtracked their EHR data from their prediction dates. For every positive sample,

4.3.1 Comparisons on accuracy and F1-score

As can be seen from the results in Tables 1, 2, 3, 4, and 5, De-Sparse clearly outperforms the baseline algorithms in terms of overall accuracy and F1-score. Specifically, De-Sparse achieves an 18.6%–21.3% increase in accuracy and a 22.9%–32% increase in F1-score over LDA; De-Sparse achieves a 17.9% increase in accuracy and a 31.5%–40.6% increase in F1-score over DIAG. Compared to Shrinkage and CRDA, the accuracy and F1-score of De-Sparse in most parameter settings are 0.3%–18.9% and 0.14%–71.8% higher, respectively. Compared to SVM, Logistic Regression, and AdaBoost, De-Sparse can achieve 2.3%–19.4% higher accuracy and 7.5%–43.5% higher F1-score. In this case, we can conclude that the classic LDA model cannot perform as well as many other predictive models, such as SVM and AdaBoost. However, De-Sparse significantly outperforms these methods in all settings. Thus, we can conclude that De-Sparse overall outperforms the baseline algorithms in all experimental settings. Note that, though De-Sparse outperforms CRDA marginally, De-Sparse enjoys a tighter upper bound of the error rate.

4.3.2 Trade-off between sensitivity and specificity

– Sensitivity - The sensitivity metric measures the ability of the prediction algorithms to correctly identify the patients with the disease (true positive rate). More specifically, we estimate sensitivity as

Sensitivity = #{Patients with the diseases ∩ Patients predicted as positive} / #{Patients with the diseases}.    (18)

– Specificity - In contrast, the specificity metric characterizes the ability of the algorithms to correctly identify the ones without the disease (true negative rate). More specifically, we estimate specificity as
Table 1 Performance comparison with Training Set: 50×2, Testing Set: 2000×2

Method                    Accuracy          F1-Score          Sensitivity       Specificity

Days in Advance: 30
AdaBoost (×10) 0.637 ± 0.028 0.571 ± 0.057 0.491 ± 0.085 0.783 ± 0.053
AdaBoost (×50) 0.640 ± 0.024 0.570 ± 0.061 0.487 ± 0.093 0.792 ± 0.053
CRDA (λ = 1.0) 0.662 ± 0.017 0.692 ± 0.028 0.762 ± 0.069 0.563 ± 0.058
CRDA (λ = 10.0) 0.670 ± 0.017 0.713 ± 0.010 0.819 ± 0.023 0.520 ± 0.047
CRDA (λ = 100.0) 0.664 ± 0.020 0.713 ± 0.008 0.834 ± 0.033 0.494 ± 0.068
LDA 0.555 ± 0.026 0.565 ± 0.033 0.579 ± 0.048 0.531 ± 0.040
Logistic Regression 0.615 ± 0.055 0.469 ± 0.206 0.395 ± 0.200 0.835 ± 0.094
De-Sparse (λ = 1.0) 0.658 ± 0.019 0.677 ± 0.034 0.723 ± 0.073 0.592 ± 0.050
De-Sparse (λ = 10.0) 0.672 ± 0.015 0.713 ± 0.010 0.813 ± 0.025 0.532 ± 0.042
De-Sparse (λ = 100.0) 0.668 ± 0.018 0.714 ± 0.008 0.830 ± 0.026 0.506 ± 0.056
SVM 0.611 ± 0.026 0.619 ± 0.034 0.632 ± 0.050 0.590 ± 0.029
DIAG 0.568 ± 0.014 0.515 ± 0.026 0.460 ± 0.042 0.676 ± 0.046
Shrinkage (β = 0.25) 0.574 ± 0.014 0.538 ± 0.025 0.499 ± 0.041 0.649 ± 0.045
Shrinkage (β = 0.5) 0.560 ± 0.033 0.438 ± 0.220 0.413 ± 0.210 0.708 ± 0.152
Shrinkage (β = 0.75) 0.560 ± 0.025 0.480 ± 0.163 0.448 ± 0.158 0.672 ± 0.118
Days in Advance: 60
AdaBoost (×10) 0.646 ± 0.021 0.596 ± 0.054 0.531 ± 0.095 0.762 ± 0.057
AdaBoost (×50) 0.639 ± 0.027 0.569 ± 0.083 0.491 ± 0.111 0.788 ± 0.060
CRDA (λ = 1.0) 0.654 ± 0.016 0.690 ± 0.016 0.774 ± 0.067 0.535 ± 0.088
CRDA (λ = 10.0) 0.653 ± 0.019 0.706 ± 0.010 0.833 ± 0.053 0.474 ± 0.083
CRDA (λ = 100.0) 0.643 ± 0.024 0.701 ± 0.028 0.844 ± 0.098 0.443 ± 0.124
De-Sparse (λ = 10.0) 0.661 ± 0.016 0.708 ± 0.009 0.823 ± 0.051 0.499 ± 0.077
De-Sparse (λ = 100.0) 0.649 ± 0.021 0.705 ± 0.020 0.844 ± 0.082 0.454 ± 0.110
Shrinkage (β = 0.5) 0.567 ± 0.013 0.528 ± 0.038 0.489 ± 0.067 0.646 ± 0.059
Shrinkage (β = 0.75) 0.561 ± 0.025 0.477 ± 0.164 0.444 ± 0.163 0.677 ± 0.120
Days in Advance: 90
AdaBoost (×10) 0.627 ± 0.034 0.572 ± 0.063 0.507 ± 0.091 0.747 ± 0.054
AdaBoost (×50) 0.632 ± 0.035 0.575 ± 0.054 0.504 ± 0.077 0.759 ± 0.058
CRDA (λ = 1.0) 0.641 ± 0.018 0.663 ± 0.041 0.716 ± 0.106 0.566 ± 0.091
CRDA (λ = 10.0) 0.651 ± 0.018 0.693 ± 0.034 0.797 ± 0.093 0.505 ± 0.096
CRDA (λ = 100.0) 0.634 ± 0.040 0.675 ± 0.101 0.808 ± 0.188 0.459 ± 0.173
LDA 0.546 ± 0.025 0.532 ± 0.038 0.518 ± 0.058 0.574 ± 0.046
Logistic Regression 0.597 ± 0.058 0.423 ± 0.217 0.351 ± 0.207 0.843 ± 0.096
De-Sparse (λ = 1.0) 0.642 ± 0.022 0.663 ± 0.035 0.710 ± 0.078 0.574 ± 0.060
De-Sparse (λ = 10.0) 0.658 ± 0.016 0.696 ± 0.022 0.787 ± 0.073 0.528 ± 0.084
De-Sparse (λ = 100.0) 0.641 ± 0.031 0.683 ± 0.081 0.808 ± 0.164 0.475 ± 0.148
SVM 0.597 ± 0.034 0.600 ± 0.036 0.606 ± 0.047 0.587 ± 0.046
DIAG 0.568 ± 0.023 0.514 ± 0.048 0.464 ± 0.074 0.672 ± 0.066
Shrinkage (β = 0.25) 0.569 ± 0.020 0.530 ± 0.041 0.490 ± 0.065 0.648 ± 0.054
Shrinkage (β = 0.5) 0.565 ± 0.021 0.519 ± 0.041 0.473 ± 0.059 0.657 ± 0.044
Shrinkage (β = 0.75) 0.559 ± 0.019 0.511 ± 0.040 0.465 ± 0.061 0.653 ± 0.050
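The four metric columns reported in the tables can be computed directly from binary predictions in {−1, +1}, with sensitivity estimated as in (18); a minimal sketch (the function name and helper conventions are illustrative, not from the paper's implementation):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, F1-score, sensitivity (cf. Eq. (18)) and specificity
    for labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))    # diseased, predicted positive
    tn = np.sum((y_true == -1) & (y_pred == -1))  # healthy, predicted negative
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    accuracy = (tp + tn) / y_true.size
    sensitivity = tp / (tp + fn)                  # true positive rate
    specificity = tn / (tn + fp)                  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, f1, sensitivity, specificity
```

For example, eight test patients with three of four positives and three of four negatives classified correctly give 0.75 for all four metrics.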
Table 2 Performance comparison with Training Set: 100×2, Testing Set: 2000×2

Method                    Accuracy          F1-Score          Sensitivity       Specificity

Days in Advance: 30
AdaBoost (×10) 0.632 ± 0.029 0.541 ± 0.095 0.452 ± 0.117 0.812 ± 0.065
AdaBoost (×50) 0.631 ± 0.032 0.538 ± 0.099 0.447 ± 0.120 0.814 ± 0.062
CRDA (λ = 1.0) 0.674 ± 0.012 0.708 ± 0.019 0.792 ± 0.043 0.556 ± 0.029
CRDA (λ = 10.0) 0.675 ± 0.006 0.722 ± 0.008 0.844 ± 0.022 0.507 ± 0.017
CRDA (λ = 100.0) 0.664 ± 0.010 0.718 ± 0.004 0.858 ± 0.031 0.469 ± 0.048
LDA 0.594 ± 0.016 0.592 ± 0.019 0.591 ± 0.027 0.597 ± 0.018
Logistic Regression 0.593 ± 0.054 0.394 ± 0.200 0.305 ± 0.180 0.881 ± 0.075
De-Sparse (λ = 1.0) 0.674 ± 0.018 0.700 ± 0.025 0.765 ± 0.050 0.582 ± 0.026
De-Sparse (λ = 10.0) 0.681 ± 0.006 0.724 ± 0.006 0.838 ± 0.018 0.524 ± 0.020
De-Sparse (λ = 100.0) 0.668 ± 0.009 0.720 ± 0.006 0.854 ± 0.028 0.481 ± 0.041
SVM 0.636 ± 0.016 0.642 ± 0.024 0.655 ± 0.044 0.618 ± 0.025
DIAG 0.594 ± 0.019 0.562 ± 0.034 0.524 ± 0.050 0.663 ± 0.033
Shrinkage (β = 0.25) 0.600 ± 0.020 0.582 ± 0.031 0.559 ± 0.045 0.641 ± 0.022
Shrinkage (β = 0.5) 0.581 ± 0.044 0.467 ± 0.235 0.449 ± 0.228 0.714 ± 0.144
Shrinkage (β = 0.75) 0.599 ± 0.014 0.582 ± 0.020 0.559 ± 0.029 0.639 ± 0.022
Days in Advance: 60
AdaBoost (×10) 0.633 ± 0.024 0.537 ± 0.076 0.439 ± 0.110 0.827 ± 0.067
AdaBoost (×50) 0.623 ± 0.024 0.507 ± 0.065 0.396 ± 0.089 0.850 ± 0.052
CRDA (λ = 1.0) 0.676 ± 0.016 0.711 ± 0.015 0.797 ± 0.041 0.555 ± 0.052
CRDA (λ = 10.0) 0.672 ± 0.019 0.719 ± 0.015 0.837 ± 0.025 0.508 ± 0.039
CRDA (λ = 100.0) 0.668 ± 0.017 0.716 ± 0.013 0.838 ± 0.038 0.498 ± 0.054
De-Sparse (λ = 10.0) 0.676 ± 0.016 0.720 ± 0.012 0.834 ± 0.026 0.518 ± 0.039
De-Sparse (λ = 100.0) 0.671 ± 0.017 0.718 ± 0.012 0.838 ± 0.029 0.504 ± 0.045
Shrinkage (β = 0.5) 0.596 ± 0.035 0.532 ± 0.178 0.513 ± 0.174 0.680 ± 0.113
Shrinkage (β = 0.75) 0.596 ± 0.039 0.532 ± 0.179 0.513 ± 0.175 0.678 ± 0.115
Days in Advance: 90
AdaBoost (×10) 0.626 ± 0.022 0.519 ± 0.061 0.412 ± 0.093 0.840 ± 0.058
AdaBoost (×50) 0.631 ± 0.017 0.523 ± 0.056 0.413 ± 0.087 0.849 ± 0.053
CRDA (λ = 1.0) 0.674 ± 0.013 0.709 ± 0.020 0.796 ± 0.052 0.552 ± 0.047
CRDA (λ = 10.0) 0.674 ± 0.010 0.721 ± 0.006 0.845 ± 0.021 0.502 ± 0.034
CRDA (λ = 100.0) 0.666 ± 0.015 0.719 ± 0.006 0.856 ± 0.025 0.477 ± 0.052
LDA 0.605 ± 0.017 0.607 ± 0.026 0.612 ± 0.045 0.598 ± 0.028
Logistic Regression 0.611 ± 0.036 0.453 ± 0.130 0.345 ± 0.136 0.876 ± 0.067
De-Sparse (λ = 1.0) 0.675 ± 0.013 0.700 ± 0.026 0.764 ± 0.061 0.587 ± 0.045
De-Sparse (λ = 10.0) 0.682 ± 0.007 0.725 ± 0.007 0.840 ± 0.025 0.523 ± 0.030
De-Sparse (λ = 100.0) 0.669 ± 0.013 0.721 ± 0.006 0.853 ± 0.023 0.486 ± 0.046
SVM 0.632 ± 0.017 0.638 ± 0.023 0.649 ± 0.039 0.616 ± 0.026
DIAG 0.597 ± 0.015 0.574 ± 0.039 0.549 ± 0.072 0.644 ± 0.063
Shrinkage (β = 0.25) 0.593 ± 0.034 0.531 ± 0.179 0.517 ± 0.182 0.668 ± 0.120
Shrinkage (β = 0.5) 0.602 ± 0.015 0.589 ± 0.028 0.575 ± 0.053 0.628 ± 0.043
Shrinkage (β = 0.75) 0.599 ± 0.015 0.586 ± 0.025 0.570 ± 0.045 0.629 ± 0.037
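The Shrinkage and DIAG baselines in the tables use the regularized covariance of (17), Σβ = β·Σ̄ + (1 − β)·diag(Σ̄) with Θβ = Σβ⁻¹; a minimal sketch (function name illustrative; rows of X are assumed to be samples):

```python
import numpy as np

def shrinkage_precision(X, beta):
    """Inverse covariance Theta_beta of Eq. (17):
    Sigma_beta = beta * Sigma_bar + (1 - beta) * diag(Sigma_bar)."""
    sigma_bar = np.cov(X, rowvar=False)    # sample covariance Sigma_bar
    sigma_beta = beta * sigma_bar + (1.0 - beta) * np.diag(np.diag(sigma_bar))
    return np.linalg.inv(sigma_beta)       # Theta_beta = Sigma_beta^{-1}
```

Setting beta = 0.0 keeps only the diagonal of Σ̄, which is exactly the DIAG baseline; beta = 1.0 recovers plain LDA's sample covariance (when it is invertible).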
Table 3 Performance comparison with Training Set: 150×2, Testing Set: 2000×2

Method                    Accuracy          F1-Score          Sensitivity       Specificity

Days in Advance: 30
AdaBoost (×10) 0.615 ± 0.010 0.484 ± 0.033 0.363 ± 0.039 0.867 ± 0.024
AdaBoost (×50) 0.615 ± 0.007 0.482 ± 0.025 0.359 ± 0.032 0.871 ± 0.023
CRDA (λ = 1.0) 0.682 ± 0.008 0.723 ± 0.008 0.829 ± 0.021 0.534 ± 0.019
CRDA (λ = 10.0) 0.671 ± 0.013 0.721 ± 0.008 0.851 ± 0.016 0.490 ± 0.035
CRDA (λ = 100.0) 0.662 ± 0.014 0.718 ± 0.007 0.861 ± 0.020 0.464 ± 0.044
LDA 0.613 ± 0.012 0.611 ± 0.018 0.610 ± 0.038 0.615 ± 0.037
Logistic Regression 0.581 ± 0.045 0.352 ± 0.189 0.255 ± 0.142 0.908 ± 0.053
De-Sparse (λ = 1.0) 0.681 ± 0.009 0.712 ± 0.012 0.790 ± 0.028 0.572 ± 0.020
De-Sparse (λ = 10.0) 0.681 ± 0.007 0.727 ± 0.006 0.849 ± 0.013 0.512 ± 0.019
De-Sparse (λ = 100.0) 0.667 ± 0.013 0.720 ± 0.007 0.857 ± 0.020 0.478 ± 0.041
SVM 0.650 ± 0.012 0.660 ± 0.014 0.680 ± 0.024 0.620 ± 0.023
DIAG 0.619 ± 0.014 0.610 ± 0.031 0.600 ± 0.056 0.637 ± 0.037
Shrinkage (β = 0.25) 0.599 ± 0.051 0.500 ± 0.251 0.503 ± 0.256 0.696 ± 0.156
Shrinkage (β = 0.5) 0.611 ± 0.039 0.562 ± 0.189 0.566 ± 0.195 0.656 ± 0.121
Shrinkage (β = 0.75) 0.615 ± 0.009 0.611 ± 0.024 0.608 ± 0.051 0.623 ± 0.045
Days in Advance: 60
AdaBoost (×10) 0.625 ± 0.039 0.512 ± 0.131 0.424 ± 0.156 0.826 ± 0.081
AdaBoost (×50) 0.637 ± 0.024 0.554 ± 0.072 0.466 ± 0.113 0.809 ± 0.068
CRDA (λ = 1.0) 0.677 ± 0.017 0.717 ± 0.015 0.818 ± 0.028 0.536 ± 0.032
CRDA (λ = 10.0) 0.671 ± 0.012 0.721 ± 0.008 0.848 ± 0.022 0.494 ± 0.038
CRDA (λ = 100.0) 0.662 ± 0.014 0.718 ± 0.006 0.861 ± 0.031 0.463 ± 0.055
De-Sparse (λ = 10.0) 0.678 ± 0.011 0.724 ± 0.009 0.843 ± 0.017 0.513 ± 0.023
De-Sparse (λ = 100.0) 0.667 ± 0.014 0.720 ± 0.007 0.856 ± 0.028 0.477 ± 0.050
Shrinkage (β = 0.5) 0.608 ± 0.039 0.548 ± 0.184 0.533 ± 0.181 0.683 ± 0.110
Shrinkage (β = 0.75) 0.618 ± 0.015 0.602 ± 0.027 0.581 ± 0.045 0.655 ± 0.033
Days in Advance: 90
AdaBoost (×10) 0.630 ± 0.023 0.531 ± 0.075 0.436 ± 0.123 0.824 ± 0.082
AdaBoost (×50) 0.630 ± 0.023 0.534 ± 0.078 0.441 ± 0.126 0.820 ± 0.083
CRDA (λ = 1.0) 0.674 ± 0.012 0.708 ± 0.017 0.794 ± 0.045 0.553 ± 0.039
CRDA (λ = 10.0) 0.671 ± 0.011 0.720 ± 0.007 0.845 ± 0.021 0.498 ± 0.035
CRDA (λ = 100.0) 0.663 ± 0.013 0.718 ± 0.004 0.857 ± 0.025 0.470 ± 0.050
LDA 0.611 ± 0.020 0.610 ± 0.025 0.608 ± 0.039 0.614 ± 0.024
Logistic Regression 0.614 ± 0.045 0.463 ± 0.174 0.374 ± 0.180 0.853 ± 0.098
De-Sparse (λ = 1.0) 0.672 ± 0.018 0.693 ± 0.030 0.745 ± 0.065 0.600 ± 0.042
De-Sparse (λ = 10.0) 0.678 ± 0.010 0.722 ± 0.009 0.836 ± 0.026 0.521 ± 0.033
De-Sparse (λ = 100.0) 0.668 ± 0.010 0.720 ± 0.005 0.851 ± 0.022 0.485 ± 0.039
SVM 0.639 ± 0.015 0.645 ± 0.020 0.657 ± 0.035 0.622 ± 0.026
DIAG 0.610 ± 0.012 0.602 ± 0.022 0.590 ± 0.042 0.631 ± 0.031
Shrinkage (β = 0.25) 0.613 ± 0.011 0.608 ± 0.019 0.601 ± 0.036 0.626 ± 0.027
Shrinkage (β = 0.5) 0.602 ± 0.036 0.547 ± 0.183 0.540 ± 0.183 0.665 ± 0.114
Shrinkage (β = 0.75) 0.601 ± 0.036 0.545 ± 0.183 0.536 ± 0.182 0.665 ± 0.113
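The LDA derivatives compared here differ mainly in which precision-matrix estimate they plug into the discriminant rule of (2) — the pseudo-inverse for plain LDA, Θβ for Shrinkage/DIAG, or a (de-sparsified) Graphical Lasso estimate for CRDA/De-Sparse. A minimal sketch of such a plug-in rule, assuming equal class priors (function and variable names are illustrative):

```python
import numpy as np

def lda_predict(X_pos, X_neg, X_test, precision):
    """Plug-in linear discriminant: sign(w^T (x - m)) with
    w = Theta (mu+ - mu-) and midpoint threshold m (equal priors)."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = precision @ (mu_pos - mu_neg)    # discriminant direction
    m = 0.5 * (mu_pos + mu_neg)          # decision threshold at the midpoint
    scores = (X_test - m) @ w
    return np.where(scores >= 0, 1, -1)  # +1: positive, -1: negative
```

Passing np.linalg.pinv of the sample covariance sketches the pseudo-inverse variant; passing shrinkage or graphical-lasso precision estimates sketches the regularized variants.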
Table 4 Performance comparison with Training Set: 200×2, Testing Set: 2000×2

Method                    Accuracy          F1-Score          Sensitivity       Specificity

Days in Advance: 30
AdaBoost (×10) 0.618 ± 0.026 0.485 ± 0.082 0.373 ± 0.115 0.863 ± 0.064
AdaBoost (×50) 0.618 ± 0.022 0.491 ± 0.064 0.377 ± 0.092 0.859 ± 0.052
CRDA (λ = 1.0) 0.688 ± 0.006 0.725 ± 0.007 0.824 ± 0.017 0.553 ± 0.016
CRDA (λ = 10.0) 0.680 ± 0.005 0.725 ± 0.005 0.847 ± 0.013 0.513 ± 0.013
CRDA (λ = 100.0) 0.669 ± 0.011 0.721 ± 0.003 0.855 ± 0.026 0.483 ± 0.047
LDA 0.637 ± 0.006 0.644 ± 0.010 0.655 ± 0.021 0.620 ± 0.020
Logistic Regression 0.598 ± 0.046 0.411 ± 0.175 0.313 ± 0.159 0.883 ± 0.070
De-Sparse (λ = 1.0) 0.686 ± 0.007 0.717 ± 0.007 0.794 ± 0.017 0.578 ± 0.019
De-Sparse (λ = 10.0) 0.684 ± 0.006 0.729 ± 0.005 0.850 ± 0.007 0.519 ± 0.010
De-Sparse (λ = 100.0) 0.673 ± 0.009 0.723 ± 0.004 0.852 ± 0.024 0.494 ± 0.038
SVM 0.660 ± 0.012 0.671 ± 0.012 0.693 ± 0.014 0.626 ± 0.015
DIAG 0.623 ± 0.013 0.603 ± 0.024 0.575 ± 0.041 0.671 ± 0.029
Shrinkage (β = 0.25) 0.628 ± 0.013 0.621 ± 0.023 0.610 ± 0.039 0.646 ± 0.024
Shrinkage (β = 0.5) 0.619 ± 0.042 0.565 ± 0.190 0.560 ± 0.190 0.678 ± 0.110
Shrinkage (β = 0.75) 0.633 ± 0.012 0.629 ± 0.019 0.624 ± 0.034 0.642 ± 0.022
Days in Advance: 60
AdaBoost (×10) 0.605 ± 0.023 0.445 ± 0.085 0.325 ± 0.074 0.885 ± 0.033
AdaBoost (×50) 0.616 ± 0.010 0.479 ± 0.038 0.356 ± 0.048 0.876 ± 0.032
CRDA (λ = 1.0) 0.684 ± 0.006 0.721 ± 0.006 0.818 ± 0.019 0.549 ± 0.023
CRDA (λ = 10.0) 0.674 ± 0.008 0.722 ± 0.006 0.844 ± 0.019 0.505 ± 0.026
CRDA (λ = 100.0) 0.673 ± 0.010 0.721 ± 0.006 0.845 ± 0.021 0.502 ± 0.035
De-Sparse (λ = 10.0) 0.682 ± 0.006 0.726 ± 0.007 0.844 ± 0.017 0.520 ± 0.014
De-Sparse (λ = 100.0) 0.675 ± 0.008 0.722 ± 0.006 0.843 ± 0.022 0.508 ± 0.031
Shrinkage (β = 0.5) 0.620 ± 0.040 0.565 ± 0.189 0.557 ± 0.187 0.683 ± 0.110
Shrinkage (β = 0.75) 0.616 ± 0.039 0.557 ± 0.186 0.544 ± 0.183 0.688 ± 0.109
Days in Advance: 90
AdaBoost (×10) 0.626 ± 0.033 0.507 ± 0.107 0.411 ± 0.153 0.840 ± 0.088
AdaBoost (×50) 0.632 ± 0.028 0.533 ± 0.092 0.441 ± 0.135 0.823 ± 0.080
CRDA (λ = 1.0) 0.682 ± 0.008 0.722 ± 0.008 0.825 ± 0.017 0.540 ± 0.020
CRDA (λ = 10.0) 0.664 ± 0.012 0.718 ± 0.006 0.856 ± 0.025 0.472 ± 0.044
CRDA (λ = 100.0) 0.656 ± 0.016 0.715 ± 0.005 0.865 ± 0.029 0.447 ± 0.058
LDA 0.631 ± 0.014 0.630 ± 0.018 0.631 ± 0.034 0.630 ± 0.032
Logistic Regression 0.605 ± 0.060 0.424 ± 0.232 0.353 ± 0.222 0.857 ± 0.107
De-Sparse (λ = 1.0) 0.684 ± 0.010 0.714 ± 0.014 0.789 ± 0.031 0.579 ± 0.020
De-Sparse (λ = 10.0) 0.676 ± 0.008 0.724 ± 0.004 0.852 ± 0.019 0.500 ± 0.030
De-Sparse (λ = 100.0) 0.658 ± 0.015 0.716 ± 0.005 0.863 ± 0.029 0.452 ± 0.057
SVM 0.657 ± 0.009 0.669 ± 0.015 0.693 ± 0.031 0.621 ± 0.024
DIAG 0.625 ± 0.013 0.614 ± 0.029 0.601 ± 0.055 0.648 ± 0.045
Shrinkage (β = 0.25) 0.627 ± 0.014 0.617 ± 0.030 0.604 ± 0.056 0.651 ± 0.043
Shrinkage (β = 0.5) 0.626 ± 0.013 0.616 ± 0.027 0.603 ± 0.051 0.650 ± 0.042
Shrinkage (β = 0.75) 0.626 ± 0.014 0.617 ± 0.023 0.604 ± 0.045 0.649 ± 0.043
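Stepping back to the analysis: the inequality in (15) is sub-multiplicativity of the spectral norm combined with ‖Σ∗‖₂ = λmax(Σ∗) for a symmetric positive semi-definite Σ∗, and it can be checked numerically. A quick synthetic sketch (dimensions and random matrices are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 30
B = rng.normal(size=(d, d))
sigma = B @ B.T / d                 # symmetric PSD stand-in for Sigma*
E = rng.normal(size=(d, d))         # stand-in for the error T - Theta*

lhs = np.linalg.norm(E @ sigma, 2)  # ||(T - Theta*) Sigma*||_2
rhs = np.max(np.linalg.eigvalsh(sigma)) * np.linalg.norm(E, 2)
assert lhs <= rhs + 1e-9            # the bound of Eq. (15) holds
```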
Table 5 Performance comparison with Training Set: 250×2, Testing Set: 2000×2

Method                    Accuracy          F1-Score          Sensitivity       Specificity

Days in Advance: 30
AdaBoost (×10) 0.620 ± 0.037 0.484 ± 0.110 0.380 ± 0.147 0.860 ± 0.076
AdaBoost (×50) 0.625 ± 0.033 0.499 ± 0.097 0.394 ± 0.138 0.856 ± 0.074
CRDA (λ = 1.0) 0.689 ± 0.010 0.726 ± 0.009 0.824 ± 0.021 0.553 ± 0.025
CRDA (λ = 10.0) 0.677 ± 0.012 0.722 ± 0.009 0.840 ± 0.020 0.513 ± 0.029
CRDA (λ = 100.0) 0.666 ± 0.014 0.719 ± 0.007 0.853 ± 0.027 0.479 ± 0.050
LDA 0.644 ± 0.009 0.645 ± 0.012 0.648 ± 0.023 0.640 ± 0.020
Logistic Regression 0.605 ± 0.057 0.424 ± 0.204 0.339 ± 0.200 0.870 ± 0.089
De-Sparse (λ = 1.0) 0.690 ± 0.007 0.719 ± 0.007 0.791 ± 0.022 0.589 ± 0.027
De-Sparse (λ = 10.0) 0.684 ± 0.009 0.726 ± 0.008 0.837 ± 0.015 0.531 ± 0.012
De-Sparse (λ = 100.0) 0.671 ± 0.012 0.721 ± 0.008 0.848 ± 0.026 0.494 ± 0.039
SVM 0.663 ± 0.013 0.673 ± 0.015 0.694 ± 0.024 0.632 ± 0.023
DIAG 0.633 ± 0.011 0.619 ± 0.028 0.599 ± 0.055 0.668 ± 0.046
Shrinkage (β = 0.25) 0.625 ± 0.044 0.569 ± 0.192 0.562 ± 0.193 0.689 ± 0.108
Shrinkage (β = 0.5) 0.626 ± 0.044 0.569 ± 0.192 0.561 ± 0.192 0.691 ± 0.106
Shrinkage (β = 0.75) 0.639 ± 0.011 0.633 ± 0.022 0.624 ± 0.039 0.653 ± 0.025
Days in Advance: 60
AdaBoost (×10) 0.635 ± 0.026 0.539 ± 0.087 0.449 ± 0.141 0.820 ± 0.091
AdaBoost (×50) 0.634 ± 0.027 0.536 ± 0.089 0.445 ± 0.144 0.823 ± 0.091
CRDA (λ = 1.0) 0.692 ± 0.006 0.729 ± 0.006 0.827 ± 0.014 0.557 ± 0.015
CRDA (λ = 10.0) 0.682 ± 0.008 0.730 ± 0.004 0.860 ± 0.019 0.504 ± 0.031
CRDA (λ = 100.0) 0.674 ± 0.014 0.726 ± 0.005 0.864 ± 0.025 0.483 ± 0.051
De-Sparse (λ = 10.0) 0.689 ± 0.004 0.733 ± 0.004 0.854 ± 0.013 0.524 ± 0.017
De-Sparse (λ = 100.0) 0.676 ± 0.013 0.727 ± 0.005 0.863 ± 0.024 0.488 ± 0.048
Shrinkage (β = 0.5) 0.642 ± 0.010 0.634 ± 0.015 0.620 ± 0.028 0.663 ± 0.028
Shrinkage (β = 0.75) 0.641 ± 0.010 0.636 ± 0.012 0.627 ± 0.021 0.655 ± 0.022
Days in Advance: 90
AdaBoost (×10) 0.633 ± 0.027 0.536 ± 0.089 0.447 ± 0.140 0.818 ± 0.086
AdaBoost (×50) 0.631 ± 0.026 0.535 ± 0.087 0.445 ± 0.137 0.818 ± 0.085
CRDA (λ = 1.0) 0.686 ± 0.006 0.721 ± 0.009 0.813 ± 0.029 0.558 ± 0.026
CRDA (λ = 10.0) 0.675 ± 0.007 0.720 ± 0.006 0.838 ± 0.021 0.512 ± 0.028
CRDA (λ = 100.0) 0.671 ± 0.009 0.719 ± 0.004 0.844 ± 0.028 0.497 ± 0.043
LDA 0.648 ± 0.009 0.648 ± 0.018 0.651 ± 0.037 0.644 ± 0.025
Logistic Regression 0.628 ± 0.028 0.520 ± 0.095 0.427 ± 0.146 0.828 ± 0.090
De-Sparse (λ = 1.0) 0.687 ± 0.009 0.713 ± 0.014 0.778 ± 0.033 0.597 ± 0.022
De-Sparse (λ = 10.0) 0.683 ± 0.006 0.725 ± 0.008 0.839 ± 0.021 0.527 ± 0.018
De-Sparse (λ = 100.0) 0.673 ± 0.008 0.720 ± 0.005 0.841 ± 0.024 0.505 ± 0.037
SVM 0.666 ± 0.009 0.672 ± 0.014 0.687 ± 0.030 0.644 ± 0.023
DIAG 0.635 ± 0.015 0.621 ± 0.030 0.601 ± 0.053 0.668 ± 0.032
Shrinkage (β = 0.25) 0.638 ± 0.012 0.631 ± 0.027 0.621 ± 0.051 0.656 ± 0.032
Shrinkage (β = 0.5) 0.642 ± 0.011 0.635 ± 0.026 0.626 ± 0.050 0.657 ± 0.032
Shrinkage (β = 0.75) 0.641 ± 0.010 0.635 ± 0.024 0.628 ± 0.046 0.655 ± 0.030
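The estimator-level comparison that follows evaluates how much each precision estimator improves on the inverted sample covariance Θ̄ = Σ̄⁻¹ via an error-reduction measure (cf. (20)). A minimal sketch, measuring the reduction as the drop in squared spectral-norm error against a known ground-truth precision ΘGT (names are illustrative; a known ΘGT is assumed, as in a simulation):

```python
import numpy as np

def error_reduction(theta, theta_bar, theta_gt):
    """Squared spectral-norm error reduction of estimator `theta`
    over the inverted sample covariance `theta_bar`,
    measured against the ground truth `theta_gt` (cf. Eq. (20))."""
    err_sample = np.linalg.norm(theta_gt - theta_bar, 2) ** 2
    err_est = np.linalg.norm(theta_gt - theta, 2) ** 2
    return err_sample - err_est   # positive: theta improves on theta_bar
```

A positive value means the regularized estimator is closer to the truth than the plain sample-based inverse; a negative value (as observed for some Shrinkage settings in Fig. 2a) means it is worse.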
compared to typical LDA. On the opposite side of the trade-off, when compared to CRDA (based on graphical lasso), De-Sparse on average gains 2.3% higher specificity while sacrificing 1.4% sensitivity.

4.3.3 Discussion on the performance comparison

We consider testing accuracy and F1-score as the two primary metrics for the evaluation, as these two metrics well characterize the performance of classifiers. Thus, we conclude that De-Sparse overall outperforms the baseline algorithms, including LDA, SVM, Logistic Regression, and other classifiers, in all experimental settings. In terms of the trade-off between sensitivity and specificity, we argue that De-Sparse still outperforms the original LDA classifier and the CRDA classifiers, considering the requirements of predictive analytics of diseases and early diagnosis. While the original LDA classifier well balances sensitivity and specificity, both CRDA and De-Sparse would incorporate slightly higher sensitivity, compared to the original LDA, while having lower specificity. In this way, CRDA and De-Sparse could discover more patients potentially with the diseases, but also slightly raise the frequency of false alarms. We believe, compared to the marginal increase of false alarms, the improvement of sensitivity should be

of Θ from the inverse of the sample covariance estimation (i.e., Θ̄ = Σ̄⁻¹) as

R(Θ) = ‖Θ − Θ̄‖₂² − ‖ΘGT − Θ‖₂²,    (20)

where Θ = Θ̂ for CRDA, Θ = T for De-Sparse, and Θ = Θβ for Shrinkage LDA. We repeated the above steps (1)–(4) 100 times, and illustrate the average error reduction R(Θ) in Fig. 2a, with varying parameters and settings.

Figure 2a demonstrates that the estimators used in De-Sparse (T) and CRDA (Θ̂) outperform the sample estimation in all settings, while the DIAG and Shrinkage estimators (i.e., Θβ with β = 0.0, 0.25, 0.5, and 0.75) may cause even higher estimation error (with negative error reduction) when the number of samples increases. Figure 2b illustrates the trend of error reduction for CRDA and De-Sparse. Though the difference between these two
algorithms is not visible at such a scale, we can observe that these two algorithms achieve the maximal error reduction when the number of samples is 150 in our experiments, while the error reduction is low when the number of samples is relatively small (50) or large (250). This is because, when the sample size is small, both the sample-based estimation (Θ̄) and

covariance-regularized discriminant analysis for classification under high-dimensional low sample size (HDLSS) settings. More specifically, we take care of the applications to the predictive analytics of diseases using Electronic Health Records (EHRs) data and the common diagnosis-frequency data representation. To understand the performance of LDA, we extend the existing theory [14, 25] and propose a novel analytical model characterizing the error rate of LDA classification under the uncertainty of parameter estimation. Based on the analytical model, we propose De-Sparse – a novel LDA classifier using de-sparsified Graphical Lasso. We further compare De-Sparse with the covariance-regularized discriminant analysis (CRDA) based on common Graphical Lasso. The experimental results on real-world Electronic Health Record (EHR) datasets show that De-Sparse outperforms all baseline algorithms. We interpret the comparison of results and demonstrate the advantage of the proposed methods in medicare settings. Further, the empirical studies on estimator comparison validate our analysis.

References

1. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, Hoboken
2. Peck R, Ness JV (1982) The use of shrinkage estimators in linear discriminant analysis. IEEE Trans Pattern Anal Mach Intell 5:530–537
3. Xiong H, Cheng W, Bian J, Hu W, Sun Z, Guo Z (2018) DBSDA: Lowering the bound of misclassification rate for sparse linear discriminant analysis via model debiasing. IEEE Trans Neural Netw Learning Syst 30(3):707–717
4. Buhlmann P, Van De Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin
5. Krzanowski WJ, Jonathan P, McCarthy WV, Thomas MR (1995) Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Appl Stat, pp 101–115
6. Belhumeur PN, Hespanha JP, Kriegman DJ (1996) Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. In: ECCV (1), vol 1064. Springer, pp 45–58
7. Ye J, Janardan R, Li Q (2004) Two-dimensional linear discriminant analysis.
13. Cai TT, Ren Z, Zhou HH, et al. (2016) Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics 10(1):1–59
14. Zollanvari A, Dougherty ER (2013) Random matrix theory in pattern classification: An application to error estimation. In: 2013 Asilomar Conference on Signals, Systems and Computers
15. Marčenko VA, Pastur LA (1967) Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik 1(4):457
16. Johnstone IM (2001) On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pp 295–327
17. Rothman AJ, Bickel PJ, Levina E, Zhu J, et al. (2008) Sparse permutation invariant covariance estimation. Electronic Journal of Statistics
18. … electronic health records (EHRs): a survey. ACM Computing Surveys (CSUR) 50(6):1–40
19. Wang F, Sun J (2015) PSF: A unified patient similarity evaluation framework through metric learning with weak supervision. IEEE J Biomed Health Informatics 19(3):1053–1060
20. Sun J, Wang F, Hu J, Edabollahi S (2012) Supervised patient similarity measure of heterogeneous patient records. ACM SIGKDD Explorations Newsletter 14(1):16–24
21. Ng K, Sun J, Hu J, Wang F (2015) Personalized predictive modeling and risk factor identification using patient similarity. AMIA Summit on Clinical Research Informatics (CRI)
22. Zhang J, Xiong H, Huang Y, Wu H, Leach K, Barnes L (2015) MSEQ: Early detection of anxiety and depression via temporal orders of diagnoses in electronic health data. In: 2015 International Conference on Big Data (Workshop), IEEE
23. Jensen S, SPSS UK (2001) Mining medical data for predictive and sequential patterns: PKDD 2001. In: Proceedings of the 5th European conference on principles and practice of knowledge discovery in databases
24. Liu C, Wang F, Hu J, Xiong H (2015) Temporal phenotyping from longitudinal electronic health records: A graph based framework. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15. ACM, New York, pp 705–714
25. Lachenbruch PA, Mickey RM (1968) Estimation of error rates in discriminant analysis. Technometrics 10(1):1–11
26. Bian J, Barnes L, Chen G, Xiong H (2017) Early detection of diseases using electronic health records data and covariance-regularized linear discriminant analysis. In: IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), 2017
27. Jankova J, van de Geer S, et al. (2015) Confidence intervals for high-dimensional inverse covariance estimation. Electronic J Stat 9(1):1205–1229
28. Turner JC, Keller A (2015) College Health Surveillance Network: Epidemiology and health care utilization of college students at U.S. 4-year universities. Journal of American College Health, pp 530–538
29. Van Vleck TT, Elhadad N (2010) Corpus-based problem selection for EHR note summarization. In: AMIA Annual Symposium Proceedings, vol 2010, p 817. American Medical Informatics Association
30. Yu S, Berry D, Bisbal J (2011) Performance analysis and assessment of a TF-IDF based archetype-SNOMED-CT binding algorithm. In: 2011 24th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, pp 1–6
31. Shen F, Sohn S, Rastegar-Mojarad M, Liu S, Pankratz JJ, Hatton MA, Sowada N, Shrestha OK, Shurson SL, Liu H (2017) Populating physician biographical pages based on EMR data. AMIA Summits on Translational Science Proceedings 2017:522
32. Luhn HP (1957) A statistical approach to mechanized encoding and searching of literary information. IBM J Res Dev 1(4):309–317
33. Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation
34. Aizawa A (2003) An information-theoretic perspective of tf–idf measures. Information Processing & Management 39(1):45–65
35. Dubberke ER, Reske KA, McDonald LC, Fraser VJ (2006) ICD-9 codes and surveillance for Clostridium difficile-associated disease. Emerging Infectious Diseases 12(10):1576
36. Kowsari K, Meimandi KJ, Heidarysafa M, Mendu S, Barnes L, Brown D (2019) Text classification algorithms: A survey.
43. Shickel B, Tighe PJ, Bihorac A, Rashidi P (2017) Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Informatics 22(5):1589–1604
44. Solares JRA, Raimondi FED, Zhu Y, Rahimian F, Canoy D, Tran J, Gomes ACP, Payberah AH, Zottoli M, Nazarzadeh M, et al. (2020) Deep learning for electronic health records: A comparative review of multiple deep neural architectures. J Biomed Inform 101:103337
45. HCUP (2014) Appendix A - Clinical classification software - diagnoses
46. Sun L, Zhang X, Qian Y, Xu J, Zhang S (2019) Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inform Sci 502:18–41
47. Sun L, Zhang X, Qian Y, Xu J, Zhang S, Tian Y (2019) Joint neighborhood entropy-based gene selection method with Fisher score for tumor classification. Appl Intell 49(4):1245–1259
48. Chen L, Wang S (2012) Automated feature weighting in naive Bayes for high-dimensional data classification. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp 1243–1252
49. Wan H, Wang H, Guo G, Wei X (2017) Separability-oriented subclass discriminant analysis. IEEE Trans Pattern Anal Mach Intell 40(2):409–422
50. Yang X, Jiang X, Tian C, Wang P, Zhou F, Fujita H (2020) Inverse projection group sparse representation for tumor classification: A low rank variation dictionary approach. Knowl.-Based Syst 196:105768
51. Xiao Q, Dai J, Luo J, Fujita H (2019) Multi-view manifold regularized learning-based method for prioritizing candidate disease miRNAs. Knowl.-Based Syst 175:118–129
52. Marozzi M (2015) Multivariate multidistance tests for high-dimensional low sample size case-control studies. Stat Med 34(9):1511–1526
890 Information 10(4):150 53. Field C (1982) Small sample asymptotic expansions for multivari- 953
R
891 37. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, ate m-estimates. Ann Stat, 672–689 954
892 Tejedor-Sojo J, Sun J (2016) Multi-layer representation learning 54. Blanchard G, Kawanabe M, Sugiyama M, Spokoiny V (2006) 955
R
893 for medical concepts. In: Proceedings of the 22nd ACM SIGKDD Klaus-Robert MÞller In search of non-gaussian components of a 956
894 International conference on knowledge discovery and data mining, high-dimensional distribution. J Mach Learn Res 7(Feb):247–282 957
O
895 pp 1495–1504 55. Zollanvari A, Braga-Neto UM, Dougherty ER (2011) Analytic 958
C
896 38. Zhang J, Kowsari K, Harrison JH, Lobo JM, Barnes LE (2018) study of performance of error estimators for linear discriminant 959
897 Patient2vec: A personalized interpretable deep representation of analysis. IEEE Trans Signal Process 59(9):4238–4255 960
N
898 the longitudinal electronic health record. IEEE Access 6:65333– 56. Banerjee O, El Ghaoui L, d’Aspremont A (2008) Model selection 961
65346 through sparse maximum likelihood estimation for multivariate
U
899 962
900 39. Choi E, Bahadori MT, Le S, Stewart WF, Sun J (2017) gaussian or binary data. J Mach Learn Res 9(Mar):485–516 963
901 Gram: graph-based attention model for healthcare representation 57. Kendler KS, Hettema JM, Butera F, Gardner CO, Prescott CA 964
902 learning. In: Proceedings of the 23rd ACM SIGKDD international (2003) Life event dimensions of loss, humiliation, entrapment, 965
903 conference on knowledge discovery and data mining, pp 787–795 and danger in the prediction of onsets of major depression and 966
904 40. Bai T, Zhang S, Egleston BL, Vucetic S (2018) Interpretable generalized anxiety. Arch Gen Psychiatry 60(8):789–796 967
905 representation learning for healthcare via capturing disease 58. Ye J, Janardan R, Park CH, Park H (2004) An optimization 968
906 progression through time. In: Proceedings of the 24th ACM criterion for generalized discriminant analysis on undersampled 969
907 SIGKDD international conference on knowledge discovery & data problems. IEEE Trans Pattern Anal Mach Intell 26(8):982–994 970
908 mining, pp 43–51 59. Huang SH, LePendu P, Iyer SV, Tai-Seale M, Carrell D, Shah NH 971
909 41. Ma T, Xiao C, Wang F (2018) Health-atm: A deep architecture (2014) Toward personalizing treatment for depression: predicting 972
910 for multifaceted patient health record representation and risk diagnosis and severity. J Am Med Inform Assoc 21(6):1069–1075 973
911 prediction. In: Proceedings of the 2018 SIAM International 60. Altman DG, Bland JM (1994) Diagnostic tests. 1: Sensitivity and 974
912 Conference on Data Mining. SIAM, pp 261–269 specificity. Br Med J 308(6943):1552 975
913 42. Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu
914 PJ, Liu X, Marcus J, Sun M, et al. (2018) Scalable and accurate
915 deep learning with electronic health records. NPJ Digital Medicine Publisher’s note Springer Nature remains neutral with regard to 976
916 1(1):18 jurisdictional claims in published maps and institutional affiliations. 977
AUTHOR'S PROOF JrnlID 10489 ArtID 1810 Proof#1 - 07/07/2020
Affiliations
Sijia Yang1 · Haoyi Xiong2 · Kaibo Xu3 · Licheng Wang1 · Jiang Bian4 · Zeyi Sun3

1 School of Cyberspace Security, State Key Laboratory of Networking and Switching, Beijing University of Posts and Telecommunications, Haidian, Beijing, China
2 Department of Computer Science, Missouri University of Science and Technology, Rolla, MO 65409, USA
3 Mininglamp Academy of Sciences, Mininglamp Technology, Beijing, 100084, China
4 Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, USA
AUTHOR QUERIES