bioRxiv preprint first posted online Aug. 26, 2017; doi: http://dx.doi.org/10.1101/177410. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
All rights reserved. No reuse allowed without permission.

GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

Runmin Wei 1,2,#, Jingye Wang 1,#,*, Erik Jia 3, Tianlu Chen 4, Yan Ni 1, Wei Jia 1

1 University of Hawaii Cancer Center, Honolulu, HI 96813, USA
2 Department of Molecular Biosciences and Bioengineering, University of Hawaii at Manoa, Honolulu, HI 96822, USA
3 Punahou School, Honolulu, HI 96822, USA
4 Shanghai Key Laboratory of Diabetes Mellitus and Center for Translational Medicine, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai 200233, China

# These authors contributed equally to this work.
* Corresponding Author: Jingye Wang, MD, MPH, Email: [email protected]

ABSTRACT
Motivation: Left-censored missing values commonly exist in targeted metabolomics
datasets and can be considered as missing not at random (MNAR). Improper data
processing procedures for missing values will cause adverse impacts on subsequent
statistical analyses. However, few imputation methods have been developed and
applied to the situation of MNAR in the field of metabolomics. Thus, a practical
left-censored missing value imputation method is urgently needed.

Results: We have developed an iterative Gibbs sampler based left-censored missing
value imputation approach (GSimp). We compared GSimp with three other
imputation methods on two real-world targeted metabolomics datasets and one
simulation dataset using our imputation evaluation pipeline. The results show that
GSimp outperforms the other imputation methods in terms of imputation accuracy,
observation distribution, univariate and multivariate analyses, and statistical
sensitivity.

Availability and implementation: The R code for GSimp, the evaluation pipeline, a
vignette, and the real-world and simulated targeted metabolomics datasets are available at:
https://github.com/WandeRum/GSimp.

1 INTRODUCTION
Missing values are commonly observed in mass spectrometry (MS) based
metabolomics datasets. Many statistical methods require a complete dataset, which
makes missing data an inevitable problem for subsequent data analysis. Generally,
there are three types of missing values: missing not at random (MNAR), missing at
random (MAR) and missing completely at random (MCAR) (Gelman and Hill, 2006;
Little and Rubin, 2002). Unexpected missing values are considered MCAR if they
originate from random errors and stochastic fluctuations during the data acquisition
process (e.g., incomplete derivatization or ionization). MAR assumes that the probability
of a variable being missing depends on other observed variables (Gelman and Hill,
2006; Little and Rubin, 2002). Thus, missing values due to suboptimal data
preprocessing, e.g., inaccurate peak detection and deconvolution of co-eluting
compounds, can be defined as MAR. Targeted metabolomics studies have been widely
used for the accurate quantification of specific groups of metabolites. Because of the limit
of quantification (LOQ), missing values are usually caused by signal
intensities lower than the LOQ, also known as left-censored missing values, which can be
classified as MNAR.

The processing of missing values, which has been developed and studied for MS data,
is an indispensable step in the metabolomics data processing pipeline (Hrydziuszko
and Viant, 2012). One simple but naïve solution is the substitution of missing values by
determined values, such as zero, half of the minimum value (HM) or LOQ/c, where c
denotes a positive integer. Determined value substitutions, although commonly
applied for dealing with missing values in metabolomics studies (Guo et al., 2015;
Liu et al., 2016; Butte et al., 2015), can significantly affect the subsequent statistical
analyses in different ways, e.g., by underestimating the variances of missing variables, decreasing
statistical power, or fabricating pseudo-clusters among observations (Gelman and Hill,
2006). Advanced statistical imputation methods have been developed for –omics
studies, e.g., k-nearest neighbors (kNN) imputation (Troyanskaya et al., 2001),
singular value decomposition (SVD) imputation (Hastie et al., 1999; Stacklies et al.,
2007), and random forest (RF) imputation (Stekhoven and Bühlmann, 2012). Several
metabolomics data analysis software tools provide different methods of dealing with
missing values (Mak et al., 2014; Katajamaa et al., 2006; Kessler et al., 2013;
Luedemann et al., 2012; Xia et al., 2015). MetaboAnalyst (Xia et al., 2009, 2012,
2015), one widely used metabolomics analysis toolkit, provides Probabilistic PCA
(PPCA), Bayesian PCA (BPCA) and SVD imputation. However, these methods are
mainly aimed at imputing MCAR/MAR values and are not suitable for the situation of MNAR.
Only a limited number of approaches dealing with left-censored missing values have been
applied by researchers (Lazar et al., 2016; Shah et al., 2017). The quantile regression
approach for left-censored missing values (QRILC) imputes missing data using random
draws from a truncated distribution with parameters estimated using quantile
regression (Lazar et al., 2016). Although this imputation preserves the overall distribution
of the missing parts better than determined value substitutions, it may produce random
results since no additional information is used for the prediction of the missing parts. Another
imputation method recently developed for MNAR is k-nearest neighbor truncation
(kNN-TN) (Shah et al., 2017). This approach applies Maximum
Likelihood Estimators (MLE) for the means and standard deviations of missing
variables based on a truncated normal distribution. A Pearson correlation based
kNN imputation method is then applied to the standardized data. Although the authors
stated that kNN-TN could impute both MNAR and MAR values, the imputed values are
entirely dependent on the nearest neighbors, and no constraint is placed upon the
imputation. Thus, this approach might cause an overestimation of missing values.

To reduce adverse effects caused by missing values during metabolomics data
analyses, we developed a left-censored missing value imputation framework, GSimp,
where a prediction model was embedded in an iterative Gibbs sampler. We then
compared GSimp with HM, QRILC, and kNN-TN on two real-world metabolomics
datasets and one simulation dataset to demonstrate the advantages of GSimp
regarding imputation accuracy, observation distribution, univariate analysis,
multivariate analysis and sensitivity. Our findings indicate that GSimp is a robust
method to handle left-censored missing values in targeted metabolomics studies.

2 METHODS
2.1 Dataset
Diabetes datasets
We employed datasets from a study comparing serum metabolites between obese
subjects with diabetes mellitus (N=70) and healthy controls (N=130), where N
represents the number of observations. Dataset 1: a total of 42 free fatty acids (FFAs)
were identified and quantified in those participants in order to evaluate their FFA
profiles (Ni et al., 2015). Dataset 2: a total of 34 bile acids (BAs) were identified and
quantified in a similar way using a different analytical protocol (Lei et al., 2017).
Simulation dataset
For the simulation dataset, we first calculated the covariance matrix Cov based on the
whole diabetes dataset (P=76), where P represents the number of variables. Then we
generated two separate data matrices, each with 80 observations, from
multivariate normal distributions, representing two different biological groups. For
each data matrix, the sample mean of each variable was drawn from a normal
distribution $N(0, 0.5)$ and Cov was kept using SVD. Then, the two data matrices were
stacked vertically (row-wise) into a complete data matrix (N×P=160×76)
so that group differences were simulated and the covariance was kept.
2.2 MNAR generation
For the two real-world targeted metabolomics datasets, we generated a series of MNAR
datasets by varying the missing proportion (number of missing variables/number of total
variables) from 0.1 to 0.6 in steps of 0.05, with the MNAR cut-off for each missing
variable drawn from a uniform distribution $U(0.1, 0.5)$. The elements lower than the
corresponding cut-off were removed and replaced with NA. For the simulation dataset,
we generated a series of MNAR datasets by varying the missing proportion from 0.1 to
0.8 in steps of 0.1, with the MNAR cut-off drawn from $U(0.3, 0.6)$ for more rigorous
testing.
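
As a rough illustration of this masking procedure, the following Python sketch (the paper's own code is in R) selects a proportion of variables, draws a cut-off for each from a uniform distribution, and removes the values below it. Reading the cut-off as a quantile of the selected variable is an assumption, as are the function name `generate_mnar` and the use of `None` in place of NA.

```python
import random

def generate_mnar(data, miss_prop, cut_lo=0.1, cut_hi=0.5, seed=0):
    """Return a copy of `data` (list of rows) with left-censored values set to None.

    Assumption: each cut-off drawn from U(cut_lo, cut_hi) is read as a quantile
    of the selected variable; values below that quantile are removed.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    n_miss_vars = round(miss_prop * n_cols)
    miss_vars = rng.sample(range(n_cols), n_miss_vars)
    out = [list(row) for row in data]
    for j in miss_vars:
        q = rng.uniform(cut_lo, cut_hi)
        column = sorted(row[j] for row in data)
        # Empirical quantile: the value at position floor(q * n) in sorted order.
        threshold = column[int(q * n_rows)]
        for row in out:
            if row[j] < threshold:
                row[j] = None
    return out, sorted(miss_vars)
```

Because only values below the threshold are removed, every masked value in a variable is smaller than every value that remains, which is the defining property of left-censoring.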
2.3 Prediction model

A prediction model was employed for the prediction of missing values by setting a
targeted missing variable as the outcome and the other variables as predictors. Different
prediction models, e.g., linear regression, elastic net (Zou and Hastie, 2005),
regression trees (Breiman et al., 1984) and random forest (Breiman, 2001), could
be embedded in our imputation framework. Elastic net was applied in our approach as
an ideal prediction model considering its stability, accuracy, and efficiency. This
model is a regularized regression that combines the L1 and L2 penalties of the
LASSO (Tibshirani, 1996) and ridge (Hoerl and Kennard, 1970) methods. The
estimates of the regression coefficients in elastic net are defined as

$$\hat{\beta} = \underset{\beta}{\arg\min} \left( \|y - X\beta\|^2 + \lambda \left[ (1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1 \right] \right)$$

The L2 penalty $(1-\alpha)\|\beta\|_2^2/2$ improves the model’s robustness by controlling the
multicollinearity among variables, which is widespread in high-dimensional
–omics data. The L1 penalty $\alpha\|\beta\|_1$ controls the number of predictors by
assigning zero coefficients to the "unnecessary" predictors. From a Bayesian point of
view, the regularization is a mixture of Gaussian and Laplacian prior distributions of the
coefficients, which pulls the full model of maximum likelihood estimates
towards the null model of the prior coefficient distribution, and thus controls the
risk of overfitting and increases the model robustness. The R package glmnet was used for
the elastic net. We set the hyperparameters $\lambda$ to 0.01 (default setting for
high-dimensional data) and $\alpha$ to 0.5 (an equal mixture of LASSO and ridge
penalties) (Friedman et al., 2015).
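
To make the role of the two hyperparameters concrete, the sketch below evaluates an elastic net objective of the form RSS + λ[(1 − α)‖β‖₂²/2 + α‖β‖₁] for a given coefficient vector. It is a plain Python illustration rather than the glmnet implementation; the function name is hypothetical, the intercept is omitted, and glmnet additionally scales the residual term by 1/(2n), so absolute values will differ.

```python
def elastic_net_objective(X, y, beta, lam=0.01, alpha=0.5):
    """Penalized least-squares objective: RSS + lam * [(1-alpha)*L2/2 + alpha*L1]."""
    rss = 0.0
    for row, target in zip(X, y):
        pred = sum(b * v for b, v in zip(beta, row))
        rss += (target - pred) ** 2
    l2 = sum(b * b for b in beta)    # squared L2 norm (ridge part)
    l1 = sum(abs(b) for b in beta)   # L1 norm (LASSO part)
    return rss + lam * ((1 - alpha) * l2 / 2 + alpha * l1)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
# beta = [1, 2] fits y exactly, so only the penalty term remains.
print(elastic_net_objective(X, y, [1.0, 2.0]))
```

Setting `alpha=1` leaves only the LASSO term and `alpha=0` only the ridge term; `alpha=0.5` weights the two equally, as in the setting described above.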
2.4 Gibbs sampler
The Gibbs sampler is a Markov chain Monte Carlo (MCMC) technique that sequentially
updates each parameter while the others are held fixed. It can be used to generate posterior
samples. For each missing variable in the dataset, we applied a Gibbs sampler to
impute the missing values by sampling from a truncated normal distribution, with the
prediction model's fitted value as the mean and the root mean square deviation (RMSD) of the
missing part as the standard deviation, truncated at specified cut-points. Assume
we have an n × p data matrix $X = (X_1, X_2, X_3, \ldots, X_p)$ with only one variable $X_j$
containing left-censored missing values. We denote $X_j$ as $y$, its missing part as $y_m$
with length $m$, its non-missing part as $y_f$ with length $f$, and the rest of the matrix as $X_{-j}$.
We can then set the lower truncation point $lo$ to $-\infty$ (centered data) or 0 (original
data) and the upper truncation point $hi$ to the minimum value of $y_f$ or a given LOQ. The truncation bounds
ensure that imputation results are constrained within $[lo, hi]$. The Gibbs sampler
approach can then be described by the following steps:
Step-1 (initialization): we initialize the missing values (QRILC in our case), and get $y$;
Step-2 (prediction): we then build a prediction model (elastic net in our case): $y \sim X_{-j}$;
Step-3 (estimation): based on the prediction model, we get the predicted value $\hat{y}$ and
the root mean square deviation (RMSD) of the missing part,
$\sigma = \sqrt{\sum_{i=1}^{m} (y'_{m_i} - \hat{y}_{m_i})^2 / m}$, where
$y'_{m_i}$ and $\hat{y}_{m_i}$ are the $i$th initialized/imputed value and fitted value, respectively;
Step-4 (sampling): we draw a sample $\tilde{y}_{m_i}$ from a truncated normal distribution
with mean $\hat{y}_{m_i}$ and standard deviation $\sigma$, truncated to $[lo, hi]$, for the $i$th missing element and update $y$.

We iteratively repeat step-2 to step-4 and update $X_j$.
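
The four steps can be sketched for a single variable as follows. This is a minimal Python illustration under stated assumptions, not the authors' R implementation: a one-predictor least-squares line stands in for the elastic net, a constant start value stands in for the QRILC initialization, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def sample_truncated_normal(mean, sd, lo, hi, rng):
    """One draw from N(mean, sd^2) restricted to [lo, hi], via the inverse CDF."""
    if sd <= 0:
        return min(max(mean, lo), hi)
    p_lo = _STD_NORMAL.cdf((lo - mean) / sd) if lo > -math.inf else 0.0
    p_hi = _STD_NORMAL.cdf((hi - mean) / sd) if hi < math.inf else 1.0
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1 - 1e-12)  # inv_cdf requires 0 < u < 1
    draw = mean + sd * _STD_NORMAL.inv_cdf(u)
    return min(max(draw, lo), hi)      # guard against rounding at the bounds

def fit_line(x, y):
    """Ordinary least squares for y ~ a + b * x (stand-in for the elastic net)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope

def gibbs_impute(x, y, lo=-math.inf, hi=None, iters=100, seed=0):
    """Impute None entries of y from one fully observed predictor x."""
    rng = random.Random(seed)
    missing = [i for i, v in enumerate(y) if v is None]
    if not missing:
        return list(y)
    observed_min = min(v for v in y if v is not None)
    hi = observed_min if hi is None else hi
    # Step 1: initialize (the paper uses QRILC; a constant no larger than hi here).
    start = min(observed_min, hi)
    y = [start if v is None else v for v in y]
    for _ in range(iters):
        intercept, slope = fit_line(x, y)                        # Step 2
        fitted = {i: intercept + slope * x[i] for i in missing}  # Step 3
        sd = math.sqrt(sum((y[i] - fitted[i]) ** 2 for i in missing) / len(missing))
        for i in missing:                                        # Step 4
            y[i] = sample_truncated_normal(fitted[i], sd, lo, hi, rng)
    return y
```

Each pass refits the model on the current imputed values, so the fitted means and the RMSD-based standard deviation are updated together, while the truncation keeps every draw at or below `hi`.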
2.5 GSimp framework
A whole data matrix X = (X1, X2, X3, …, Xp) contains k (k ≤ p)
left-censored missing variables. We present our imputation framework as the following
algorithm.
Algorithm: Gibbs sampler based left-censored missing value imputation approach

Require: X an n × p data matrix, iters_all the number of iterations for imputing the
whole matrix X, iters_each the number of iterations for imputing each missing
variable, a vector of upper limits U (+∞ for non-missing variables) and a vector of
lower limits L (-∞ for non-missing variables), each with length p.

1. X′ ← initialize the missing values for X;
2. K ← vector of indices of missing variables in X, ordered by increasing amount of
   missing values;
3. for 1:iters_all do
4.   for j in K do
5.     y′ ← X′_j; y′ can be divided into two parts: y′_m is a vector of the
       imputed part (original missing part) with length m and y′_f is a vector
       of the non-missing part with length f, where n = m + f;
6.     X′_{-j} represents the matrix X′ with the jth column removed;
7.     lo ← L_j and hi ← U_j;
8.     for 1:iters_each do
9.       Gibbs sampler step 2 to 4;
10.    end for
11.    Update X′_j;
12.  end for
13. end for
14. return X′
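
The outer structure of the algorithm (ordering the incomplete variables and re-imputing them sequentially) can be sketched in Python as below. This is an illustrative outline only: `impute_column` is a hypothetical callback standing in for the inner Gibbs sampler loop, the column-minimum initialization replaces QRILC, and a column with no observed values is not handled.

```python
def gsimp_outer_loop(X, impute_column, iters_all=20):
    """Sequentially re-impute every incomplete column of X (a list of rows).

    `impute_column(X_current, j, missing_rows)` is a hypothetical callback that
    returns new values for the missing rows of column j (e.g. a Gibbs sampler).
    """
    n_rows, n_cols = len(X), len(X[0])
    missing = {j: [i for i in range(n_rows) if X[i][j] is None] for j in range(n_cols)}
    # K: incomplete columns, ordered from fewest to most missing values.
    K = sorted((j for j in missing if missing[j]), key=lambda j: len(missing[j]))
    # Initialization with the column minimum (the paper uses QRILC instead).
    current = [list(row) for row in X]
    for j in K:
        col_min = min(X[i][j] for i in range(n_rows) if X[i][j] is not None)
        for i in missing[j]:
            current[i][j] = col_min
    for _ in range(iters_all):
        for j in K:
            new_values = impute_column(current, j, missing[j])
            for i, value in zip(missing[j], new_values):
                current[i][j] = value  # update before moving to the next column
    return current, K
```

Updating the working matrix immediately after each column means that later, more heavily censored variables are predicted from the already refined values of the earlier ones.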
2.6 Other imputation approaches
Three other left-censored missing value imputation/substitution methods were included in
our study for performance comparison:
• kNN-TN (Truncation k-nearest neighbors imputation) (Shah et al., 2017): this
method applies a Newton-Raphson (NR) optimization to estimate the truncated
mean and standard deviation. Then, Pearson correlation is calculated based on
the standardized data, followed by correlation-based kNN imputation.
• QRILC (Quantile Regression Imputation of Left-Censored data) (Lazar, 2015):
this method imputes missing elements by randomly drawing from a truncated
distribution estimated by a quantile regression. The R package imputeLCMD was
used for this imputation approach.
• HM (Half of the Minimum): this method replaces missing elements with half of
the minimum of the non-missing elements in the corresponding variable.
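
Of the three comparison methods, HM is simple enough to state in a few lines. The following Python sketch (function name hypothetical, `None` standing in for NA) shows the substitution for one variable:

```python
def half_min_impute(column):
    """Replace None entries with half of the minimum observed value."""
    observed = [v for v in column if v is not None]
    fill = min(observed) / 2
    return [fill if v is None else v for v in column]


print(half_min_impute([None, 4.0, 2.0, None]))  # [1.0, 4.0, 2.0, 1.0]
```

Every missing element receives the same constant, which is why this kind of substitution shrinks the variance of the imputed variable.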

2.7 Assessments of performance

The assessments of imputation performance were conducted using an imputation
evaluation pipeline from our previous study with both unlabeled and labeled
measurements (Wei et al., 2017), which is accessible through:
https://github.com/WandeRum/MVI-evaluation. Unlabeled measurements include the
NRMSE-based sum of ranks (SOR) and principal component analysis (PCA)-Procrustes
analysis, while labeled measurements include correlation analysis for univariate results and
partial least squares (PLS)-Procrustes analysis. The R package vegan was used for
Procrustes analysis (Oksanen, 2015) and ropls was used for PLS analysis
(Thévenot et al., 2015).
Furthermore, we evaluated the impacts of different imputation methods on the
statistical sensitivity of detecting biological variations. On the simulation dataset, we
calculated p-values from Student’s t-tests between the two groups for the original as well as
the imputed datasets. We marked a set $S$ as the real differential variables at a significance level
of p-cutoff (e.g., 0.05) from the original simulation data, and a set $S'$ as the detected
differential variables at the same significance level from the imputed simulation data. Then
we calculated the true positive rate $TPR = |S \cap S'| / |S|$ to evaluate the effects of
different imputation methods in terms of detecting differential variables.
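
For concreteness, the true positive rate can be computed as the share of originally significant variables that remain significant after imputation, i.e. |S ∩ S'| / |S|. The Python sketch below assumes the p-values are already available as dictionaries keyed by variable name; the function name and the example values are hypothetical.

```python
def true_positive_rate(p_original, p_imputed, p_cutoff=0.05):
    """TPR = |S ∩ S'| / |S| for p-values keyed by variable name."""
    S = {k for k, p in p_original.items() if p < p_cutoff}
    S_prime = {k for k, p in p_imputed.items() if p < p_cutoff}
    if not S:
        raise ValueError("no differential variables in the original data")
    return len(S & S_prime) / len(S)


p_original = {"a": 0.001, "b": 0.02, "c": 0.30, "d": 0.04}
p_imputed = {"a": 0.002, "b": 0.20, "c": 0.01, "d": 0.03}
# S = {a, b, d}, S' = {a, c, d}: two of the three real signals are recovered.
print(true_positive_rate(p_original, p_imputed))
```

Variable "c" is significant only after imputation; it does not enter the TPR, which counts recovered true signals rather than false discoveries.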

3 RESULTS
3.1 Gibbs sampler in GSimp

A variable containing missing elements from the FFA dataset was randomly selected to
track the sequence of corresponding parameters and estimates across the first 500
iterations out of a total of 2000 (100 × 20) iterations using GSimp. From Figure 1, we
can observe that both the fitted value $\hat{y}$ and the sample value $\tilde{y}$ converge
after a number of iterations, and the standard deviation estimate $\sigma$ drops to a steady state with
small values. In addition, the upper bound on the distribution of $\tilde{y}$ indicates that
it was drawn from a truncated normal distribution.

3.2 Imputation comparisons

We evaluated four different MNAR imputation/substitution methods on the FFA and BA
targeted metabolomics datasets and the simulation dataset. First, we measured the imputation
performance using label-free approaches. SOR was used to measure the imputation
accuracy regarding the imputed values of each missing variable. From the upper panel
of Figure 2, we can observe that GSimp has the best performance, with the lowest
SOR across all numbers of missing variables in both the FFA and BA datasets. To
measure the extent of imputation-induced distortion of the observation distributions,
PCA-Procrustes analysis was conducted between the original data and the imputed data.
The lower panel of Figure 2 shows that GSimp has the lowest Procrustes sum of
squared errors compared to the other methods, which means that GSimp preserved the overall
observation distribution of the original dataset with the least distortion.

Then, we measured the imputation performance with binary labels provided. We
compared the results of univariate and multivariate analyses for the imputed and original
datasets. Since this is a case-control study, Student’s t-tests were applied for the univariate
analyses. We then compared the results by calculating Pearson’s correlation between the
log-transformed p-values calculated from the imputed and original data for the missing
variables. Again, GSimp performs best, with the highest correlations among the four
methods (upper panel of Figure 3) across different numbers of missing variables,
and it implies that GSimp preserves the most biological variation in the univariate
analysis results. For the multivariate analyses, we applied PLS-DA to distinguish the
group differences. Similarly, we conducted PLS-Procrustes analysis, with PLS
employed as a supervised dimension reduction technique. The lower panel of Figure 3
demonstrates that GSimp best restores the original observation distribution, with
the lowest Procrustes sum of squared errors among the four imputation methods.

On the simulation dataset, we compared QRILC, kNN-TN, and GSimp using the same
approaches. Consistent results were observed (Supplemental figure 1): GSimp
shows the best performance on the simulation dataset, with the lowest SOR and
PCA/PLS-Procrustes sum of squared errors and the highest correlation of univariate
analysis results. Moreover, to examine how the different imputation methods affect
statistical power, we calculated the TPR as the capacity to detect
differential variables in the differently imputed datasets. Again, with p-cutoffs of both
0.05 and 0.01, GSimp shows the overall highest TPR across different numbers of missing variables
(Figure 4). This implies that GSimp impairs the sensitivity to the least extent among the
three methods, which is reasonable since GSimp also keeps the highest correlation of
p-values in the previous comparisons.
4 DISCUSSION

The purpose of this study is to develop a left-censored missing value imputation
approach for targeted metabolomics data analysis. We compared GSimp with three
other imputation methods (namely kNN-TN, QRILC, and HM), and the results suggest that
GSimp is superior to the others across different evaluation methods. To illustrate the
performance of GSimp, we randomly selected one variable containing missing values
from the FFA dataset (Figure 5) to compare the imputed values and original values.
Although determined value substitution (e.g., HM) is widely used by researchers in
the field of metabolomics, our results indicate that HM can severely distort the
data distribution (upper left panel of Figure 5), thus impairing subsequent analyses. In
comparison, QRILC preserved the overall data distribution and variances (upper right panel
of Figure 5). However, random values can be generated by this approach since
QRILC imputes each missing variable independently without utilizing the predictive
information from other variables. The statistical learning based method, kNN-TN, applies
a correlation based kNN algorithm with parameters of the missing variables estimated
from truncated normal distributions. This method utilizes the information of variables highly
correlated with the targeted missing variable, and thus preserves a linear trend between
original values and imputed values. However, since no constraint is applied to the
imputation, a right shift of the missing part might occur, causing imputed values to
exceed the truncation point (lower left panel of Figure 5). In contrast, GSimp utilizes
the predictive information of other variables by employing a prediction model and
simultaneously holds a truncated normal distribution for each missing element, which
ensures a favorable linear trend between imputed and original values as well as a
reasonable bound for the imputed values (lower right panel of Figure 5).

In our approach, a truncated normal distribution was used to constrain the
imputation results in the Gibbs sampler steps. We applied the minimum observed value of the
missing variable as an informative upper truncation point and -∞ as a non-informative
lower truncation point, considering the situation of left-censored missing values. Other values
could also be applied in real-world metabolomics analyses; for example, a known LOQ of a
metabolite can be set as an upper truncation point. Additionally, when the signal intensity
of a certain compound exceeds the upper limit of the quantification range or saturates
during instrument analysis, an informative lower truncation point could
correspondingly be applied for the right-censored missing value. Moreover, when
non-informative bounds for both upper and lower limits (e.g., +∞, -∞) are applied,
GSimp could be extended to the situation of MCAR/MAR. With the flexible
usage of upper and lower limits, our approach may provide a versatile and powerful
imputation technique for different missing types. Other –omics datasets with missing
values (especially MNAR), e.g., single cell RNA-sequencing data, could apply this
method with few modifications of our default settings. Thus, it is worth evaluating
our approach, GSimp, in other complex scenarios in the future.
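
The choice of truncation bounds for the different missing types discussed here can be summarized in a small helper. This Python sketch is illustrative only: the function and argument names are hypothetical, and using the maximum observed value as the default lower bound for right-censored data is an assumption made by analogy with the left-censored default.

```python
import math

def truncation_bounds(observed, missing_type, loq=None, uloq=None):
    """Return (lo, hi) for the truncated normal draw of one variable."""
    if missing_type == "left":      # below the limit of quantification
        return -math.inf, (loq if loq is not None else min(observed))
    if missing_type == "right":     # above the quantification range / saturated
        return (uloq if uloq is not None else max(observed)), math.inf
    if missing_type == "random":    # MCAR/MAR: no informative bound
        return -math.inf, math.inf
    raise ValueError(f"unknown missing_type: {missing_type!r}")


print(truncation_bounds([3.0, 5.0], "left"))           # (-inf, 3.0)
print(truncation_bounds([3.0, 5.0], "left", loq=2.5))  # (-inf, 2.5)
```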

Since GSimp employs an iterative Gibbs sampler, a large number of
iterations (iters_all=20, iters_each=100) is preferable for the convergence of
parameters. However, when we tested different numbers of iterations on the simulation
dataset, far fewer iterations (iters_all=10, iters_each=50) did not severely affect
the imputation accuracy (Supplemental figure 2). Across iterations for the whole data
matrix, we applied a sequential imputation procedure for missing variables, from the
fewest missing values to the most. Such a sequential approach improves
imputation performance compared to a parallel imputation approach.
5 CONCLUSION
A practical left-censored missing value imputation method is needed in the field of
metabolomics. We developed a new imputation approach, GSimp, that outperforms the
traditional determined value substitution method (HM) and other approaches (QRILC
and kNN-TN) for MNAR situations. GSimp utilizes the predictive information of
variables and simultaneously holds a truncated normal distribution for each missing element
by embedding a prediction model into the Gibbs sampler framework.
With proper modifications of the parameter settings, e.g., truncation points, GSimp
may be applicable to different types of missing values and to different -omics
studies, and thus deserves to be further explored in the future.
REFERENCES
Breiman,L. et al. (1984) Classification and Regression Trees.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Butte,N.F. et al. (2015) Global metabolomic profiling targeting childhood obesity in
the Hispanic population. Am. J. Clin. Nutr., 102, 256–267.
Friedman,A.J. et al. (2015) Lasso and Elastic-Net Regularized Generalized Linear
Models. Available online:
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf (Verified 29 July
2015).
Gelman,A. and Hill,J. (2006) Data analysis using regression and
multilevel/hierarchical models.
Guo,L. et al. (2015) Plasma metabolomic profiles enhance precision medicine for
volunteers of normal health. Proc. Natl. Acad. Sci., 112, E4901–E4910.
Hastie,T. et al. (1999) Imputing missing data for gene expression arrays. Tech. Report,
Div. Biostat. Stanford Univ., 1–9.
Hoerl,A.E. and Kennard,R.W. (1970) Ridge Regression: Biased Estimation for
Nonorthogonal Problems. Technometrics, 12, 55–67.
Hrydziuszko,O. and Viant,M.R. (2012) Missing values in mass spectrometry based
metabolomics: An undervalued step in the data processing pipeline.
Metabolomics, 8, 161–174.
Katajamaa,M. et al. (2006) MZmine: toolbox for processing and visualization of mass
spectrometry based molecular profile data. Bioinformatics, 22, 634–6.
Kessler,N. et al. (2013) MeltDB 2.0-advances of the metabolomics software system.
Bioinformatics, 29, 2452–2459.
Lazar,C. et al. (2016) Accounting for the Multiple Natures of Missing Values in
Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.
J. Proteome Res., 15, 1116–1125.
Lei,S. et al. (2017) The ratio of dihomo-γ-linolenic acid to deoxycholic acid species is
a potential biomarker for the metabolic abnormalities in obesity. FASEB J.,
fj.201700055R.
Liu,J.-J. et al. (2016) Profiling of plasma metabolites suggests altered mitochondrial
fuel usage and remodelling of sphingolipid metabolism in individuals with type 2
diabetes and kidney disease. Kidney Int. Reports, 2, 470–480.
Luedemann,A. et al. (2012) TagFinder: Preprocessing software for the fingerprinting
and the profiling of gas chromatography-mass spectrometry based metabolome
analyses. Methods Mol. Biol., 860, 255–286.
Mak,T.D. et al. (2014) MetaboLyzer: A novel statistical workflow for analyzing
postprocessed LC-MS metabolomics data. Anal. Chem., 86, 506–513.
Ni,Y. et al. (2015) Circulating Unsaturated Fatty Acids Delineate the Metabolic Status
of Obese Individuals. EBioMedicine, 2, 1513–1522.
Oksanen,J. (2015) Multivariate Analysis of Ecological Communities in R: vegan
tutorial.
Shah,J.S. et al. (2017) Distribution based nearest neighbor imputation for truncated
high dimensional data with applications to pre-clinical and clinical metabolomics
studies. BMC Bioinformatics, 18, 114.
Stacklies,W. et al. (2007) pcaMethods - A bioconductor package providing PCA
methods for incomplete data. Bioinformatics, 23, 1164–1167.
Stekhoven,D.J. and Bühlmann,P. (2012) Missforest-Non-parametric missing value
imputation for mixed-type data. Bioinformatics, 28, 112–118.
Thévenot,E.A. et al. (2015) Analysis of the Human Adult Urinary Metabolome
Variations with Age, Body Mass Index, and Gender by Implementing a
Comprehensive Workflow for Univariate and OPLS Statistical Analyses. J.
Proteome Res., 14, 3322–3335.
Tibshirani,R. (1996) Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc.
B, 58, 267–288.
Troyanskaya,O. et al. (2001) Missing value estimation methods for DNA microarrays.
Bioinformatics, 17, 520–525.
Wei,R. et al. (2017) Missing Value Imputation Approach for Mass
Spectrometry-based Metabolomics Data. bioRxiv.
Xia,J. et al. (2009) MetaboAnalyst: A web server for metabolomic data analysis and
interpretation. Nucleic Acids Res., 37.
Xia,J. et al. (2012) MetaboAnalyst 2.0-a comprehensive server for metabolomic data
analysis. Nucleic Acids Res., 40.
Xia,J. et al. (2015) MetaboAnalyst 3.0-making metabolomics more meaningful.
Nucleic Acids Res., 43, W251–W257.
Zou,H. and Hastie,T. (2005) Regularization and variable selection via the elastic net. J.
R. Stat. Soc. Ser. B Stat. Methodol., 67, 301–320.

Figure legends

Fig. 1. Sequential parameter updating in GSimp

The first 500 of a total of 2000 (100×20) iterations of GSimp, where
ŷ, ỹ and σ represent the fitted value, sample value and standard deviation,
respectively.

Fig. 2. Evaluations of different imputation methods using unlabeled approaches

SOR on the FFA dataset (upper left) and BA dataset (upper right) across different
numbers of missing variables for four imputation methods: HM (red circle),
QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross).
PCA-Procrustes sum of squared errors on the FFA dataset (lower left) and BA dataset
(lower right) across different numbers of missing variables for the same four
imputation methods: HM (red circle), QRILC (green triangle), GSimp (blue square),
and kNN-TN (purple cross).

Fig. 3. Evaluations of different imputation methods using labeled approaches

Pearson's correlation between log-transformed p-values of Student's t-tests on the FFA
dataset (upper left) and BA dataset (upper right) across different numbers of
missing variables for four imputation methods: HM (red circle), QRILC (green
triangle), GSimp (blue square), and kNN-TN (purple cross). PLS-Procrustes sum of
squared errors on the FFA dataset (lower left) and BA dataset (lower right) across
different numbers of missing variables for the same four imputation methods: HM (red
circle), QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross).

Figure 4. Evaluations of different imputation methods using TPR for various p-cutoffs on the simulation dataset

TPR across different numbers of missing variables for three imputation
methods: QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross),
at p-cutoffs of 0.05 (left panel) and 0.01 (right panel).

Figure 5. Comparisons of imputed values and original values on an example variable

Scatter plots of imputed values (X-axis) against original values (Y-axis) for one example
missing variable, with non-missing elements shown as blue dots and missing
elements as red dots, for four imputation methods: HM (upper left), QRILC
(upper right), kNN-TN (lower left), and GSimp (lower right). Rug plots show the
distributions of imputed values and original values.

Supplemental Figure 1. Evaluations of different imputation methods on the simulation dataset

SOR (upper left), PCA-Procrustes sum of squared errors (upper right), Pearson's
correlation between log-transformed p-values of Student's t-tests (lower left), and
PLS-Procrustes sum of squared errors (lower right) on the simulation dataset across
different numbers of missing variables for three imputation methods: QRILC
(green triangle), GSimp (blue square), and kNN-TN (purple cross).

Supplemental Figure 2. Evaluations of different numbers of iterations using GSimp on the simulation dataset

SOR on the simulation dataset across different numbers of missing variables for
four iteration settings: iters_each=50 and iters_all=20 (red circle),
iters_each=100 and iters_all=20 (green triangle), iters_each=50 and iters_all=10
(blue square), iters_each=100 and iters_all=10 (purple cross).

[Figure 1: traces of ỹ, ŷ and σ.]
[Figure 2: panels FFA and BA; x-axis: Number of missing variables; y-axes: SOR (upper) and PCA-Procrustes SS (lower); legend: HM, QRILC, GSimp, kNN-TN.]
[Figure 3: panels FFA and BA; x-axis: Number of missing variables; y-axes: Correlation (upper) and PLS-Procrustes SS (lower); legend: HM, QRILC, GSimp, kNN-TN.]
[Figure 4: panels p-cutoff=0.05 and p-cutoff=0.01; x-axis: Number of missing variables; y-axis: TPR; legend: QRILC, GSimp, kNN-TN.]
[Figure 5: panels HM, QRILC, kNN-TN and GSimp; x-axis: Imputed Value; y-axis: Original Value.]