bioRxiv preprint first posted online Aug. 26, 2017; doi: http://dx.doi.org/10.1101/177410.
The copyright holder for this preprint (which was not peer-reviewed) is the author/funder,
who has granted bioRxiv a license to display the preprint in perpetuity.
All rights reserved. No reuse allowed without permission.

GSimp: A Gibbs sampler based left-censored missing value
imputation approach for metabolomics studies

Runmin Wei1, 2, #, Jingye Wang1, #*, Erik Jia3, Tianlu Chen4, Yan Ni1, Wei Jia1

1 University of Hawaii Cancer Center, Honolulu, HI 96813, USA
2 Department of Molecular Biosciences and Bioengineering, University of Hawaii at
Manoa, Honolulu, HI 96822, USA
3 Punahou School, Honolulu, HI 96822, USA
4 Shanghai Key Laboratory of Diabetes Mellitus and Center for Translational
Medicine, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai
200233, China

# These authors contributed equally to this work.
* Corresponding Author: Jingye Wang, MD, MPH, Email: [email protected]

ABSTRACT
Motivation: Left-censored missing values commonly exist in targeted metabolomics
datasets and can be considered as missing not at random (MNAR). Improper data
processing procedures for missing values will cause adverse impacts on subsequent
statistical analyses. However, few imputation methods have been developed and
applied to the situation of MNAR in the field of metabolomics. Thus, a practical
left-censored missing value imputation method is urgently needed.

Results: We have developed an iterative Gibbs sampler based left-censored missing
value imputation approach (GSimp). We compared GSimp with three other
imputation methods on two real-world targeted metabolomics datasets and one
simulation dataset using our imputation evaluation pipeline. The results show that
GSimp outperforms the other imputation methods in terms of imputation accuracy,
observation distribution, univariate and multivariate analyses, and statistical
sensitivity.

Availability and implementation: The R code for GSimp, the evaluation pipeline, a
vignette, and the real-world and simulated targeted metabolomics datasets are available at:
https://github.com/WandeRum/GSimp.

1 INTRODUCTION
Missing values are commonly observed in mass spectrometry (MS) based
metabolomics datasets. Many statistical methods require a complete dataset, which
makes missing data an inevitable problem for subsequent data analysis. Generally,
there are three types of missing values: missing not at random (MNAR), missing at
random (MAR) and missing completely at random (MCAR) (Gelman and Hill, 2006;
Little and Rubin, 2002). Unexpected missing values are considered MCAR if they
originate from random errors and stochastic fluctuations during the data acquisition
process (e.g., incomplete derivatization or ionization). MAR assumes that the probability
of a variable being missing depends on other observed variables (Gelman and Hill,
2006; Little and Rubin, 2002). Thus, missing values due to suboptimal data
preprocessing, e.g., inaccurate peak detection and deconvolution of co-eluting
compounds, can be defined as MAR. Targeted metabolomics studies have been widely
used for the accurate quantification of specific groups of metabolites. Because of the
limit of quantification (LOQ), missing values are usually caused by signal
intensities lower than the LOQ; such values are known as left-censored missing values
and can be assigned to MNAR.

The processing of missing values, an indispensable step in the metabolomics data
processing pipeline, has been developed and studied for MS data (Hrydziuszko
and Viant, 2012). One simple but naïve solution is to substitute missing values with
determined values, such as zero, half of the minimum value (HM) or LOQ/c where c
denotes a positive integer. Determined value substitutions, although commonly
applied for dealing with missing values in metabolomics studies (Guo et al., 2015;
Liu et al., 2016; Butte et al., 2015), can significantly affect the subsequent statistical
analyses in different ways, e.g., by underestimating the variances of missing variables,
decreasing statistical power, or fabricating pseudo-clusters among observations (Gelman
and Hill, 2006). Advanced statistical imputation methods have been developed for –omics
studies, e.g., k-nearest neighbors (kNN) imputation (Troyanskaya et al., 2001),
singular value decomposition (SVD) imputation (Hastie et al., 1999; Stacklies et al.,
2007), and random forest (RF) imputation (Stekhoven and Bühlmann, 2012). Several
metabolomics data analysis software tools provide different methods for dealing with
missing values (Mak et al., 2014; Katajamaa et al., 2006; Kessler et al., 2013;
Luedemann et al., 2012; Xia et al., 2015). MetaboAnalyst (Xia et al., 2009, 2012,
2015), a widely used metabolomics analysis toolkit, provides Probabilistic PCA
(PPCA), Bayesian PCA (BPCA) and SVD imputation. However, these methods
mainly target MCAR/MAR and are not suitable for MNAR.
Only a limited number of approaches dealing with left-censored missing values have
been applied by researchers (Lazar et al., 2016; Shah et al., 2017). Quantile regression
imputation of left-censored data (QRILC) imputes missing data using random
draws from a truncated distribution with parameters estimated using quantile
regression (Lazar et al., 2016). Although this imputation preserves the overall distribution
of the missing parts better than determined value substitutions, it may produce random
results since no information from other variables is used to predict the missing parts.
Another imputation method recently developed for MNAR is k-nearest neighbor truncation
(kNN-TN) (Shah et al., 2017). This approach applies maximum
likelihood estimators (MLE) for the means and standard deviations of missing
variables based on a truncated normal distribution. A Pearson correlation based
kNN imputation is then performed on the standardized data. Although the authors
stated that kNN-TN could impute both MNAR and MAR, the imputed values are
entirely dependent on the nearest neighbors and no constraint is placed upon the
imputation. Thus, this approach might overestimate missing values.

To reduce adverse effects caused by missing values during metabolomics data
analyses, we developed a left-censored missing value imputation framework, GSimp,
where a prediction model was embedded in an iterative Gibbs sampler. We then
compared GSimp with HM, QRILC, and kNN-TN on two real-world metabolomics
datasets and one simulation dataset to demonstrate the advantages of GSimp
regarding imputation accuracy, observation distribution, univariate analysis,
multivariate analysis and sensitivity. Our findings indicate that GSimp is a robust
method to handle left-censored missing values in targeted metabolomics studies.

2 METHODS
2.1 Dataset
Diabetes datasets
We employed datasets from a study comparing serum metabolites between obese
subjects with diabetes mellitus (N=70) and healthy controls (N=130), where N
represents the number of observations. Dataset 1: a total of 42 free fatty acids (FFAs)
were identified and quantified in those participants in order to evaluate their FFA
profiles (Ni et al., 2015). Dataset 2: a total of 34 bile acids (BAs) were identified and
quantified in a similar way using a different analytical protocol (Lei et al., 2017).
Simulation dataset
For the simulation dataset, we first calculated the covariance matrix Cov based on the
whole diabetes dataset (P=76), where P represents the number of variables. Then we
generated two separate data matrices, each with 80 observations, from
multivariate normal distributions, representing two different biological groups. For
each data matrix, the sample mean of each variable was drawn from a normal
distribution $N(0, 0.5)$ and Cov was preserved using SVD. The two data matrices were
then stacked vertically (row-wise) into a complete data matrix (N×P=160×76)
so that group differences were simulated and the covariance was preserved.
2.2 MNAR generation
For the two real-world targeted metabolomics datasets, we generated a series of MNAR
datasets with the missing proportion (number of missing variables/number of total
variables) ranging from 0.1 to 0.6 in steps of 0.05, with the MNAR cut-off for each missing
variable drawn from a uniform distribution $U(0.1, 0.5)$. The elements lower than the
corresponding cut-off were removed and replaced with NA. For the simulation dataset,
we generated a series of MNAR datasets with the missing proportion ranging from 0.1 to
0.8 in steps of 0.1, with the MNAR cut-off drawn from $U(0.3, 0.6)$ for more rigorous
testing.
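
The MNAR generation procedure can be sketched in a few lines. The block below is an illustrative Python sketch (the published GSimp code is in R), and it rests on one assumption that the text does not state explicitly: the cut-off drawn from the uniform distribution is interpreted as a quantile level of each selected variable. The function name `generate_mnar` and its parameters are hypothetical.

```python
import random


def generate_mnar(data, var_prop, cut_range=(0.1, 0.5), seed=0):
    """Return a censored copy of `data` (list of rows) and the censored column indices.

    var_prop  : fraction of columns that receive missing values
    cut_range : bounds of the uniform distribution for the per-column cut-off,
                interpreted here (an assumption) as a quantile level
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    out = [list(row) for row in data]
    chosen = rng.sample(range(n_cols), round(var_prop * n_cols))
    for j in chosen:
        q = rng.uniform(*cut_range)
        column = sorted(row[j] for row in data)
        # Empirical quantile: roughly a fraction q of the column lies below it.
        threshold = column[int(q * n_rows)]
        for i in range(n_rows):
            if data[i][j] < threshold:
                out[i][j] = None  # None plays the role of NA
    return out, sorted(chosen)


# Toy usage: 50 observations, 10 variables, 30% of the variables censored.
toy_rng = random.Random(1)
toy = [[toy_rng.gauss(0, 1) for _ in range(10)] for _ in range(50)]
censored, cols = generate_mnar(toy, var_prop=0.3)
```

Because only values below the threshold are removed, every missing entry in a censored column is smaller than every observed entry of that column, which is the defining property of left-censoring.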

2.3 Prediction model


A prediction model was employed for the prediction of missing values by setting a
targeted missing variable as outcome and other variables as predictors. Different
prediction models, e.g., linear regression, elastic net (Zou and Hastie, 2005),
regression trees (Breiman et al., 1984) and random forest (Breiman, 2001), etc. could
be embedded in our imputation framework. Elastic net was applied in our approach as
an ideal prediction model considering its stability, accuracy, and efficiency. This
model is a regularized regression that combines the L1 and L2 penalties of the
LASSO (Tibshirani, 1996) and ridge (Hoerl and Kennard, 1970) methods. The
estimates of the regression coefficients in elastic net are defined as

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left( \lVert y - X\beta \rVert^2 + \lambda \left[ (1-\alpha)\lVert\beta\rVert_2^2/2 + \alpha\lVert\beta\rVert_1 \right] \right)$$

The L2 penalty $(1-\alpha)\lVert\beta\rVert_2^2/2$ improves the model’s robustness by
controlling the multicollinearity among variables, which is widespread in high-dimensional
–omics data. The L1 penalty $\alpha\lVert\beta\rVert_1$ controls the number of predictors by
assigning zero coefficients to the "unnecessary" predictors. From a Bayesian point of
view, the regularization is a mixture of Gaussian and Laplacian prior distributions on the
coefficients, which pulls the full model of maximum likelihood estimates
towards the null model of the prior coefficient distribution, thus controlling the
risk of overfitting and increasing model robustness. The R package glmnet was used for
the elastic net. We set the hyperparameters $\lambda$ to 0.01 (default setting for
high-dimensional data) and $\alpha$ to 0.5 (an equal mixture of LASSO and ridge
penalties) (Friedman et al., 2015).
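
To make the roles of the two penalties concrete, the following is a minimal pure-Python sketch of an elastic net fitted by cyclic coordinate descent. It is not the glmnet implementation used in the paper. It assumes centred data without an intercept and scales the squared loss by 1/(2n), so its λ is not directly comparable to the λ in the formula above; the function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam=0.01, alpha=0.5, n_sweeps=200):
    """Minimise (1/(2n))*||y - Xb||^2 + lam*((1-alpha)/2*||b||_2^2 + alpha*||b||_1).

    X is a list of rows, y a list; both are assumed centred (no intercept).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy usage: y depends only on the first of two orthogonal predictors.
X_toy = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y_toy = [2.0, -2.0, 2.0, -2.0]
beta_toy = elastic_net(X_toy, y_toy)
```

In the toy example the L1 term leaves the coefficient of the irrelevant predictor at exactly zero, while the L2 term shrinks the relevant coefficient slightly below 2.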
2.4 Gibbs sampler
The Gibbs sampler is a Markov chain Monte Carlo (MCMC) technique that sequentially
updates each parameter while the others are held fixed. It can be used to generate posterior
samples. For each missing variable in the dataset, we applied a Gibbs sampler to
impute the missing values by sampling from a truncated normal distribution with the
prediction model's fitted value as the mean and the root mean square deviation (RMSD) of
the missing part as the standard deviation, truncated by specified cut-points. Assume
we have an n × p data matrix X = (X1, X2, X3, …, Xp) with only one variable Xj
containing left-censored missing values. We denote Xj as $y$, the missing part as $y_m$
with length m, the non-missing part as $y_f$ with length f, and the rest of the matrix as
$X_{-j}$. We can then set the lower truncation point lo to -∞ (centralized data) or 0 (original
data) and the upper truncation point hi to the minimum value of $y_f$ or a given LOQ. The
truncation bounds ensure that imputation results are constrained within [lo, hi]. The Gibbs
sampler approach can then be described by the following steps:
Step-1 (initialization): we initialize the missing values (with QRILC in our case) and get $y$;
Step-2 (prediction): we then build a prediction model (elastic net in our case):
$y \sim X_{-j}$;
Step-3 (estimation): based on the prediction model, we get the predicted value $\hat{y}$ and
the root mean square deviation (RMSD) of the missing part

$$\sigma = \sqrt{\frac{\sum_{i=1}^{m} \left(y_{m_i} - \hat{y}_{m_i}\right)^2}{m}}$$

where $y_{m_i}$ and $\hat{y}_{m_i}$ are the ith initialized/imputed value and fitted value, respectively;
Step-4 (sampling): we draw a sample $\tilde{y}_{m_i}$ from a normal distribution with mean
$\hat{y}_{m_i}$ and standard deviation $\sigma$, truncated to [lo, hi], for the ith missing
element and update $y$.

We iteratively repeat step-2 to step-4 and update Xj.
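
The four steps can be illustrated with a short, self-contained Python sketch. It is a simplification under stated assumptions rather than the authors' R implementation: a one-predictor least-squares line stands in for the elastic net, the missing values are initialized at the upper truncation point instead of with QRILC, and truncated normal draws use inverse-CDF sampling. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its cdf and inverse cdf


def truncated_normal(mu, sigma, lo, hi, rng):
    """Draw from N(mu, sigma^2) restricted to [lo, hi] by inverse-CDF sampling."""
    sigma = max(sigma, 1e-8)  # guard against a zero standard deviation
    a = 0.0 if lo == -math.inf else _STD.cdf((lo - mu) / sigma)
    b = 1.0 if hi == math.inf else _STD.cdf((hi - mu) / sigma)
    u = min(max(rng.uniform(a, b), 1e-12), 1 - 1e-12)  # keep inv_cdf inside (0, 1)
    return min(max(mu + sigma * _STD.inv_cdf(u), lo), hi)


def fit_line(x, y):
    """Ordinary least squares y ~ a + b*x; a stand-in for the elastic net."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
    return my - b * mx, b


def gibbs_impute(x, y, lo=-math.inf, hi=None, n_iter=100, seed=0):
    """Impute the None entries of y from predictor x, following steps 1 to 4."""
    rng = random.Random(seed)
    miss = [i for i, v in enumerate(y) if v is None]
    if not miss:
        return list(y)
    if hi is None:
        hi = min(v for v in y if v is not None)  # minimum observed value
    y = [hi if v is None else v for v in y]      # step 1 (simplified initialization)
    for _ in range(n_iter):
        a, b = fit_line(x, y)                    # step 2: prediction model
        fitted = [a + b * x[i] for i in miss]
        sigma = math.sqrt(                       # step 3: RMSD of the missing part
            sum((y[i] - f) ** 2 for i, f in zip(miss, fitted)) / len(miss))
        for i, f in zip(miss, fitted):           # step 4: truncated normal draw
            y[i] = truncated_normal(f, sigma, lo, hi, rng)
    return y


# Toy usage: censor the 15 smallest of 60 values, then impute them.
toy_rng = random.Random(1)
x_toy = [toy_rng.gauss(0, 1) for _ in range(60)]
truth = [2 * xi + toy_rng.gauss(0, 0.3) for xi in x_toy]
cut = sorted(truth)[15]
y_cens = [None if t < cut else t for t in truth]
imputed = gibbs_impute(x_toy, y_cens)
```

Every imputed value stays at or below the minimum observed value, which is the constraint that the truncation bounds are meant to enforce.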
2.5 GSimp framework
A whole data matrix X = (X1, X2, X3, …, Xp) contains k (k ≤ p)
left-censored missing variables. We present our imputation framework as the following
algorithm.
Algorithm: Gibbs sampler based left-censored missing value imputation approach

Require: X an n × p data matrix, iters_all the number of iterations for imputing the
whole matrix X, iters_each the number of iterations for imputing each missing
variable, a vector of upper limits U (+∞ for non-missing variables) and a vector of
lower limits L (-∞ for non-missing variables) with length p.

1. $\tilde{X}$ ← initialize the missing values for X;
2. K ← vector of indices of missing variables in X with increasing amount of
   missing values;
3. for 1:iters_all do
4.   for j in K do
5.     $y'$ ← $\tilde{X}_j$; $y'$ can be divided into two parts: $y'_m$ is a vector of the
       imputed part (original missing part) with length m and $y'_f$ is a vector
       of the non-missing part with length f, where n = m + f;
6.     $X'$ ← $\tilde{X}_{-j}$, which represents the matrix with the jth column removed;
7.     lo ← $L_j$ and hi ← $U_j$;
8.     for 1:iters_each do
9.       Gibbs sampler step 2 to 4;
10.    end for
11.    Update $\tilde{X}_j$;
12.  end for
13. end for
14. return $\tilde{X}$
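
The control flow of the algorithm (ordering variables by their number of missing values, then nesting the per-variable iterations inside the whole-matrix iterations) can be sketched as below. This Python sketch only shows the loop structure: `update_column` is a hypothetical callback standing in for one pass of Gibbs sampler steps 2 to 4, the initialization uses the upper limit rather than QRILC, and the demo callback is a deterministic placeholder, not a sampler.

```python
def gsimp_outer_loop(matrix, upper, lower, update_column, iters_all=10, iters_each=50):
    """Sequentially refine missing entries, visiting variables with fewer missing values first.

    matrix        : list of rows, with None marking missing entries
    upper, lower  : per-variable truncation points U and L
    update_column : hypothetical callback update_column(data, j, rows, lo, hi) that
                    performs one pass of steps 2 to 4 for column j, in place
    """
    n_cols = len(matrix[0])
    missing_rows = {j: [i for i, row in enumerate(matrix) if row[j] is None]
                    for j in range(n_cols)}
    # Line 1: initialize missing values (simplified: upper limit instead of QRILC).
    data = [[upper[j] if v is None else v for j, v in enumerate(row)] for row in matrix]
    # Line 2: indices of missing variables, ordered by increasing number of missing values.
    order = sorted((j for j in range(n_cols) if missing_rows[j]),
                   key=lambda j: len(missing_rows[j]))
    for _ in range(iters_all):                  # line 3
        for j in order:                         # line 4
            for _ in range(iters_each):         # line 8
                update_column(data, j, missing_rows[j], lower[j], upper[j])  # line 9
    return data, order


def halfway_update(data, j, rows, lo, hi):
    """Placeholder updater (not a Gibbs sampler): move imputed values halfway towards lo."""
    for i in rows:
        data[i][j] = (data[i][j] + lo) / 2


toy = [[1.0, None, 3.0], [None, None, 2.0], [2.0, 4.0, 5.0]]
result, order = gsimp_outer_loop(toy, upper=[1.0, 4.0, 2.0], lower=[0.0, 0.0, 0.0],
                                 update_column=halfway_update, iters_all=1, iters_each=2)
```

In the toy matrix the first column has one missing value and the second has two, so the first column is visited first; the third column has no missing values and is never updated.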
2.6 Other imputation approaches
Three other left-censored missing value imputation/substitution methods were applied in
our study for performance comparison:
• kNN-TN (Truncation k-nearest neighbors imputation) (Shah et al., 2017): this
method applies a Newton-Raphson (NR) optimization to estimate the truncated
mean and standard deviation. Then, Pearson correlation is calculated based on
the standardized data, followed by correlation-based kNN imputation.
• QRILC (Quantile Regression Imputation of Left-Censored data) (Lazar, 2015):
this method imputes missing elements by randomly drawing from a truncated
distribution estimated by a quantile regression. The R package imputeLCMD was
applied for this imputation approach.
• HM (Half of the Minimum): this method replaces missing elements with half of
the minimum of non-missing elements in the corresponding variable.
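
Of the three comparison methods, HM is simple enough to state in a few lines. The Python sketch below is illustrative only (the function name is hypothetical, and None plays the role of NA); it also makes visible why HM collapses the variance of the missing part, since every missing element receives the same value.

```python
def half_min_impute(column):
    """Replace None entries with half of the minimum observed value of the column."""
    observed = [v for v in column if v is not None]
    fill = min(observed) / 2
    return [fill if v is None else v for v in column]


hm_example = half_min_impute([4.0, None, 2.0, None])
```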

2.7 Assessments of performance


The assessments of imputation performance were conducted using an imputation
evaluation pipeline from our previous study with both unlabeled and labeled
measurements (Wei et al., 2017), which is accessible through:
https://github.com/WandeRum/MVI-evaluation. Unlabeled measurements include the
normalized root mean squared error (NRMSE)-based sum of ranks (SOR) and principal
component analysis (PCA)-Procrustes analysis, while labeled measurements include
correlation analysis for univariate results and partial least squares (PLS)-Procrustes
analysis. The R package vegan was applied for
Procrustes analysis (Oksanen, 2015) and ropls was applied for PLS analysis
(Thévenot et al., 2015).
Furthermore, we evaluated the impacts of different imputation methods on the
statistical sensitivity of detecting biological variances. On the simulation dataset, we
calculated p-values from Student’s t-tests between the two groups for the original as well as
the imputed datasets. We marked a set S as the real differential variables at a significance
level of p-cutoff (e.g., 0.05) from the original simulation data, and a set S’ as the detected
differential variables at the same significance level from the imputed simulation data. Then
we calculated the true positive rate $TPR = \#(S \cap S') / \#S$ to evaluate the effects of
different imputation methods in terms of detecting differential variables.
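
The TPR definition translates directly into set operations. The Python sketch below is illustrative; it takes the p-values as given (the t-tests themselves are not computed here) and assumes that "significant" means a p-value strictly below the cut-off, which the text does not specify. The function name is hypothetical.

```python
def true_positive_rate(p_original, p_imputed, cutoff=0.05):
    """TPR = #(S ∩ S') / #S, with S and S' the variables significant before and after imputation."""
    s = {i for i, p in enumerate(p_original) if p < cutoff}
    s_prime = {i for i, p in enumerate(p_imputed) if p < cutoff}
    if not s:
        return float("nan")  # undefined when no variable is truly differential
    return len(s & s_prime) / len(s)


# Toy usage: variables 0, 2 and 3 are truly differential; imputation keeps 0 and 3.
tpr_example = true_positive_rate([0.001, 0.2, 0.03, 0.04], [0.002, 0.01, 0.2, 0.03])
```

Variable 1 is significant only after imputation, so it counts as a false positive and does not enter the TPR.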



3 RESULTS
3.1 Gibbs sampler in GSimp

A variable containing missing elements from the FFA dataset was randomly selected to
track the sequence of corresponding parameters and estimates across the first 500
iterations out of a total of 2000 (100 × 20) iterations using GSimp. From Figure 1, we
can observe that both the fitted value $\hat{y}$ and the sample value $\tilde{y}$ converge
after a number of iterations, and the standard deviation estimate $\sigma$ drops to a steady
state with small values. In addition, the upper bound on the distribution of $\tilde{y}$
indicates that it was drawn from a truncated normal distribution.

3.2 Imputation comparisons

We evaluated four different MNAR imputation/substitution methods on the FFA and BA
targeted metabolomics datasets and the simulation dataset. First, we measured the imputation
performance using label-free approaches. SOR was used to measure the imputation
accuracy regarding the imputed values of each missing variable. From the upper panel
of Figure 2, we can observe that GSimp has the best performance, with the lowest
SOR across all numbers of missing variables in both the FFA and BA datasets. To
measure the extent of imputation-induced distortion of the observation distributions,
PCA-Procrustes analysis was conducted between the original data and the imputed data.
The lower panel of Figure 2 shows that GSimp has the lowest Procrustes sum of
squared errors compared to the other methods, which means that GSimp preserved the overall
observation distribution of the original dataset with the least distortion.

Then, we measured the imputation performance with binary labels provided. We
compared the results of univariate and multivariate analyses for the imputed and original
datasets. Since this is a case-control study, Student’s t-tests were applied for univariate
analyses. We then compared the results by calculating Pearson’s correlation between
log-transformed p-values calculated from the imputed and original data for the missing
variables. Again, GSimp performs best, with the highest correlations among the four
methods (upper panel of Figure 3) across different numbers of missing variables,
which implies that GSimp preserves the most biological variation with respect to the
univariate analysis results. For the multivariate analyses, we applied PLS-DA to distinguish
the group differences. Similarly, we conducted PLS-Procrustes analysis, in which PLS was
employed as a supervised dimension reduction technique. The lower panel of Figure 3
demonstrates that GSimp best restores the original observation distribution, with
the lowest Procrustes sum of squared errors among the four imputation methods.
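
The univariate comparison above reduces to a Pearson correlation between two vectors of log-transformed p-values. The following Python sketch is illustrative (the function name is hypothetical). The base of the logarithm is not stated in the text; base 10 is used here, and the choice does not matter because Pearson correlation is unchanged when either vector is multiplied by a positive constant.

```python
import math


def log_p_correlation(p_original, p_imputed):
    """Pearson correlation between log10 p-values from original and imputed data."""
    a = [math.log10(p) for p in p_original]
    b = [math.log10(p) for p in p_imputed]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy usage: squaring every p-value doubles its logarithm, so the correlation stays at 1.
p_orig = [0.5, 0.04, 0.001, 0.2]
r_same_order = log_p_correlation(p_orig, [p ** 2 for p in p_orig])
r_reversed = log_p_correlation([0.1, 0.01, 0.001], [0.001, 0.01, 0.1])
```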

On the simulation dataset, we compared QRILC, kNN-TN, and GSimp using the same
approaches. Consistent results were observed (Supplemental figure 1): GSimp
presents the best performance on the simulation dataset, with the lowest SOR and
PCA/PLS-Procrustes sum of squared errors and the highest correlation of univariate
analysis results. Moreover, to examine the influence of different imputation methods
on statistical power, we calculated the TPR as the capacity to detect
differential variables in the differently imputed datasets. Again, with p-cutoffs of both
0.05 and 0.01, GSimp shows the overall highest TPR across different numbers of missing
variables (Figure 4). This implies that GSimp impairs the sensitivity to the least extent
among the three methods, which is reasonable since GSimp also keeps the highest correlation
of p-values in the previous comparisons.

4 DISCUSSION

The purpose of this study is to develop a left-censored missing value imputation
approach for targeted metabolomics data analysis. We evaluated GSimp against three
other imputation methods (i.e., kNN-TN, QRILC, and HM) and found that
GSimp was superior to the others under different evaluation methods. To illustrate the
performance of GSimp, we randomly selected one variable containing missing values
from the FFA dataset (Figure 5) to compare the imputed values and the original values.
Although determined value substitutions (e.g., HM) are widely used by researchers in
the field of metabolomics, our results indicate that HM can severely distort the
data distribution (upper left panel of Figure 5), thus impairing subsequent analyses. In
comparison, QRILC preserves the overall data distribution and variances (upper right panel
of Figure 5). However, random values can be generated by this approach since
QRILC imputes each missing variable independently without utilizing the predictive
information from other variables. The statistical learning based method, kNN-TN, applies
a correlation based kNN algorithm with the parameters of missing variables estimated
from truncated normal distributions. This method utilizes the information of variables
highly correlated with the targeted missing variable and thus keeps a linear trend between
original values and imputed values. However, since no constraint is applied to the
imputation, a right shift of the missing part might occur, causing imputed values to
exceed the truncation point (lower left panel of Figure 5). In contrast, GSimp utilizes
the predictive information of other variables by employing a prediction model and
simultaneously holds a truncated normal distribution for each missing element, which
ensures a favorable linear trend between imputed and original values as well as a
reasonable bound for the imputed values (lower right panel of Figure 5).

In our approach, a truncated normal distribution was used to constrain the
imputation results in the Gibbs sampler steps. We applied the minimum observed value of
each missing variable as an informative upper truncation point and -∞ as a non-informative
lower truncation point, considering the situation of left-censored missing values. Other
values could also be applied in real-world metabolomics analyses; for example, a known LOQ
of a metabolite can be set as the upper truncation point. Additionally, when the signal
intensity of a certain compound exceeds the upper limit of the quantification range or
saturates during instrument analysis, an informative lower truncation point could
correspondingly be applied for the right-censored missing value. Moreover, when
non-informative bounds are applied for both the upper and lower limits (e.g., +∞, -∞),
GSimp could be extended to the situation of MCAR/MAR. With the flexible
usage of upper and lower limits, our approach may provide a versatile and powerful
imputation technique for different missing types. Other –omics datasets with missing
values (especially MNAR), e.g., single cell RNA-sequencing data, could apply this
method with few modifications of our default settings. Thus, it is worthwhile to evaluate
our approach, GSimp, in other complex scenarios in the future.

Since GSimp employs an iterative Gibbs sampler, a large number of
iterations (iters_all=20, iters_each=100) is preferable for the convergence of
parameters. However, as we tested on the simulation dataset with different numbers of
iterations, far fewer iterations (iters_all=10, iters_each=50) do not severely affect
the imputation accuracy (Supplemental figure 2). Across iterations over the whole data
matrix, we applied a sequential imputation procedure for missing variables, from the
variable with the fewest missing values to the one with the most. Such a sequential approach
improves imputation performance compared to a parallel imputation approach.

5 CONCLUSION
A practical left-censored missing value imputation method is needed in the field of
metabolomics. We developed a new imputation approach, GSimp, that outperforms the
traditional determined value substitution method (HM) and other approaches (QRILC
and kNN-TN) for MNAR situations. GSimp utilizes the predictive information of
other variables and holds a truncated normal distribution for each missing element
simultaneously by embedding a prediction model into the Gibbs sampler framework.
With proper modifications of the parameter settings, e.g., truncation points, GSimp
may be applicable to different types of missing values and to different -omics
studies, and thus deserves further exploration in the future.

REFERENCES
Breiman,L. et al. (1984) Classification and Regression Trees.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Butte,N.F. et al. (2015) Global metabolomic profiling targeting childhood obesity in
the Hispanic population. Am. J. Clin. Nutr., 102, 256–267.
Friedman,A.J. et al. (2015) Lasso and Elastic-Net Regularized Generalized Linear
Models. Available online
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf. (Verified 29 July
2015).
Gelman,A. and Hill,J. (2006) Data analysis using regression and
multilevel/hierarchical models.
Guo,L. et al. (2015) Plasma metabolomic profiles enhance precision medicine for
volunteers of normal health. Proc. Natl. Acad. Sci., 112, E4901–E4910.
Hastie,T. et al. (1999) Imputing missing data for gene expression arrays. Tech. Report,
Div. Biostat. Stanford Univ., 1–9.
Hoerl,A.E. and Kennard,R.W. (1970) Ridge Regression: Biased Estimation for
Nonorthogonal Problems. Technometrics, 12, 55–67.
Hrydziuszko,O. and Viant,M.R. (2012) Missing values in mass spectrometry based
metabolomics: An undervalued step in the data processing pipeline.
Metabolomics, 8, 161–174.
Katajamaa,M. et al. (2006) MZmine: toolbox for processing and visualization of mass
spectrometry based molecular profile data. Bioinformatics, 22, 634–6.
Kessler,N. et al. (2013) MeltDB 2.0-advances of the metabolomics software system.
Bioinformatics, 29, 2452–2459.
Lazar,C. et al. (2016) Accounting for the Multiple Natures of Missing Values in
Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.
J. Proteome Res., 15, 1116–1125.
Lei,S. et al. (2017) The ratio of dihomo-γ-linolenic acid to deoxycholic acid species is
a potential biomarker for the metabolic abnormalities in obesity. FASEB J.,
fj.201700055R.

Liu,J.-J. et al. (2016) Profiling of plasma metabolites suggests altered mitochondrial
fuel usage and remodelling of sphingolipid metabolism in individuals with type 2
diabetes and kidney disease. Kidney Int. Reports, 2, 470–480.
Luedemann,A. et al. (2012) TagFinder: Preprocessing software for the fingerprinting
and the profiling of gas chromatography-mass spectrometry based metabolome
analyses. Methods Mol. Biol., 860, 255–286.
Mak,T.D. et al. (2014) MetaboLyzer: A novel statistical workflow for analyzing
postprocessed LC-MS metabolomics data. Anal. Chem., 86, 506–513.
Ni,Y. et al. (2015) Circulating Unsaturated Fatty Acids Delineate the Metabolic Status
of Obese Individuals. EBioMedicine, 2, 1513–1522.
Oksanen,J. (2015) Multivariate Analysis of Ecological Communities in R: vegan
tutorial.
Shah,J.S. et al. (2017) Distribution based nearest neighbor imputation for truncated
high dimensional data with applications to pre-clinical and clinical metabolomics
studies. BMC Bioinformatics, 18, 114.
Stacklies,W. et al. (2007) pcaMethods - A bioconductor package providing PCA
methods for incomplete data. Bioinformatics, 23, 1164–1167.
Stekhoven,D.J. and Bühlmann,P. (2012) Missforest-Non-parametric missing value
imputation for mixed-type data. Bioinformatics, 28, 112–118.
Thévenot,E.A. et al. (2015) Analysis of the Human Adult Urinary Metabolome
Variations with Age, Body Mass Index, and Gender by Implementing a
Comprehensive Workflow for Univariate and OPLS Statistical Analyses. J.
Proteome Res., 14, 3322–3335.
Tibshirani,R. (1996) Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc.
B, 58, 267–288.
Troyanskaya,O. et al. (2001) Missing value estimation methods for DNA microarrays.
Bioinformatics, 17, 520–525.
Wei,R. et al. (2017) Missing Value Imputation Approach for Mass
Spectrometry-based Metabolomics Data. bioRxiv.
Xia,J. et al. (2009) MetaboAnalyst: A web server for metabolomic data analysis and
interpretation. Nucleic Acids Res., 37.
Xia,J. et al. (2012) MetaboAnalyst 2.0-a comprehensive server for metabolomic data
analysis. Nucleic Acids Res., 40.
Xia,J. et al. (2015) MetaboAnalyst 3.0-making metabolomics more meaningful.
Nucleic Acids Res., 43, W251–W257.
Zou,H. and Hastie,T. (2005) Regularization and variable selection via the elastic net. J.
R. Stat. Soc. Ser. B Stat. Methodol., 67, 301–320.

Figure legends

Fig. 1. Sequential parameter updating in GSimp

The first 500 iterations out of a total of 2000 (100×20) iterations using GSimp, where
ŷ, ỹ and σ represent the fitted value, sample value and standard deviation,
respectively.

Fig. 2. Evaluations of different imputation methods using unlabeled approaches

SOR on the FFA dataset (upper left) and BA dataset (upper right) across different
numbers of missing variables for four imputation methods: HM (red circle),
QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross).
PCA-Procrustes sum of squared errors on the FFA dataset (lower left) and BA dataset
(lower right) across different numbers of missing variables for the same four
imputation methods, with the same markers.

Fig. 3. Evaluations of different imputation methods using labeled approaches

Pearson's correlation between log-transformed p-values of Student's t-tests on the FFA
dataset (upper left) and BA dataset (upper right) across different numbers of
missing variables for four imputation methods: HM (red circle), QRILC (green
triangle), GSimp (blue square), and kNN-TN (purple cross). PLS-Procrustes sum of
squared errors on the FFA dataset (lower left) and BA dataset (lower right) across
different numbers of missing variables for the same four imputation methods, with
the same markers.
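
The correlation metric in this legend can be illustrated with a minimal sketch. It assumes the per-variable t-test p-values have already been computed elsewhere, once on the original data and once on the imputed data; the function names and the example p-values are hypothetical and are not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log_p_correlation(p_original, p_imputed):
    """Correlate log-transformed p-values from original and imputed data.

    The legend does not state the log base; base 10 is an assumption here,
    and the Pearson correlation does not depend on that choice.
    """
    log_orig = [math.log10(p) for p in p_original]
    log_imp = [math.log10(p) for p in p_imputed]
    return pearson(log_orig, log_imp)


# Hypothetical per-variable p-values (illustrative numbers only).
p_original = [0.001, 0.02, 0.30, 0.75]
p_imputed = [0.002, 0.03, 0.25, 0.80]
r = log_p_correlation(p_original, p_imputed)
```

A value of `r` close to 1 would indicate that imputation preserved the ranking and scale of the univariate test results.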

Figure 4. Evaluations of different imputation methods using TPR for various p-cutoffs on the simulation dataset

TPR across different numbers of missing variables for three imputation
methods: QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross),
at p-cutoffs of 0.05 (left panel) and 0.01 (right panel).
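
As a rough illustration of how a TPR at a given p-cutoff could be computed, the sketch below assumes that the variables significant in the original (complete) data are treated as the true positives and that a variable counts as detected when it is also significant after imputation. That definition, the function name, and the example p-values are assumptions made for illustration, not details stated in this legend.

```python
def true_positive_rate(p_original, p_imputed, cutoff):
    """Fraction of variables significant in the original data (p < cutoff)
    that are still significant after imputation (assumed TPR definition)."""
    truly_significant = {i for i, p in enumerate(p_original) if p < cutoff}
    if not truly_significant:
        raise ValueError("no significant variables in the original data")
    detected = {i for i, p in enumerate(p_imputed) if p < cutoff}
    return len(truly_significant & detected) / len(truly_significant)


# Hypothetical per-variable p-values (illustrative numbers only).
p_original = [0.001, 0.004, 0.03, 0.20, 0.60]
p_imputed = [0.002, 0.02, 0.04, 0.01, 0.70]

tpr_05 = true_positive_rate(p_original, p_imputed, 0.05)  # 3 of 3 recovered
tpr_01 = true_positive_rate(p_original, p_imputed, 0.01)  # 1 of 2 recovered
```

With a stricter cutoff, fewer variables count as true positives, so a single lost detection moves the TPR more.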

Figure 5. Comparisons of imputed values and original values on an example variable

Scatter plots of imputed values (X-axis) and original values (Y-axis) for one example
missing variable, with non-missing elements shown as blue dots and missing
elements as red dots, for four imputation methods: HM (upper left), QRILC
(upper right), kNN-TN (lower left), and GSimp (lower right). Rug plots show the
distributions of imputed values and original values.

Supplemental Figure 1. Evaluations of different imputation methods on the simulation dataset

SOR (upper left), PCA-Procrustes sum of squared errors (upper right), Pearson's
correlation between log-transformed p-values of Student's t-tests (lower left), and
PLS-Procrustes sum of squared errors (lower right) on the simulation dataset across
different numbers of missing variables for three imputation methods: QRILC
(green triangle), GSimp (blue square), and kNN-TN (purple cross).

Supplemental Figure 2. Evaluations of different numbers of iterations using GSimp on the simulation dataset

SOR on the simulation dataset across different numbers of missing variables for
four iteration settings: iters_each=50 and iters_all=20 (red circle),
iters_each=100 and iters_all=20 (green triangle), iters_each=50 and iters_all=10
(blue square), iters_each=100 and iters_all=10 (purple cross).

[Fig. 1 image: three axes labeled ŷ, ỹ and σ.]
[Fig. 2 image: panels FFA and BA; x-axis: Number of missing variables; y-axes: SOR (upper panels) and PCA-Procrustes SS. (lower panels); legend: HM, QRILC, GSimp, kNN-TN.]
[Fig. 3 image: panels FFA and BA; x-axis: Number of missing variables; y-axes: Correlation (upper panels) and PLS-Procrustes SS. (lower panels); legend: HM, QRILC, GSimp, kNN-TN.]
[Figure 4 image: panels p-cutoff=0.05 and p-cutoff=0.01; x-axis: Number of missing variables; y-axis: TPR; legend: QRILC, GSimp, kNN-TN.]
[Figure 5 image: panels HM, QRILC, kNN-TN and GSimp; x-axis: Imputed Value; y-axis: Original Value.]
