Gsimp: A Gibbs Sampler Based Left-Censored Missing Value Imputation Approach For Metabolomics Studies
Runmin Wei1,2,#, Jingye Wang1,#,*, Erik Jia3, Tianlu Chen4, Yan Ni1, Wei Jia1
1 University of Hawaii Cancer Center, Honolulu, HI 96813, USA
2 Department of Molecular Biosciences and Bioengineering, University of Hawaii at Manoa, Honolulu, HI 96822, USA
3 Punahou School, Honolulu, HI 96822, USA
4 Shanghai Key Laboratory of Diabetes Mellitus and Center for Translational Medicine, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai 200233, China
# These authors contributed equally to this work.
* Corresponding Author: Jingye Wang, MD, MPH, Email: [email protected]
bioRxiv preprint first posted online Aug. 26, 2017; doi: http://dx.doi.org/10.1101/177410. The copyright holder for this preprint
(which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
All rights reserved. No reuse allowed without permission.
ABSTRACT
Motivation: Left-censored missing values commonly exist in targeted metabolomics
datasets and can be considered as missing not at random (MNAR). Improper data
processing procedures for missing values will cause adverse impacts on subsequent
statistical analyses. However, few imputation methods have been developed and
applied to the situation of MNAR in the field of metabolomics. Thus, a practical
left-censored missing value imputation method is urgently needed.
1 INTRODUCTION
Missing values are commonly observed in mass spectrometry (MS) based
metabolomics datasets. Many statistical methods require a complete dataset, which
makes missing data an inevitable problem for subsequent data analysis. Generally,
there are three types of missing values: missing not at random (MNAR), missing at
random (MAR) and missing completely at random (MCAR) (Gelman and Hill, 2006;
Little and Rubin, 2002). Unexpected missing values are considered MCAR if they
originate from random errors and stochastic fluctuations during the data acquisition
process (e.g., incomplete derivatization or ionization). MAR assumes that the probability
of a variable being missing depends on other observed variables (Gelman and Hill,
2006; Little and Rubin, 2002). Thus, missing values due to suboptimal data
preprocessing, e.g., inaccurate peak detection and deconvolution of co-eluting
compounds, can be defined as MAR. Targeted metabolomics studies have been widely
used for the accurate quantification of specific groups of metabolites. In such studies,
missing values are usually caused by signal intensities lower than the limit of
quantification (LOQ) of a compound; such values are known as left-censored missing
values and can be classified as MNAR.
The processing of missing values has been developed and studied for MS data and
is an indispensable step in the metabolomics data processing pipeline (Hrydziuszko
and Viant, 2012). One simple but naïve solution is to substitute missing values with
determined values, such as zero, half of the minimum value (HM) or LOQ/c, where c
denotes a positive integer. Determined value substitutions, although commonly
applied for dealing with missing values in metabolomics studies (Guo et al., 2015;
Liu et al., 2016; Butte et al., 2015), can significantly affect subsequent statistical
analyses in different ways, e.g., by underestimating the variances of variables with
missing values, decreasing statistical power, or fabricating pseudo-clusters among
observations (Gelman and Hill, 2006). Advanced statistical imputation methods have
been developed for -omics studies, e.g., k-nearest neighbors (kNN) imputation
(Troyanskaya et al., 2001),
singular value decomposition (SVD) imputation (Hastie et al., 1999; Stacklies et al.,
2007), and random forest (RF) imputation (Stekhoven and Bühlmann, 2012). Several
metabolomics data analysis software tools provide different methods of dealing with
missing values (Mak et al., 2014; Katajamaa et al., 2006; Kessler et al., 2013;
Luedemann et al., 2012; Xia et al., 2015). MetaboAnalyst (Xia et al., 2009, 2012,
2015), a widely used metabolomics analysis toolkit, provides Probabilistic PCA
(PPCA), Bayesian PCA (BPCA) and SVD imputation. However, these methods
mainly target MCAR/MAR and are not suitable for MNAR.
Only a limited number of approaches for left-censored missing values have been
applied by researchers (Lazar et al., 2016; Shah et al., 2017). Quantile regression
imputation of left-censored data (QRILC) imputes missing data using random
draws from a truncated distribution with parameters estimated using quantile
regression (Lazar et al., 2016). Although this imputation preserves the overall distribution
of the missing parts better than determined value substitutions, it may produce random
results since no additional information is used to predict the missing parts. Another
imputation method recently developed for MNAR is k-nearest neighbor truncation
(kNN-TN) (Shah et al., 2017). This approach applies Maximum
Likelihood Estimators (MLE) for the means and standard deviations of missing
variables based on a truncated normal distribution. A Pearson correlation based
kNN imputation method is then applied to the standardized data. Although the authors
stated that kNN-TN could impute both MNAR and MAR, the imputed values are
entirely dependent on the nearest neighbors and no constraint is placed upon the
imputation. Thus, this approach might overestimate missing values.
multivariate analysis and sensitivity. Our findings indicate that GSimp is a robust
method to handle left-censored missing values in targeted metabolomics studies.
2 METHODS
2.1 Dataset
Diabetes datasets
We employed datasets from a study comparing serum metabolites between obese
subjects with diabetes mellitus (N=70) and healthy controls (N=130), where N
represents the number of observations. Dataset 1: a total of 42 free fatty acids (FFAs)
were identified and quantified in those participants in order to evaluate their FFA
profiles (Ni et al., 2015). Dataset 2: a total of 34 bile acids (BAs) were identified and
quantified in a similar way using a different analytical protocol (Lei et al., 2017).
Simulation dataset
For the simulation dataset, we first calculated the covariance matrix Cov based on the
whole diabetes dataset (P=76), where P represents the number of variables. Then we
generated two separate data matrices, each with 80 observations, from
multivariate normal distributions, representing two different biological groups. For
each data matrix, the sample mean of each variable was drawn from a normal
distribution N(0, 0.5) and Cov was kept using SVD. Then, the two data matrices were
stacked row-wise into a complete data matrix (N×P=160×76)
so that group differences were simulated and covariance was kept.
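A minimal Python sketch of this simulation, under stated assumptions: the diabetes covariance matrix is not available here, so a placeholder covariance is estimated from random data; 0.5 is treated as the standard deviation of the distribution from which the variable means are drawn; and the function name `simulate_two_groups` is illustrative only. NumPy's `Generator.multivariate_normal` factorizes the covariance by SVD when `method="svd"` is given.

```python
import numpy as np

def simulate_two_groups(cov, n_per_group=80, mean_sd=0.5, seed=0):
    """Draw two groups that share one covariance but have different mean vectors."""
    rng = np.random.default_rng(seed)
    p = cov.shape[0]
    groups = []
    for _ in range(2):
        # Assumption: each variable mean is drawn from a normal with sd = mean_sd.
        mu = rng.normal(0.0, mean_sd, size=p)
        groups.append(rng.multivariate_normal(mu, cov, size=n_per_group, method="svd"))
    data = np.vstack(groups)                 # rows of group 2 follow rows of group 1
    labels = np.repeat([0, 1], n_per_group)  # group membership of each observation
    return data, labels

# Placeholder covariance (hypothetical); in the study it comes from the diabetes data.
placeholder = np.random.default_rng(1).normal(size=(200, 76))
cov = np.cov(placeholder, rowvar=False)

X, group = simulate_two_groups(cov)
print(X.shape)
```

Stacking the two 80 × 76 matrices by rows gives the 160 × 76 matrix described above.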
2.2 MNAR generation
For the two real-world targeted metabolomics datasets, we generated a series of MNAR
datasets by using a missing proportion (number of missing variables/number of total
variables) from 0.1 to 0.6 in steps of 0.05, with the MNAR cut-off for each missing
variable drawn from a uniform distribution U(0.1, 0.5). The elements lower than the
corresponding cut-off were removed and replaced with NA. For the simulation dataset,
we generated a series of MNAR datasets by using a missing proportion from 0.1 to
0.8 in steps of 0.1, with the MNAR cut-off drawn from U(0.3, 0.6) for more rigorous
testing.
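The procedure can be sketched in Python as follows. This is an illustration, not the authors' code: it assumes that the cut-off drawn from the uniform distribution is a quantile level of each selected variable, and the name `make_mnar` and the lognormal placeholder data are hypothetical.

```python
import numpy as np

def make_mnar(data, missing_prop, cut_lo, cut_hi, seed=0):
    """Left-censor a random subset of variables below a per-variable cut-off."""
    rng = np.random.default_rng(seed)
    out = np.asarray(data, dtype=float).copy()
    n_vars = out.shape[1]
    n_missing_vars = int(round(missing_prop * n_vars))
    cols = rng.choice(n_vars, size=n_missing_vars, replace=False)
    for j in cols:
        # Assumption: the cut-off is a quantile level drawn from U(cut_lo, cut_hi).
        q = rng.uniform(cut_lo, cut_hi)
        threshold = np.quantile(out[:, j], q)
        out[out[:, j] < threshold, j] = np.nan  # NaN plays the role of NA
    return out, np.sort(cols)

# Placeholder data with the shape of the FFA dataset (200 observations, 42 variables).
complete = np.random.default_rng(1).lognormal(size=(200, 42))
for prop in np.arange(0.10, 0.601, 0.05):
    masked, cols = make_mnar(complete, prop, 0.1, 0.5)
    print(f"proportion {prop:.2f}: {len(cols)} variables censored")
```

Because only values below the threshold are removed, every missing element is known to lie below the smallest observed value of its variable, which is the defining property of left-censoring.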
we have an n × p data matrix X = (X1, X2, X3, …, Xp) with only one variable Xj
containing left-censored missing values. We denote Xj as y, the missing part as ym
with length m, the non-missing part as yf with length f, and the rest of the matrix as X-j.
We can then set the lower truncation point lo to -∞ (centralized data) or 0 (original
data) and the upper truncation point hi to the minimum value of yf or a given LOQ.
The truncation bounds ensure that imputation results are constrained within [lo, hi].
The Gibbs sampler approach can then be described by the following steps:
Step-1 (initialization): we initialize the missing values (using QRILC in our case) and get y;
Step-2 (prediction): we then build a prediction model (elastic net in our case):
y ~ X-j;
Step-3 (estimation): based on the prediction model, we get the predicted value ŷ and
the root mean square deviation (RMSD) of the missing part,
$\sigma = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}$, where
$y_i$ and $\hat{y}_i$ are the ith initialized/imputed value and fitted value, respectively;
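The sampling step that uses these estimates is not reproduced in this excerpt. Assuming, as Section 3.1 indicates, that each missing element is drawn from a normal distribution centered at its fitted value with standard deviation σ and truncated to [lo, hi], the density of a sampled value and a standard inverse-CDF way of drawing from it can be written as below. Here φ and Φ denote the standard normal density and CDF, and u is an auxiliary uniform variable introduced only for this illustration.

```latex
p(\tilde{y}_i \mid \hat{y}_i, \sigma, lo, hi)
  = \frac{\frac{1}{\sigma}\,\varphi\!\left(\frac{\tilde{y}_i - \hat{y}_i}{\sigma}\right)}
         {\Phi\!\left(\frac{hi - \hat{y}_i}{\sigma}\right)
          - \Phi\!\left(\frac{lo - \hat{y}_i}{\sigma}\right)},
  \qquad lo \le \tilde{y}_i \le hi

u \sim \mathrm{Uniform}\!\left(
        \Phi\!\left(\tfrac{lo - \hat{y}_i}{\sigma}\right),\;
        \Phi\!\left(\tfrac{hi - \hat{y}_i}{\sigma}\right)\right),
  \qquad \tilde{y}_i = \hat{y}_i + \sigma\,\Phi^{-1}(u)
```

With lo = -∞ the lower term Φ((lo - ŷ_i)/σ) is zero, so only the upper bound hi restricts the draw, which matches the left-censored setting.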
Require: X an n × p data matrix, iters_all the number of iterations for imputing the
whole matrix X, iters_each the number of iterations for imputing each missing
variable, a vector of upper limits U (+∞ for non-missing variables) and a vector of
lower limits L
5.       y ← Xj; y can be divided into two parts: ym is a vector of the missing part
         with length m and yf is a vector of the non-missing part with length f,
         where n = m + f;
6.       X-j ← the matrix X with the jth column removed;
7.       lo ← Lj and hi ← Uj;
8.       for 1:iters_each do
9.          Gibbs sampler steps 2 to 4;
10.      end for
11.      Update Xj;
12.   end for
13. end for
14. return X
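A rough Python sketch of the loop structure above follows. Several parts are assumptions rather than the authors' implementation: the two outer loops (over iters_all and over variables with missing values) are inferred from the three `end for` lines, ordinary least squares stands in for the elastic net, filling with the upper limit stands in for the QRILC initialization, and the sampling step is taken to be a truncated normal draw. The names `gsimp_sweeps` and `draw_truncated` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for inverse-CDF sampling

def draw_truncated(mean, sd, lo, hi, rng):
    """One draw from N(mean, sd^2) truncated to [lo, hi]."""
    sd = max(sd, 1e-8)                       # guard against a zero RMSD
    a = _STD.cdf((lo - mean) / sd)
    b = _STD.cdf((hi - mean) / sd)
    u = min(max(rng.uniform(a, b), 1e-12), 1.0 - 1e-12)
    # Clip to protect the bounds when the CDF values underflow.
    return float(np.clip(mean + sd * _STD.inv_cdf(u), lo, hi))

def gsimp_sweeps(X, L, U, iters_all=10, iters_each=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    missing_vars = np.flatnonzero(miss.any(axis=0))
    for j in missing_vars:                   # stand-in for the QRILC initialization
        X[miss[:, j], j] = U[j]
    for _ in range(iters_all):               # sweeps over the whole matrix
        for j in missing_vars:               # one variable at a time
            m = miss[:, j]
            y = X[:, j].copy()
            A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
            lo, hi = L[j], U[j]
            for _ in range(iters_each):
                beta = np.linalg.lstsq(A, y, rcond=None)[0]        # prediction
                y_hat = A @ beta
                sigma = np.sqrt(np.mean((y[m] - y_hat[m]) ** 2))   # RMSD of missing part
                y[m] = [draw_truncated(mu, sigma, lo, hi, rng) for mu in y_hat[m]]
            X[:, j] = y                      # update the jth variable
    return X

# Toy data: four correlated variables, two of them left-censored at their 20% quantile.
rng = np.random.default_rng(42)
full = rng.normal(size=(60, 1)) + 0.3 * rng.normal(size=(60, 4))
data = full.copy()
for j in (0, 2):
    data[full[:, j] < np.quantile(full[:, j], 0.2), j] = np.nan
U = np.where(np.isnan(data).any(axis=0), np.nanmin(data, axis=0), np.inf)
L = np.full(data.shape[1], -np.inf)
imputed = gsimp_sweeps(data, L, U, iters_all=3, iters_each=10)
print(np.isnan(imputed).sum())
```

Updating one variable at a time lets later variables be predicted from the current imputations of earlier ones, which is why the whole matrix is swept repeatedly.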
2.6 Other imputation approaches
Three other left-censored missing value imputation/substitution methods were included in
our study for performance comparison:
• kNN-TN (Truncation k-nearest neighbors imputation) (Shah et al., 2017): this
method applies a Newton-Raphson (NR) optimization to estimate the truncated
mean and standard deviation. Pearson correlation is then calculated based on
standardized data, followed by correlation-based kNN imputation.
• QRILC (Quantile Regression Imputation of Left-Censored data) (Lazar, 2015):
this method imputes missing elements by randomly drawing from a truncated
distribution estimated by a quantile regression. The R package imputeLCMD was
used for this imputation approach.
• HM (Half of the Minimum): This method replaces missing elements with half of
the minimum of non-missing elements in the corresponding variable.
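Of the three comparison methods, HM is simple enough to show in a few lines. The Python sketch below assumes missing elements are coded as NaN and that the data are on the original non-negative scale; the function name `half_min_impute` is illustrative.

```python
import numpy as np

def half_min_impute(data):
    """Replace each NaN with half of the minimum observed value of its variable."""
    out = np.asarray(data, dtype=float).copy()
    col_min = np.nanmin(out, axis=0)          # per-variable minimum, ignoring NaN
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_min[cols] / 2.0
    return out

example = np.array([[4.0, 10.0],
                    [np.nan, 12.0],
                    [6.0, np.nan],
                    [8.0, 20.0]])
filled = half_min_impute(example)
print(filled)
```

Every missing element of a variable receives the same value, which is one way to see why determined value substitution shrinks the variance of that variable.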
3 RESULTS
3.1 Gibbs sampler in GSimp
A variable containing missing elements from the FFA dataset was randomly selected to
track the sequence of corresponding parameters and estimates across the first 500
iterations out of a total of 2000 (100 × 20) iterations using GSimp. From Figure 1, we
can observe that both the fitted value ŷ and the sample value ỹ converge
over the iterations, and the standard deviation estimate σ drops to a steady state with
small values. In addition, the upper constraint on the distribution of ỹ indicates that
it was drawn from a truncated normal distribution.
and it implies that GSimp preserves the most biological variation in the univariate
analysis results. For the multivariate analyses, we applied PLS-DA to distinguish the
group differences. Similarly, we conducted a PLS-Procrustes analysis in which PLS was
employed as a supervised dimension reduction technique. The lower panel of Figure 3
demonstrates that GSimp best restores the original distribution of observations, with
the lowest Procrustes sum of squared errors among the four imputation methods.
On the simulation dataset, we compared QRILC, kNN-TN, and GSimp using the same
approaches. Consistent results were observed (Supplemental figure 1): GSimp
shows the best performance on the simulation dataset, with the lowest SOR and
PCA/PLS-Procrustes sum of squared errors and the highest correlation of univariate
analysis results. Moreover, to examine the influence of the different imputation
methods on statistical power, we calculated TPR as the capacity to detect
differential variables on the differently imputed datasets. Again, with p-cutoffs of both
0.05 and 0.01, GSimp shows the overall highest TPR across different numbers of missing
variables (Figure 4). This implies that GSimp impairs sensitivity to the least extent among
the three methods, which is reasonable since GSimp also keeps the highest correlation of
p-values in the previous comparisons.
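As a small illustration of the TPR calculation, the Python sketch below assumes that the truly differential variables are those whose p-value on the complete data falls below the cut-off, and that a variable counts as detected when its p-value on the imputed data also falls below it. The excerpt does not spell out this definition, and the p-values shown are made up for the example.

```python
def true_positive_rate(p_complete, p_imputed, cutoff=0.05):
    """Share of truly differential variables that remain significant after imputation."""
    truth = [p < cutoff for p in p_complete]      # assumption: truth from complete data
    detected = [p < cutoff for p in p_imputed]
    n_true = sum(truth)
    if n_true == 0:
        return float("nan")                       # TPR is undefined without positives
    hits = sum(t and d for t, d in zip(truth, detected))
    return hits / n_true

# Hypothetical p-values for five variables, before censoring and after imputation.
p_complete = [0.001, 0.03, 0.2, 0.004, 0.6]
p_imputed = [0.002, 0.08, 0.04, 0.009, 0.5]

tpr_05 = true_positive_rate(p_complete, p_imputed, cutoff=0.05)
tpr_01 = true_positive_rate(p_complete, p_imputed, cutoff=0.01)
print(tpr_05, tpr_01)
```

In this toy case two of the three variables significant at 0.05 are still detected after imputation, while both variables significant at 0.01 are retained.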
4 DISCUSSION
In our approach, a truncated normal distribution was used to constrain the
imputation results in the Gibbs sampler steps. We applied the minimum observed value of
the missing variable as an informative upper truncation point and -∞ as a non-informative
lower truncation point, considering the situation of left-censored missing values. Other values
could also be applied in real-world metabolomics analyses; for example, a known LOQ of a
metabolite can be set as an upper truncation point. Additionally, when the signal intensity
of a certain compound exceeds the upper limit of the quantification range or saturates
during instrument analysis, an informative lower truncation point could
correspondingly be applied for the right-censored missing value. Moreover, when
non-informative bounds for both upper and lower limits (e.g., +∞, -∞) are applied,
GSimp could be extended to the situation of MCAR/MAR. With the flexible
usage of upper and lower limits, our approach may provide a versatile and powerful
imputation technique for different missing types. Other -omics datasets with missing
values (especially MNAR), e.g., single cell RNA-sequencing data, could apply this
method with few modifications of our default settings. Thus, it is worth evaluating
our approach, GSimp, in other complex scenarios in the future.
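The choice of truncation points for the different missing types can be summarized in a small Python helper. This is only a sketch of the options discussed above: the function name, the `missing_type` labels and the `uloq` parameter (an upper limit of quantification) are hypothetical, the default lower bound for right-censoring is assumed by symmetry to be the maximum observed value, and -∞ is used as the lower bound for left-censoring as for centralized data.

```python
import math

def truncation_bounds(observed, missing_type, loq=None, uloq=None):
    """Return (lo, hi) for the truncated normal, given the observed values of a variable."""
    if missing_type == "left":       # left-censored (MNAR below LOQ)
        hi = loq if loq is not None else min(observed)
        return -math.inf, hi
    if missing_type == "right":      # right-censored (above quantification range)
        lo = uloq if uloq is not None else max(observed)
        return lo, math.inf
    if missing_type == "none":       # non-informative bounds for MCAR/MAR
        return -math.inf, math.inf
    raise ValueError(f"unknown missing_type: {missing_type!r}")

observed = [3.0, 5.0, 9.0]
print(truncation_bounds(observed, "left"))
print(truncation_bounds(observed, "left", loq=2.5))
print(truncation_bounds(observed, "right"))
print(truncation_bounds(observed, "none"))
```

Only the bounds change between the cases; the prediction and sampling steps are unaffected.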
5 CONCLUSION
A practical left-censored missing value imputation method is needed in the field of
metabolomics. We developed a new imputation approach, GSimp, that outperforms
the traditional determined value substitution method (HM) and other approaches (QRILC
and kNN-TN) for MNAR situations. GSimp utilizes the predictive information of
variables and maintains a truncated normal distribution for each missing element
simultaneously by embedding a prediction model into the Gibbs sampler framework.
With proper modifications of the parameter settings, e.g., the truncation points, GSimp
may be applicable to different types of missing values and to different -omics
studies, and thus deserves to be further explored in the future.
REFERENCES
Breiman,L. et al. (1984) Classification and Regression Trees.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Butte,N.F. et al. (2015) Global metabolomic profiling targeting childhood obesity in
the Hispanic population. Am. J. Clin. Nutr., 102, 256–267.
Friedman,A.J. et al. (2015) Lasso and Elastic-Net Regularized Generalized Linear
Models. Available online
https://cran.r-project.org/web/packages/glmnet/glmnet.pdf. (Verified 29 July
2015).
Gelman,A. and Hill,J. (2006) Data analysis using regression and
multilevel/hierarchical models.
Guo,L. et al. (2015) Plasma metabolomic profiles enhance precision medicine for
volunteers of normal health. Proc. Natl. Acad. Sci., 112, E4901–E4910.
Hastie,T. et al. (1999) Imputing missing data for gene expression arrays. Tech. Report,
Div. Biostat. Stanford Univ., 1–9.
Hoerl,A.E. and Kennard,R.W. (1970) Ridge Regression: Biased Estimation for
Nonorthogonal Problems. Technometrics, 12, 55–67.
Hrydziuszko,O. and Viant,M.R. (2012) Missing values in mass spectrometry based
metabolomics: An undervalued step in the data processing pipeline.
Metabolomics, 8, 161–174.
Katajamaa,M. et al. (2006) MZmine: toolbox for processing and visualization of mass
spectrometry based molecular profile data. Bioinformatics, 22, 634–6.
Kessler,N. et al. (2013) MeltDB 2.0-advances of the metabolomics software system.
Bioinformatics, 29, 2452–2459.
Lazar,C. et al. (2016) Accounting for the Multiple Natures of Missing Values in
Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.
J. Proteome Res., 15, 1116–1125.
Lei,S. et al. (2017) The ratio of dihomo-γ-linolenic acid to deoxycholic acid species is
a potential biomarker for the metabolic abnormalities in obesity. FASEB J.,
fj.201700055R.
Figure legends
The first 500 iterations out of a total of 2000 (100×20) iterations using GSimp, where
ŷ, ỹ and σ represent the fitted value, sample value and standard deviation,
respectively.
SOR on FFA dataset (upper left) and BA dataset (upper right) along with different
numbers of missing variables based on four imputation methods: HM (red circle),
QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross).
PCA-Procrustes sum of squared errors on FFA dataset (lower left) and BA dataset
(lower right) along with different numbers of missing variables based on four
imputation methods: HM (red circle), QRILC (green triangle), GSimp (blue square),
and kNN-TN (purple cross).
TPR along with different numbers of missing variables based on three imputation
methods: QRILC (green triangle), GSimp (blue square), and kNN-TN (purple cross),
at p-cutoffs of 0.05 (left panel) and 0.01 (right panel).
Scatter plots of imputed values (X-axis) and original values (Y-axis) on one example
missing variable, with non-missing elements shown as blue dots and missing
elements as red dots, based on four imputation methods: HM (upper left), QRILC
(upper right), kNN-TN (lower left), and GSimp (lower right). Rug plots show the
distributions of imputed values and original values.
SOR (upper left), PCA-Procrustes sum of squared errors (upper right), Pearson's
correlation between log-transformed p-values of Student’s t-tests (lower left), and
PLS-Procrustes sum of squared errors (lower right) on simulation dataset along with
different numbers of missing variables based on three imputation methods: QRILC
(green triangle), GSimp (blue square), and kNN-TN (purple cross).
SOR on simulation dataset along with different numbers of missing variables based
on four different numbers of iterations: iters_each=50 and iters_all=20 (red circle),
iters_each=100 and iters_all=20 (green triangle), iters_each=50 and iters_all=10
(blue square), iters_each=100 and iters_all=10 (purple cross).
[Figure: trace panels for ỹ, ŷ and σ.]
[Figure: panels for the FFA and BA datasets showing SOR, PCA-Procrustes SS., Correlation and PLS-Procrustes SS.; series: HM, QRILC, GSimp, kNN-TN.]
[Figure: two TPR panels; series: QRILC, GSimp, kNN-TN.]
[Figure: four scatter panels with Original Value on the Y-axis, including panels titled kNN-TN and GSimp.]