Advancing Material Property Prediction Using Physics-Informed Machine Learning Models for Viscosity
Abstract
In materials science, accurately computing properties such as viscosity, melting point, and glass transition temperature solely through physics-based models is challenging. Constructing data-driven machine learning (ML) models also poses challenges, especially in the materials science domain where data is limited. To address this, we integrate physics-informed descriptors from molecular dynamics (MD) simulations to enhance the accuracy and interpretability of ML models. Our current study focuses on accurately predicting viscosity in liquid systems using MD descriptors. In this work, we curated a comprehensive dataset of over 4000 small organic molecules' viscosities from scientific literature, publications, and online databases. This dataset enabled us to develop quantitative structure–property relationship (QSPR) models, consisting of descriptor-based and graph neural network models, to predict temperature-dependent viscosities across a wide range of viscosity values. The QSPR models reveal that including MD descriptors improves the prediction of experimental viscosities, particularly at the small dataset scale of fewer than a thousand data points. Furthermore, feature importance tools reveal that intermolecular interactions captured by MD descriptors are most important for viscosity predictions. Finally, the QSPR models can accurately capture the inverse relationship between viscosity and temperature for six battery-relevant solvents, some of which were not included in the original dataset. Our research highlights the effectiveness of incorporating MD descriptors into QSPR models, which leads to improved accuracy for properties that are difficult to predict when using physics-based models alone or when limited data is available.
Keywords Classical molecular dynamics simulations, Organic molecules, Physical properties, Viscosity, Quantitative
structure–property relationships, Machine learning
*Correspondence:
Mohammad Atif Faiz Afzal
[email protected]
Full list of author information is available at the end of the article
Chew et al. Journal of Cheminformatics (2024) 16:31 Page 2 of 14
Fig. 1 Distribution of the curated viscosity dataset. A Log-scale viscosity (µ) in centipoise as a function of temperature for three example battery-relevant structures. Chemical structures are drawn within the plot, and linear dashed lines are included as visual guides. Histograms of B log-scale µ and C temperature in Kelvin for the final viscosity dataset consisting of 4440 examples.
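The log transform and the 1.5×IQR (box-and-whisker) outlier screening used to arrive at this final dataset, as described later in the Methods, can be sketched in plain Python. The entry values below are hypothetical and only illustrate the mechanics; note that quartile conventions differ slightly between implementations:

```python
import math
import statistics

def iqr_bounds(values, k=1.5):
    """Box-and-whisker bounds: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_entries(entries):
    """Keep (temperature_K, viscosity_cP) pairs within 1.5*IQR bounds
    for both variables, then log10-transform the surviving viscosities."""
    t_lo, t_hi = iqr_bounds([t for t, _ in entries])
    v_lo, v_hi = iqr_bounds([v for _, v in entries])
    return [(t, math.log10(v)) for t, v in entries
            if t_lo <= t <= t_hi and v_lo <= v <= v_hi]

# Hypothetical (T, viscosity) entries; the 500 cP point is an extreme outlier.
data = [(270.0 + 10 * i, v) for i, v in enumerate(
    [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.70, 0.85, 1.00, 500.0])]
clean = filter_entries(data)
print(len(data), "->", len(clean))  # the extreme viscosity is removed
```

The real pipeline additionally filters by element set and by deviations from the expected inverse viscosity–temperature trend, as detailed in the Methods.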
state-of-the-art deep learning approach is graph neural networks (GNNs), specifically graph convolutional networks, which use convolution operators that learn features directly from a graph representation of a molecule (i.e., representing atoms as nodes and bonds as edges) [13]. GNNs are a promising approach to autonomously create structure–property relationships without having to pre-define descriptors based on expert domain knowledge [14]. However, it is still unclear whether GNNs outperform descriptor-based models, as the prediction accuracy of both approaches depends on the type and size of the data [12]. Furthermore, it is unclear how the inclusion of external features (such as temperature) might impact the prediction accuracy of either descriptor-based or GNN approaches. Finally, developing accurate QSPR models requires a large, curated viscosity dataset that can broadly generalize viscosity values across a wide range of temperatures. Some recent work has explored the use of data-driven methods to predict viscosities, such as group contribution methods for n-alkanes and iso-alkanes [9] or GNNs for single and binary liquid mixtures [15]. However, the comparison between descriptor-based and graph-based approaches, as well as the inclusion of physics-informed descriptors, has not been well explored.

In this work, we have extracted and cleaned a large dataset of over 4000 experimental viscosities of small molecules at various temperatures from multiple literature sources. We use this viscosity dataset to build and benchmark machine learning models that can predict viscosity as a function of temperature. We constructed both descriptor-based and GNN-based QSPR models to evaluate whether learned features from graphs could outperform hand-crafted features in predicting viscosities. Additionally, we incorporate information obtained from physics-based simulations into the ML models to further improve the model accuracy. Finally, we employ feature importance analysis tools to evaluate the influence of molecular-based and physics-informed descriptors on QSPR performance. We demonstrate that the developed models are highly accurate and can be used for quick estimation of the viscosity of new molecules, which enables these models to be used for the high-throughput screening of viscosities.

Methods

Viscosity dataset

We extracted viscosities, temperatures, and structures from the relevant literature and online databases [2, 16–28]. Details of the literature sources are included in Additional file 1: Table S1. All structures were represented as simplified molecular-input line-entry system (SMILES) strings. We curated an initial dataset of 5356 viscosity entries, covering a wide range of temperatures and viscosities. Then, we filtered the dataset using the following steps: (1) we filtered for single, organic structures containing only the atomic elements {H, C, N, O, F, Si, P, S, Cl, Br, and I}; (2) since high experimental errors were observed at the high and low extremes of the viscosity and temperature values, the dataset was filtered using the box-and-whisker method, where viscosities and temperatures that fall outside 1.5 times their corresponding interquartile range were removed as outliers; and (3) since viscosity values are expected to be inversely proportional to temperature for bulk liquids, data points with a positive deviation of viscosity with respect to temperature greater than 0.02 cP were removed as outliers (such positive deviations often arise from combining different literature sources). After applying the data filtration process, we used a total of 4440 viscosity entries for ML model development. This dataset consists of 1005 unique structures, with
viscosities ranging from 0.10 cP to 26.52 cP, and temperatures ranging from 227 to 404 K. Since only 136 of the 1005 unique structures have stereoisomers, we did not account for the impact of isomerism in this work. We apply a log transform of viscosity to ameliorate the skewed distribution of viscosity values; thus, all viscosities will be presented in the log scale as log µ, where µ has units of centipoise.

Figure 1A shows the log-scale viscosity as a function of temperature for three representative small molecules (methyl acetate, ethyl acetate, and methyl butyrate), which are electrolytes relevant to the design of Li-ion batteries [5]. Figure 1A highlights the inverse proportionality expected between viscosity and temperature, where higher temperature values yield lower viscosities. Figure 1B and C show the histograms of log-scale viscosity and temperature for the 4440 entries, respectively. Both Fig. 1B and C show a right-skewed normal distribution for both log-scale viscosity and temperature, which means that the data is more spread apart at larger viscosity and temperature values. We used the 4440 viscosity entries to train and evaluate all QSPR models.

Fig. 2 Descriptor-based QSPR approaches for predicting viscosity. A Workflow of the descriptor-based approaches using methyl acetate as an example. Methyl acetate is featurized with RDKit, Morgan fingerprint, and Matminer descriptors. A total of 1341 + Next (external features) features were passed into machine learning model development. The inverse temperature is included in model development to incorporate temperature effects. B Five-fold cross validation and test set RMSE for QSPR models. The average RMSE is reported across five out-of-sample train-test splits and the RMSE uncertainty is estimated by computing the standard deviation across the splits. C Parity plot between predicted and actual log-viscosity showing the validation set predictions across 5-CV on the training set for a single train/test split when using the LGBM model, which had the highest model score based on Eq. 1. Each color indicates the different validation sets for each of the five folds. The number of examples used (N), R², and RMSE for 5-CV are reported within the plot. D Parity plot between predicted and actual log viscosity for a single 80:20 train:test split for the LGBM model. The total number of examples used (N) and statistics (i.e., R² and RMSE) for train and test sets are reported within the plot. For all parity plots, a dashed diagonal y = x line is drawn as a guide to indicate which predictions are in agreement with the actual values.

Descriptor-based QSPR models

The general workflow for developing descriptor-based models is summarized in Fig. 2A. All molecules were featurized with 209 RDKit descriptors, 1000 Morgan fingerprints, and 132 Matminer descriptors. Featurization for RDKit and Morgan fingerprints was implemented using the rdkit package (Version 2021.09.4) [29], whereas Matminer descriptors were implemented using the matminer package (Version 0.6.3) [30]. Based on the Vogel equation of viscosity [11], we expect log µ to be proportional to the inverse of temperature; hence, we input the inverse of temperature for all ML models. External features, such as experimental inverse temperature or physics-based descriptors, were included as additional descriptors in the models; hence, a total of 1341 + Next features were passed into ML model development, where Next is the number of external features. All features were preprocessed with the following procedure: (1) correlated features with Pearson's r greater than or equal to 0.90 were removed; (2) constant features with variance of zero were removed; and (3) features were standardized by subtracting the mean and dividing by
the standard deviation. On average, 876 of the descriptors remained after feature preprocessing, and these were passed as inputs into the ML algorithms. Eight different ML algorithms were tested: multilayer perceptron (MLP), support vector regression (SVR), random forest (RF), gradient boosting regression (GBR), light gradient-boosting machine (LGBM) [31], extreme gradient boosting (XGB) [32], least absolute shrinkage and selection operator (LASSO), and partial least squares (PLS). All models were implemented with the scikit-learn package (Version 1.0.2) [33], except LGBM (lightgbm package, Version 3.2.1) and XGB (xgboost package, Version 1.5.1). We selected these ML algorithms based on the current state of the art in the literature to identify the best ML algorithm for predicting liquid viscosity [12]. For LASSO models, sparsity or reduction of the feature space was applied by modifying the "alpha" parameter in the sklearn module, which dictates the extent of L1 regularization on the coefficients of a linear regression. For SVR models, we used the default radial basis function kernel type in the sklearn module. Hyperparameters for descriptor-based models are described in Additional file 1: Table S2. For all descriptor-based QSPR models, we used a bagging regressor approach to allow for estimation of prediction errors, where 20 estimators for each ML algorithm were independently trained by randomly sampling the training set with replacement. Prediction values are reported by computing the average prediction of the 20 estimators, and prediction uncertainties are computed using the 90% confidence interval of the prediction values.

Fig. 3 Graph neural network QSPR approaches for predicting viscosity. A Workflow of the graph neural network (GNN) based approaches using methyl acetate as an example. Methyl acetate is represented as a molecular graph (G) with atoms as nodes (V) and bonds as edges (E). B Five-fold cross validation and test set RMSE for QSPR models. The average RMSE is reported across five random train-test splits and the RMSE uncertainty is estimated by computing the standard deviation across the splits. LGBM is included in this plot as a comparison between the best descriptor-based QSPR model and the GNN QSPR models. Only the top five performing GNNs are shown for brevity, which were selected based on Eq. 1. C Parity plot between predicted and actual log-viscosity showing the validation set predictions across 5-CV on the training set for a single train/test split when using the EdgePool model, which had the highest model score based on 5-CV and test set R². Each color indicates the different validation sets for each of the five folds. The number of examples used (N), R², and RMSE for 5-CV are reported within the plot. D Parity plot between predicted and actual log viscosity for a single 80:20 train:test split for the EdgePool model. The total number of examples used (N) and statistics (i.e., R² and RMSE) for train and test sets are reported within the plot. For all parity plots, a dashed diagonal y = x line is drawn as a guide to indicate which predictions are in agreement with the actual values.

GNN QSPR models

GNN models were built using DeepAutoQSAR, Schrödinger's automated molecular property prediction engine [34, 35]. For GNNs, molecules are treated as molecular graphs with atoms as nodes and bonds as edges, which is illustrated in Fig. 3A. A total of 75 features + Next (external features) were used to featurize each heavy atom. Atomic featurizations include one-hot encodings of atomic number, implicit valence, formal charge, atomic degree, number of radical electrons, hybridization, and aromaticity [35]. External features were standardized by subtracting the mean and
dividing by the standard deviation before being passed into the GNNs. For each atom, GNNs aggregate information from neighboring atoms and update a new atomic vector based on message passing across the molecular graph. The final learned atomic features outputted by the readout phase are then inputted into a fully connected layer to predict log viscosities. Ten graph-based model approaches were evaluated: Graph Convolutional Network (GCN) [36], the PyTorch version of GCN (TorchGraphConv) [37], TopK [38], GraphSAGE [39], Graph Isomorphism Network (GIN) [40], Self-Attention Graph Pooling (SAGPool) [41], EdgePool [42], GlobalAttention [40], Set2Set [43], and SortPool [44]. The GNN models differ slightly in how they aggregate information, based on successes from previous literature [40, 42]. All graph-based models were trained with PyTorch (Version 1.9.0) [45] for 500 epochs, a learning rate of 0.01, and a dropout ratio of 0.25. Hyperparameters for GNNs are described in Additional file 1: Table S3.

Classical molecular dynamics simulations

We performed MD simulations for all structures at each experimental temperature in the viscosity dataset to evaluate whether the inclusion of MD descriptors would improve the ML models. For all simulations, we used Schrödinger's Materials Science Suite (MSS) [46], which leverages the Desmond MD engine to rapidly speed up MD computations through GPU acceleration [7, 47, 48]. All molecules were parameterized with the OPLS4 force field [49]. For each system, we first constructed an amorphous simulation cell with approximately 8000 atoms. The initial density of the system in the amorphous cell structure was 0.5 g/cm³.

The equilibration procedure consisted of: Brownian minimization for 150 ps; a 0.5 ns NVT ensemble (number of atoms, volume, and temperature conserved) with a 2 fs time step at a temperature of 500 K and pressure of 1 atm; a 1 ns NPT ensemble (number of atoms, pressure, and temperature conserved) with a 2 fs time step at a temperature of 400 K and pressure of 1000 bar; a 2 ns NPT ensemble with a 2 fs time step at a temperature of 300 K and pressure of 1 atm; a 5 ns NPT ensemble with a 2 fs time step at the temperature (Texp) where the experimental viscosity is reported and a pressure of 1 atm; and a 10 ns NPT ensemble with a 2 fs time step at Texp and a pressure of 1 atm. After this equilibration protocol, we take the average cell size of the last 20% of the previous step and subsequently perform a 1 ns NVT ensemble with a 2 fs time step at Texp. The final production run consists of a 20 ns NVT ensemble with a 2 fs time step at Texp, saving a frame every 100 ps.

We extracted eight MD descriptors from the final production MD simulation: packing density (MD_density), percentage free volume (MD_FV), radius of gyration of the molecule (MD_Rg), Hansen solubility parameters (MD_SP, MD_SP_E, and MD_SP_V), heat of vaporization (MD_HV), and root-mean-square displacement (MD_RMSD) (see Additional file 1: Section S2.1 for details). MD descriptors were computed by taking the ensemble average over the last 10 ns of the production run, and these descriptors show convergence for both low and high viscosity examples (see Additional file 1: Figs. S3 and S4). Averaging MD descriptors over multiple replicas of MD simulations may yield better monotonic trends as a function of temperature, but their values do not significantly differ from descriptors obtained from a single MD simulation (see Additional file 1: Fig. S9). Therefore, we use MD descriptors from a single simulation. These MD descriptors were inputted as external features into the ML models to evaluate whether they could improve the prediction accuracy of viscosities. While MD simulations can yield highly informative descriptors, they also incur additional simulation costs. The estimated computational cost is around one hour per structure and temperature, assuming the use of a computer with a GPU similar to the NVIDIA Tesla T4. However, this cost could be mitigated by employing more efficient GPUs.

QSPR model training and evaluation

The workflow used to evaluate QSPR models is shown in Additional file 1: Fig. S5. To alleviate the effect of randomness in data splitting, five independent runs with different random seeds were performed with an 80:20 train:test split. Previous literature has used multiple train/test splits to better assess the accuracy of machine learning models [12]. While the average model performance over multiple train/test splits is similar to the model performance when using a single train/test split for predicting viscosity (see Additional file 1: Fig. S6), we only report the average model performance of the multiple train/test splits to avoid possible bias in data splitting. Since the viscosity dataset contains multiple entries with the same molecule at different temperature and viscosity values, we implement an out-of-sample approach for data splitting, where unique compounds are iteratively introduced to the training set until it reaches 80% of the dataset, and the remaining 20% of the data is placed in the testing set. Previous studies have observed that out-of-sample splitting is a better approach to measure model accuracy than random splitting from an application standpoint, because random splitting may lead to over-optimistic model performance for datasets with repeated molecules, where the same molecule could appear in both train and test sets [50]. Therefore, all train/test splits in this work use
the out-of-sample approach, such that the test set contains compounds not present in the training set.

For each train/test split, a five-fold cross validation procedure (5-CV) was implemented on the training set for hyperparameter tuning and for evaluating model generalizability across the training set. In 5-CV, the training set is partitioned into five separate sets, whereby for each of the five folds, one set is left out as the validation set using the out-of-sample data splitting approach and the remaining sets are used to train the model; this procedure is repeated five times until each data instance appears in the left-out set exactly once. In this work, we report the 5-CV coefficient of determination (R²) and root-mean-square error (RMSE) of the left-out sets only, which measures the model performance on new compounds. After selecting the best hyperparameters from 5-CV, the model is re-trained with the entire training set and used to predict the test set. The models are evaluated based on their ability to accurately generalize across the training set using the 5-CV approach and to predict the testing set, which is summarized by a model score (Score_M) in Eq. 1:

Score_M = R²_test × (1 − |R²_5-CV − R²_test|)    (1)

where R²_test and R²_5-CV are the coefficients of determination for the test set and for 5-CV on the training set, respectively. Score_M rewards models that exhibit high generalizability for both the training and testing sets. Score_M penalizes models where the accuracy is low for both sets or where the accuracies of the two sets are very distinct, which may be indicative of overfitting or poor generalization. Score_M is similar to previous model scoring functions in the literature that automatically select good models for structure–property relationships [51]. We primarily use Score_M to rank-order QSPR models based on 5-CV and test set prediction accuracy. All QSPR models were implemented using Python (Version 3.8.15).

Model interpretation

Feature importance was evaluated using the SHapley Additive exPlanations (SHAP) approach (shap package, Version 0.41.0), which is a game-theory approach to quantify the contributions of single players in a collaborative game [52, 53]. Shapley values measure the impact of a descriptor on an output property by including or excluding the descriptor across a set of instances. SHAP is a local model-agnostic method for explaining individual predictions. SHAP can also be used as a global interpretation method by aggregating Shapley values [54]. For all SHAP calculations, we use the test set instances to measure descriptor importance. The average magnitude of the Shapley values is reported (i.e., Mean |SHAP|), and the sign of the importance is determined by computing the Pearson's r correlation coefficient between the Shapley and descriptor values. A positive Pearson's r between Shapley and descriptor values indicates that the feature positively contributes to the output property, whereas a negative Pearson's r indicates the converse. Additional details about the SHAP method can be found in previous literature [12, 55, 56].

Results and discussion

Performance of descriptor-based QSPR models

We first sought to develop QSPR models using the descriptor-based approach, where hand-crafted two-dimensional (2D) descriptors and fingerprints are used as inputs into the machine learning model. Figure 2A shows the general workflow for inputting hand-crafted descriptors and external descriptors, such as inverse temperature, into QSPR models to predict log viscosities (see Methods for more details). Figure 2B shows the 5-CV and test set RMSE for the eight ML algorithms when using five random, out-of-sample 80:20 train:test splits across the viscosity dataset. ML algorithms were rank-ordered based on their model scores as described in Eq. 1. From Fig. 2B, we observe that tree-based ML models, such as LGBM, XGB, and GBR, were the top performers in predicting log viscosities, followed by other non-linear approaches such as SVR and MLP. Linear models like LASSO and PLS perform the worst, suggesting that a non-linear relationship between the 2D descriptors and log viscosities may be necessary for an accurate model. For all models, the 5-CV and test set RMSEs are very similar, which shows that the models' ability to generalize across the training set is correlated with their ability to generalize to unseen examples.

Since LGBM had the highest model score, we further investigated its accuracy in the 5-CV of the training set and in test set predictions. Figure 2C shows the parity plot between predicted and actual log viscosities when performing 5-CV across the training set using the LGBM algorithm; only predictions on the left-out validation set are shown for each of the five cross validation folds. The 5-CV parity plot shows that the majority of the points lie along the diagonal y = x line, suggesting that the LGBM model generalizes well across the training set with a 5-CV R² of 0.88 and RMSE of 0.16. Figure 2D shows a parity plot of predicted versus actual log viscosities for the training and testing sets when performing an 80:20 train:test split and using the LGBM algorithm. The LGBM model learned the training set well, with a train R² of 0.99 and RMSE of 0.04, and predicted the left-out test set with lower accuracy (i.e., test R² of 0.91 and RMSE of 0.13). The parity plots in Fig. 2C and D show minimal outliers in the LGBM model predictions,
which suggests the model is accurately capturing trends between structure, temperature, and viscosities.

Performance of GNN QSPR models

We next evaluated whether GNNs might outperform the descriptor-based approaches in predicting temperature-dependent viscosities. Figure 3A shows the general workflow of using GNNs to predict viscosities, using methyl acetate as an example (see the Methods section for details). Figure 3B shows the 5-CV and test set R² for the top five GNN models ranked by model score, with the top descriptor-based LGBM model included as a comparison. While the EdgePool model had the highest model score, the overall 5-CV and test set R² are comparable between the different GNN approaches, which suggests that varying the GNN architecture did not yield higher accuracy in viscosity predictions. The GNN models have slightly lower 5-CV and test set R² compared to the descriptor-based LGBM model (whose performance is drawn as a vertical dashed line), which suggests that descriptor-based approaches may slightly outperform graph-based approaches for this viscosity dataset. Figure 3C shows the parity plot between predicted and actual log viscosities when performing 5-CV across the training set using the EdgePool model. EdgePool achieves a 5-CV R² of 0.84 and RMSE of 0.18, which is slightly poorer compared to LGBM (see Fig. 2C). Figure 3D shows a parity plot between predicted and actual log viscosities for an 80:20 train:test split using the EdgePool algorithm. In comparison to LGBM (Fig. 2D), EdgePool achieves a slightly poorer test set R² of 0.89 and RMSE of 0.15. Overall, these results show that GNNs could be used to predict viscosities; however, descriptor-based approaches perform slightly better for this dataset.

Impact of molecular simulation derived descriptors on QSPR models for viscosity

We next investigated whether the inclusion of physics-based descriptors computed from molecular dynamics simulations could help improve the QSPR accuracy of

Fig. 4 Impact of MD descriptors in QSPR models for viscosity predictions. A Simulation snapshot of methyl acetate at T = 298 K, which was used to compute eight MD descriptors. B Test set root-mean-square error (RMSE) for the descriptor-based LGBM model and GNN-based EdgePool model when including two-dimensional descriptors (2D), molecular dynamics (MD) descriptors, or combinations of 2D and MD (2D and MD) in the QSPR models. The average RMSE is reported across five random, out-of-sample train-test splits and the RMSE uncertainty is estimated by computing the standard deviation across the splits. C Log-scale learning curve showing test set RMSE versus training set size when using 20% of the dataset as the test set and re-training the models with increasing training set sizes. These curves are plotted for LGBM and EdgePool models with and without MD descriptors. Twenty train-test splits were implemented to obtain accurate measurements of test RMSE, where the mean test set RMSE is reported and the uncertainty is estimated by the standard deviation of the test set RMSEs.
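The out-of-sample splitting and repeated-split RMSE estimates used above can be illustrated with a minimal standard-library sketch. The toy entries and the constant mean-of-train predictor below are placeholders for the real dataset and the LGBM/EdgePool models, and the function names are hypothetical:

```python
import random
import statistics

def out_of_sample_split(entries, train_frac=0.8, seed=0):
    """Assign whole molecules to train or test so that no structure
    appears in both sets (the out-of-sample scheme described above)."""
    rng = random.Random(seed)
    molecules = sorted({smiles for smiles, _, _ in entries})
    rng.shuffle(molecules)
    train_ids = set(molecules[:int(train_frac * len(molecules))])
    train = [e for e in entries if e[0] in train_ids]
    test = [e for e in entries if e[0] not in train_ids]
    return train, test

def rmse(pred_actual):
    """Root-mean-square error over (prediction, actual) pairs."""
    return statistics.mean((p - a) ** 2 for p, a in pred_actual) ** 0.5

# Toy (smiles, inverse_temperature, log_viscosity) entries; a constant
# mean-of-train predictor stands in for the actual QSPR models.
entries = [(f"mol{i}", 1.0 / (270 + 10 * j), 0.1 * i - 0.02 * j)
           for i in range(10) for j in range(4)]
scores = []
for seed in range(5):  # five random out-of-sample splits, as in Fig. 4B
    train, test = out_of_sample_split(entries, seed=seed)
    mean_pred = statistics.mean(y for _, _, y in train)
    scores.append(rmse([(mean_pred, y) for _, _, y in test]))
print(round(statistics.mean(scores), 3), "+/-", round(statistics.stdev(scores), 3))
```

Because each molecule contributes several temperature points, grouping by structure (rather than splitting rows at random) is what prevents the over-optimistic estimates discussed in the Methods.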
Temperature-dependent viscosity predictions for battery-relevant solvents

We next evaluated whether the QSPR models can capture the temperature dependence of log viscosities. We focused on six pure solvents previously studied by Logan and coworkers, which were used to potentially improve lithium-ion battery electrolytes: methyl acetate (MA), ethyl acetate (EA), methyl butyrate (MB), methyl propionate (MP), dimethyl carbonate (DMC), and ethyl methyl carbonate (EMC) [5]. These pure solvents could be added as co-solvents in lithium-ion batteries to lower viscosities and increase electrical conductivity; hence, these solvents can improve how fast a battery can charge or discharge. The authors report experimental temperature-dependent viscosities for these six solvents, which were not used in the original data curation of viscosities in this work. However, some of these solvents have been observed in other databases (e.g., PubChem [20]), so there is some overlap between the viscosities from Ref. [5] and the viscosity dataset used in this work. We investigate whether the QSPR models in this work could predict the experimental viscosity trends measured in Ref. [5].

Fig. 6 QSPR performance on six battery-relevant solvents. Predictions of the descriptor-based LGBM model and GNN-based EdgePool model when using two-dimensional descriptors (2D), molecular dynamics (MD) descriptors, or combinations of 2D and MD (2D and MD) in the QSPR models for six battery electrolytes: A methyl acetate (MA); B ethyl acetate (EA); C methyl butyrate (MB); D methyl propionate (MP); E dimethyl carbonate (DMC); and F ethyl methyl carbonate (EMC). Orange triangles represent experimental viscosities extracted from Ref. [5]. MA, EA, MB, and MP are in the training set and contain temperature ranges that encompass those found in Ref. [5]. DMC is partially in the training set, such that only two temperatures are provided to the models, at T = 293.15 and 298.15 K. EMC is not within the training set at all. Molecular structures are drawn in the upper right of each plot.

To eliminate the effect of data splitting, we re-trained the QSPR models using the entire viscosity dataset in this work. Figure 6 shows the log viscosity versus temperature predictions for the six solvents using the descriptor-based LGBM and GNN-based EdgePool models with varying featurization inputs (2D descriptors only, 2D and MD descriptors, and MD descriptors only). MA, EA, MB, and MP are structures within the viscosity dataset (i.e., the training set) and encompass the same range of temperatures as experimentally measured in Ref. [5]. Hence, across all QSPR models and featurization schemes, the experimental points shown as orange triangles are well-captured for MA, EA, MB, and MP (see Fig. 6A–D). These results show that the QSPR models capture experimental trends from Ref. [5] for structures and temperatures already seen in the training set, suggesting
Chew et al. Journal of Cheminformatics (2024) 16:31 Page 12 of 14
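The log viscosity versus temperature trends the models must reproduce follow an approximately Arrhenius (Andrade-type) form, log₁₀ η ≈ A + B/T, with B > 0 giving the inverse viscosity–temperature relationship. A minimal sketch of fitting this form by least squares, using synthetic data (the coefficients and viscosity values below are illustrative, not taken from the paper's dataset):

```python
import math

def fit_log_viscosity(temps_K, viscosities_cP):
    """Least-squares fit of log10(eta) = A + B/T (Andrade-type form).

    Returns (A, B); B > 0 reflects viscosity falling as temperature rises.
    """
    xs = [1.0 / t for t in temps_K]
    ys = [math.log10(v) for v in viscosities_cP]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope and intercept of the ordinary least-squares line y = a + b*x
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
    a = y_mean - b * x_mean
    return a, b

# Synthetic viscosities generated exactly from log10(eta) = -3.0 + 900/T,
# over the 280-323 K range discussed in the text
temps = [280.0, 293.15, 298.15, 310.0, 323.0]
etas = [10 ** (-3.0 + 900.0 / t) for t in temps]
A, B = fit_log_viscosity(temps, etas)
# Recovers A ≈ -3.0 and B ≈ 900 for this exactly linear synthetic data
```

Plotting the fitted line against held-out temperatures is essentially what Fig. 6 does for each QSPR model's predictions.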
For DMC (see Fig. 6E), the solvent is partially within the training set such that only two temperatures, T = 293.15 K and 298.15 K, have been seen by the model. Hence, the QSPR models must extrapolate across the wider range of temperatures, 280 K to 323 K, that were experimentally varied in Ref. [5]. We observe that EdgePool with (cyan line) and without MD descriptors (green line), as well as LGBM with 2D and MD descriptors (blue line), can accurately capture the experimental viscosities. Interestingly, predictions from LGBM models with 2D descriptors or MD descriptors alone have the largest deviation from the experimental viscosities, which suggests that combining 2D and MD descriptors helped improve generalization across temperature. For EMC (see Fig. 6F), the solvent is not within the training set; hence, the QSPR models are predicting on a new molecule. We observe similar trends as in Fig. 6E, where EdgePool with and without MD descriptors accurately captures experimental viscosity trends, and LGBM with 2D and MD descriptors outperforms models trained with 2D or MD descriptors alone in capturing experimental trends.
Altogether, the predictions on the six battery-relevant solvents show that these QSPR models can: (1) capture the inverse relationship between log viscosity and temperature, (2) predict temperature-dependent viscosities of new structures, and (3) improve in generalizability through the inclusion of MD descriptors for descriptor-based LGBM models. Given that the EdgePool model without MD descriptors performed well on the battery solvents shown in Fig. 6, we use this model to predict log viscosities for other solvents related to battery electrolyte design for lithium metal anodes from Ref. [66]. Viscosity predictions for 50 solvents at temperatures between 270 and 330 K are available in Additional file 1: Section S4.3. Future work will focus on using these models to screen new compounds to identify materials with promising viscosities.

Conclusion
In this work, we developed quantitative structure–property relationships (QSPR) to predict temperature-dependent viscosities of small organic molecules using a curated dataset of over 4000 experimental viscosities. Both descriptor-based and graph-based models were benchmarked to identify the machine learning algorithms that most accurately predict experimental viscosities: the light gradient-boosting machine (LGBM) algorithm and the EdgePool algorithm for descriptor-based and graph-based approaches, respectively. Including molecular dynamics (MD) descriptors slightly improved QSPR models compared to using two-dimensional descriptors alone, suggesting that features that capture intermolecular interactions can help improve predictions of viscosities. The improvement in prediction accuracy upon inclusion of MD descriptors is most pronounced when training viscosity models on small datasets of fewer than 1000 examples. Analyzing the top features related to viscosity for the LGBM model reveals that MD descriptors become most important to predicting viscosity, specifically the heat of vaporization, which captures nonbonded interactions between molecules. Finally, the QSPR models can accurately capture the inverse relationship between temperature and viscosity for six battery-relevant solvents.
These results demonstrate that, regardless of whether descriptor-based or graph-based models are used, the inclusion of MD descriptors that capture intermolecular interactions is useful for the prediction of viscosities, especially at small data sizes. MD descriptors may be even more relevant for mixture systems, where they could generalize more broadly since they are not single-molecule-dependent, in contrast to two-dimensional structural descriptors. However, one of the drawbacks of using MD descriptors is the computational cost to generate them. The improvement in accuracy at the small data scale, the generalizability of MD descriptors to heterogeneous systems, and automated computational workflows may help outweigh the cost of computing these descriptors. Future work will investigate the utility of MD descriptors in predicting viscosities for mixture systems, such as the binary mixtures explored in a recent work [15] (Additional file 3).

Scientific contribution
• Curated a viscosity dataset of more than 4000 examples consisting of small organic molecules and trained quantitative structure–property relationship (QSPR) models to accurately predict viscosity as a function of temperature.
• Encoding molecular dynamics (MD) simulation-derived descriptors that capture intermolecular interactions improves viscosity prediction, especially in small data scenarios.
• Feature importance analysis reveals that the MD-derived heat of vaporization is the most useful descriptor relevant to viscosity, even in the presence of hundreds of two-dimensional descriptors.
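The feature-importance analysis in this work used SHAP values on the trained LGBM model; the underlying idea can be illustrated more simply with permutation importance: shuffle one feature column and measure how much prediction error grows. The sketch below uses synthetic data and a toy oracle model, not the paper's descriptors or pipeline; an "important" feature (standing in for an MD-derived heat-of-vaporization descriptor) dominates, while an ignored feature scores zero:

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Mean-squared-error increase when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(Xm):
        return sum((predict(row) - t) ** 2 for row, t in zip(Xm, y)) / len(y)

    base = mse(X)
    importances = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the association between feature j and y
        X_perm = [row[:j] + [c] + row[j + 1:] for row, c in zip(X, col)]
        importances.append(mse(X_perm) - base)
    return importances

# Toy target: depends strongly on feature 0, weakly on feature 1;
# feature 2 is pure noise that the model ignores.
rng = random.Random(42)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [5.0 * row[0] + 0.5 * row[1] for row in X]
model = lambda row: 5.0 * row[0] + 0.5 * row[1]  # oracle model, for clarity

imps = permutation_importance(model, X, y, 3)
# imps[0] >> imps[1] > imps[2] == 0: feature 0 dominates
```

SHAP additionally attributes per-prediction contributions, but both tools answer the same question posed in the text: which descriptors the model actually relies on.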
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s13321-024-00820-5.

Additional file 1. Details of the curated viscosity dataset, how molecular dynamics descriptors are computed, the correlation between top descriptors and viscosity, the stability of molecular dynamics descriptors, and hyperparameters for QSPR models.
Additional file 2. Viscosity dataset used for generating ML models.
Additional file 3. Viscosity predictions for 50 battery-relevant solvents at temperatures between 270 K and 330 K.

Acknowledgements
We are grateful to the data team at Schrödinger for their assistance in the data curation of literature viscosity values, namely Asela Chandrasinge, Sophia Newman, and Mohammed Sulaiman.

Author contributions
M.A.F.A. conceived the idea; A.K.C. and M.S. worked on the initial idea and report; J.C.E. helped in cleaning the initial data and conversion to SMILES; A.K.C. extended the work by adding new data, adding advanced machine learning algorithms, and implementing feature importance; A.K.C. wrote the manuscript; M.A.F.A. supervised the work; all authors modified and approved the manuscript.

Data availability
Only 3582 of the 4440 examples in the viscosity dataset are made available due to copyright restrictions, as described in Additional file 2. The subset viscosity dataset and a pre-trained LGBM model using the subset dataset are provided under the Creative Commons Attribution Non-Commercial 4.0 International (CC-BY-NC 4.0) License. This license allows for the use of the data and the creation of adaptations, exclusively for non-commercial purposes, provided that appropriate credit is given. The additional file contains details of the curated viscosity dataset, how molecular dynamics descriptors are computed, the correlation between top descriptors and viscosity, the stability of molecular dynamics descriptors, hyperparameters for QSPR models, and the availability of the viscosity dataset and model.

Declarations

Competing interests
The authors declare no competing interests.

Author details
1 Schrödinger, Inc., New York, NY 10036, USA. 2 Schrödinger, Inc., Portland, OR 97204, USA. 3 Schrödinger, Inc., San Diego, CA 92121, USA.

Received: 13 July 2023 Accepted: 27 February 2024

References
1. Conte E, Martinho A, Matos HA, Gani R (2008) Combined group-contribution and atom connectivity index-based methods for estimation of surface tension and viscosity. Ind Eng Chem Res 47(20):7940–7954
2. Goussard V, Duprat F, Ploix J-L, Dreyfus G, Nardello-Rataj V, Aubry J-M (2020) A new machine-learning tool for fast estimation of liquid viscosity: application to cosmetic oils. J Chem Inf Model 60(4):2012–2023
3. Chen Y, Peng B, Kontogeorgis GM, Liang X (2022) Machine learning for the prediction of viscosity of ionic liquid-water mixtures. J Mol Liq 350:118546
4. Dajnowicz S, Agarwal G, Stevenson JM, Jacobson LD, Ramezanghorbani F, Leswing K, Friesner RA, Halls MD, Abel R (2022) High-dimensional neural network potential for liquid electrolyte simulations. J Phys Chem B 126(33):6271–6280
5. Logan ER, Tonita EM, Gering KL, Li J, Ma X, Beaulieu LY, Dahn JR (2018) A study of the physical properties of Li-ion battery electrolytes containing esters. J Electrochem Soc 165(2):A21
6. Santak P, Conduit G (2020) Enhancing NEMD with automatic shear rate sampling to model viscosity and correction of systematic errors in modeling density: application to linear and light branched alkanes. J Chem Phys 153(1):014102
7. Mohanty S, Stevenson J, Browning AR, Jacobson L, Leswing K, Halls MD, Afzal MAF (2023) Development of scalable and generalizable machine learned force field for polymers. Sci Rep 13(1):17251
8. Reid RC, Prausnitz JM, Poling BE (1987) The properties of gases and liquids, 4th edn. McGraw-Hill, New York
9. Jovanović JD, Grozdanić ND, Radović IR, Kijevčanin ML (2023) A new group contribution model for prediction liquid hydrocarbon viscosity based on free-volume theory. J Mol Liq 376:121452
10. Zhu L, Chen J, Liu Y, Geng R, Yu J (2012) Experimental analysis of the evaporation process for gasoline. J Loss Prev Process Ind 25(6):916–922
11. Poling BE, Prausnitz JM, O'Connell JP (2000) The properties of gases and liquids, 5th edn. McGraw-Hill, New York
12. Jiang D, Wu Z, Hsieh C-Y, Chen G, Liao B, Wang Z, Shen C, Cao D, Wu J, Hou T (2021) Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J Cheminform 13(1):1–23
13. Reiser P, Neubert M, Eberhard A, Torresi L, Zhou C, Shao C, Metni H, van Hoesel C, Schopmans H, Sommer T et al (2022) Graph neural networks for materials science and chemistry. Commun Mater 3(1):93
14. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530
15. Bilodeau C, Kazakov A, Mukhopadhyay S, Emerson J, Kalantar T, Muzny C, Jensen K (2023) Machine learning for predicting the viscosity of binary liquid mixtures. Chem Eng J 464:142454
16. Saldana DA, Starck L, Mougin P, Rousseau B, Ferrando N, Creton B (2012) Prediction of density and viscosity of biofuel compounds using machine learning methods. Energy Fuels 26(4):2416–2426
17. Viswanath DS, Ghosh TK, Prasad DHL, Dutt NVK, Rani KY (2007) Correlations and estimation of pure liquid viscosity. In: Viscosity of liquids: theory, estimation, experiment, and data, pp 135–405
18. Cocchi M, De Benedetti PG, Seeber R, Tassi L, Ulrici A (1999) Development of quantitative structure-property relationships using calculated descriptors for the prediction of the physicochemical properties (nD, ρ, bp, ε, η) of a series of organic solvents. J Chem Inf Comput Sci 39(6):1190–1203
19. Kauffman GW, Jurs PC (2001) Prediction of surface tension, viscosity, and thermal conductivity for common organic solvents using quantitative structure-property relationships. J Chem Inf Comput Sci 41(2):408–418
20. Kim S, Thiessen PA, Cheng T, Zhang J, Gindulyte A, Bolton EE (2019) PUG-View: programmatic access to chemical annotations integrated in PubChem. J Cheminform 11(1):1–11
21. Dean JA et al (1999) Lange's handbook of chemistry, 5th edn. McGraw-Hill, New York
22. Washburn EW (2003) International critical tables of numerical data, physics, chemistry and technology, 1st edn. Knovel, Norwich
23. Rumble JR (2022) CRC handbook of chemistry and physics, 103rd edn. CRC Press, Boca Raton
24. Manivannan RG, Mohammad S, McCarley K, Cai T, Aichele C (2019) A new test system for distillation efficiency experiments at elevated liquid viscosities: vapor-liquid equilibrium and liquid viscosity data for cyclopentanol + cyclohexanol. J Chem Eng Data 64(2):696–705
25. Chen X, Jin S, Dai Y, Wu J, Guo Y, Lei Q, Fang W (2019) Densities and viscosities for the ternary system of decalin + methylcyclohexane + cyclopentanol and corresponding binaries at T = 293.15 to 343.15 K. J Chem Eng Data 64(4):1414–1424
26. Burk V, Pollak S, Quinones-Cisneros SE, Schmidt KAG (2021) Complementary experimental data and extended density and viscosity reference models for squalane. J Chem Eng Data 66(5):1992–2005
27. Bright NFH, Hutchison H, Smith D (1946) The viscosity and density of sulphuric acid and oleum. J Soc Chem Ind 65(12):385–388
28. Segur JB, Oberstar HE (1951) Viscosity of glycerol and its aqueous solutions. Ind Eng Chem 43(9):2117–2120
29. Landrum G et al (2010) RDKit. https://www.rdkit.org/. Accessed Jan–Apr 2023
30. Ward L, Dunn A, Faghaninia A, Zimmermann NE, Bajaj S, Wang Q, Montoya J, Chen J, Bystrom K, Dylla M et al (2018) Matminer: an open source toolkit for materials data mining. Comput Mater Sci 152:60–69
31. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y (2017) LightGBM: a highly efficient gradient boosting decision tree. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc, New York
32. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16, ACM, New York, pp 785–794. https://doi.org/10.1145/2939672.2939785
33. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
34. Yang Y, Yao K, Repasky MP, Leswing K, Abel R, Shoichet BK, Jerome SV (2021) Efficient exploration of chemical space with docking and deep learning. J Chem Theor Comput 17(11):7106–7119
35. Benchmark study of DeepAutoQSAR, ChemProp, and DeepPurpose on the ADMET subset of the Therapeutic Data Commons (2022) https://www.schrodinger.com/sites/default/files/22_086_machine_learning_white_paper_r4-1.pdf. Accessed 4 May 2024
36. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
37. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP (2015) Convolutional networks on graphs for learning molecular fingerprints. In: Advances in neural information processing systems, vol 28
38. Knyazev B, Taylor GW, Amer M (2019) Understanding attention and generalization in graph neural networks. In: Advances in neural information processing systems, vol 32
39. Hamilton W, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: Advances in neural information processing systems, vol 30
40. Xu K, Hu W, Leskovec J, Jegelka S (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826
41. Lee J, Lee I, Kang J (2019) Self-attention graph pooling. In: International conference on machine learning, PMLR, pp 3734–3743
42. Diehl F (2019) Edge contraction pooling for graph neural networks. arXiv preprint arXiv:1905.10990
43. Vinyals O, Bengio S, Kudlur M (2015) Order matters: sequence to sequence for sets. arXiv preprint arXiv:1511.06391
44. Zhang M, Cui Z, Neumann M, Chen Y (2018) An end-to-end deep learning architecture for graph classification. In: Proceedings of the AAAI conference on artificial intelligence, vol 32
45. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, vol 32. Curran Associates, Inc., pp 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf. Accessed Jan–Apr 2023
46. Materials Science Suite, version 2022-2 (2022) Schrödinger, LLC, New York. https://www.schrodinger.com/platform/materials-science. Accessed Jan–Apr 2023
47. Bowers KJ, Chow E, Xu H, Dror RO, Eastwood MP, Gregersen BA, Klepeis JL, Kolossvary I, Moraes MA, Sacerdoti FD et al (2006) Scalable algorithms for molecular dynamics simulations on commodity clusters. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p 84
48. Afzal MAF, Browning AR, Goldberg A, Halls MD, Gavartin JL, Morisato T, Hughes TF, Giesen DJ, Goose JE (2020) High-throughput molecular dynamics simulations and validation of thermophysical properties of polymers for various applications. ACS Appl Polym Mater 3(2):620–630
49. Lu C, Wu C, Ghoreishi D, Chen W, Wang L, Damm W, Ross GA, Dahlgren MK, Russell E, Von Bargen CD et al (2021) OPLS4: improving force field accuracy on challenging regimes of chemical space. J Chem Theor Comput 17(7):4291–4300
50. Zahrt AF, Henle JJ, Denmark SE (2020) Cautionary guidelines for machine learning studies with combinatorial datasets. ACS Comb Sci 22(11):586–591
51. Dixon SL, Duan J, Smith E, Von Bargen CD, Sherman W, Repasky MP (2016) AutoQSAR: an automated machine learning tool for best-practice quantitative structure-activity relationship modeling. Future Med Chem 8(15):1825–1839
52. Lundberg SM, Lee S-I (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., pp 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf. Accessed Jan–Apr 2023
53. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, Katz R, Himmelfarb J, Bansal N, Lee S-I (2020) From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2:56–67
54. Molnar C (2022) Interpretable machine learning, 2nd edn. https://christophm.github.io/interpretable-ml-book. Accessed Jan–Apr 2023
55. Rodríguez-Pérez R, Bajorath J (2019) Interpretation of compound activity predictions from complex machine learning models using local approximations and Shapley values. J Med Chem 63(16):8761–8777
56. Bannigan P, Bao Z, Hickman RJ, Aldeghi M, Häse F, Aspuru-Guzik A, Allen C (2023) Machine learning models to accelerate the design of polymeric long-acting injectables. Nat Commun 14(1):35
57. Afzal MAF, Sonpal A, Haghighatlari M, Schultz AJ, Hachmann J (2019) A deep neural network model for packing density predictions and its application in the study of 1.5 million organic molecules. Chem Sci 10(36):8374–8383
58. Wellawatte GP, Gandhi HA, Seshadri A, White AD (2022) A perspective on explanations of molecular prediction models. J Chem Theor Comput. https://doi.org/10.1021/acs.jctc.2c01235
59. Sanchez-Lengeling B, Wei J, Lee B, Reif E, Wang P, Qian W, McCloskey K, Colwell L, Wiltschko A (2020) Evaluating attribution for graph neural networks. Adv Neural Inf Process Syst 33:5898–5910
60. Huang Q, Yamada M, Tian Y, Singh D, Chang Y (2022) GraphLIME: local interpretable model explanations for graph neural networks. IEEE Trans Knowl Data Eng
61. Weber JK, Morrone JA, Bagchi S, Estrada JD, Pabon SK, Zhang L, Cornell WD (2022) Simplified, interpretable graph convolutional neural networks for small molecule activity prediction. J Comput-Aided Mol Des. https://doi.org/10.1007/s10822-021-00421-6
62. Rodríguez-Pérez R, Bajorath J (2020) Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions. J Comput-Aided Mol Des 34:1013–1026
63. Bonchev D, Trinajstić N (1977) Information theory, distance matrix, and molecular branching. J Chem Phys 67(10):4517–4533
64. Qun-Fang L, Yu-Chun H, Rui-Sen L (1997) Correlation of viscosities of pure liquids in a wide temperature range. Fluid Phase Equilib 140(1–2):221–231
65. Miller AA (1963) "Free volume" and the viscosity of liquid water. J Chem Phys 38(7):1568–1571
66. Kim SC, Oyakhire ST, Athanitis C, Wang J, Zhang Z, Zhang W, Boyle DT, Kim MS, Yu Z, Gao X et al (2023) Data-driven electrolyte design for lithium metal anodes. Proc Natl Acad Sci 120(10):e2214357120

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.