learnMET - An R Package To Apply Machine Learning Methods For Genomic Prediction Using Multi-Environment Trial Data
https://doi.org/10.1093/g3journal/jkac226
Advance Access Publication Date: 19 September 2022
Software and Data Resources
*Corresponding author: Division of Plant Breeding Methodology, Department of Crop Sciences, University of Goettingen, Carl-Sprengel-Weg 1, 37075 Goettingen, Germany. Email: [email protected]
*Corresponding author: Division of Plant Breeding Methodology, Department of Crop Sciences, University of Goettingen, Carl-Sprengel-Weg 1, 37075 Goettingen, Germany. Email: [email protected]
Abstract
We introduce the R-package learnMET, developed as a flexible framework to enable a collection of analyses on multi-environment trial
breeding data with machine learning-based models. learnMET allows the combination of genomic information with environmental data
such as climate and/or soil characteristics. Notably, the package offers the possibility of incorporating weather data from field weather stations, or of retrieving global meteorological datasets from a NASA database. Daily weather data can be aggregated over specific periods of
time based on naive (for instance, nonoverlapping 10-day windows) or phenological approaches. Different machine learning methods for
genomic prediction are implemented, including gradient-boosted decision trees, random forests, stacked ensemble models, and multi-
layer perceptrons. These prediction models can be evaluated via a collection of cross-validation schemes that mimic typical scenarios
encountered by plant breeders working with multi-environment trial experimental data in a user-friendly way. The package is published
under an MIT license and accessible on GitHub.
Keywords: multienvironment trials; machine learning; genotype × environment interaction; genomic prediction; R software
[...] weather-based covariables, based on daily weather data, the user can set the compute_climatic_ECs argument to TRUE, and two possibilities are given. The first one is to provide raw daily weather data (with the raw_weather_data argument), which will undergo a quality control with the generation of an output file with flagged values. The second possibility, if the user does not have weather data available from measurements (e.g. from an in-field weather station), is the retrieval of daily weather records from NASA's Prediction of Worldwide Energy Resources (NASA POWER) database (https://power.larc.nasa.gov/), using the package nasapower (Sparks 2018). Spatiotemporal information contained in the info_environments [...]

[...] METData, required as input for all other functionalities of the package.

Machine learning-based models implemented

Different machine learning-based regression methods are provided as S3 classes in an object-oriented programming style. These methods are called within the pipeline of the predict_trait_MET_cv() function, that is presented in the following section. In particular, the XGBoost gradient boosting library (Chen and Guestrin 2016), the Random Forest algorithm (Breiman 2001), stacked ensemble models with Lasso regularization as meta-learners (Van der Laan et al. 2007), and multilayer perceptrons [...]
Fig. 1. Overview of the pipeline regarding integration of weather data using the function create_METData() within the learnMET package. The blue circle
signals the first step of the process, when the function is initially called. The blue boxes indicate how the arguments of the function should be given,
according to the type of datasets available to the user. The green boxes indicate a task which is run in the pipeline via internal functions of the package.
The red circle signals the final step, when the METData object is created and contains environmental covariates. Details on the quality control tests
implemented on daily weather data are provided at https://cjubin.github.io/learnMET/reference/qc_raw_weather_data.html, and on the methods to
build ECs based on aggregation of daily data at https://cjubin.github.io/learnMET/reference/get_ECs.html.
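To make the two weather-data routes described above concrete, the following minimal sketch contrasts automatic retrieval from NASA POWER with user-supplied daily records. It is illustrative only: the object names info_environments_G2F and raw_weather_G2F and the output path are assumptions, while the argument names follow those cited in the text and in the boxes below.

# Option 1: no on-site records; daily weather is retrieved from NASA POWER
# (via nasapower) and aggregated into environmental covariates (ECs).
> library(learnMET)
> METdata_nasa <- create_METData(
    geno = geno_G2F,
    pheno = pheno_G2F,
    map = map_G2F,
    info_environments = info_environments_G2F,  # hypothetical table of locations, coordinates, dates
    compute_climatic_ECs = TRUE,
    path_to_save = "/project1/g2f_results")

# Option 2: raw daily records from field weather stations are supplied and
# first undergo the quality-control step (flagged values written to a file).
> METdata_station <- create_METData(
    geno = geno_G2F,
    pheno = pheno_G2F,
    map = map_G2F,
    info_environments = info_environments_G2F,
    raw_weather_data = raw_weather_G2F,         # hypothetical data frame of daily measurements
    compute_climatic_ECs = TRUE,
    path_to_save = "/project1/g2f_results")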
Box 3. Evaluation of a prediction method using a CV scheme (i.e. METData object with phenotypic data)

> res_cv0_indica <- predict_trait_MET_cv(
    METData = METdata_indica,
    trait = "GC",
    prediction_method = "xgb_reg_1",
    cv_type = "cv0",
    cv0_type = "leave-one-year-out",
    seed = 100,
    path_folder = "/project1/indica_cv_res/cv0")

[...] the output) and is fully connected to the next layer. Here, the first hidden layer receives the marker genotypes and the ECs as input, computes a weighted linear summation of these inputs (i.e. z = WᵀX + b, where X represents the input features, Wᵀ the vector of weights, and b the bias), and transforms the latter with a nonlinear activation function f(z), yielding the output of the given neuron. In the next hidden layers, each neuron (also named node) in one layer connects with a given weight to each neuron in the consecutive layer. The last hidden layer is generally connected with a linear function to the output layer that consists of a single node. In MLP, learning is done via backpropagation: the network makes a prediction for each training instance, calculates [...]
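As a purely illustrative sketch of the weighted summation and activation just described (this is not learnMET code; the dimensions, random values, and the ReLU choice are arbitrary assumptions), one hidden layer for a single observation can be written in a few lines of base R:

# x: one observation with p predictors (marker genotypes and ECs).
# W: weight matrix with one row per neuron, so each z_i = w_i' x + b_i.
set.seed(1)
p <- 5; n_units <- 3
x <- rnorm(p)                                    # input features
W <- matrix(rnorm(n_units * p), nrow = n_units)  # weights of the hidden layer
b <- rnorm(n_units)                              # biases
z <- as.vector(W %*% x + b)                      # weighted linear summation
f <- function(z) pmax(z, 0)                      # nonlinear activation (ReLU)
a <- f(z)                                        # neuron outputs fed to the next layer
a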
Box 5. Prediction of new observations using a training set and a test set (i.e. phenotypic data not required)

# Create a training set composed of years 2014, 2015 and 2016:
> METdata_G2F_training <-
    create_METData(
      geno = geno_G2F,
      pheno = pheno_G2F[pheno_G2F$year %in% c(2014, 2015, 2016), ],
      map = map_G2F,
      climate_variables = NULL,
      compute_climatic_ECs = TRUE,

[...] the parent node, across all splits for which this given predictor was used. Feature importances are in this case scaled between 0 and 100.

Other model-agnostic interpretation techniques have been developed that provide the advantage of being independent from the original machine learning algorithm applied, thereby allowing straightforward comparisons across models (Molnar 2022). After shuffling the values of a given predictor variable, the value of the loss function (e.g. root mean square error in regression problems), estimated using the predictions of the shuffled data and the observed values, can be used to obtain an estimate of the permutation-based variable importance. Fisher et al. (2019) [...]
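The permutation logic described above can be sketched generically in base R; the snippet below only illustrates the principle (the fitted model, its predict() method, and the evaluation data are assumed, and learnMET's own implementation may differ):

# Permutation-based importance of one predictor: increase in RMSE after
# shuffling that predictor in the evaluation data (averaged over n_perm shuffles).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

perm_importance <- function(fit, data, y, predictor, n_perm = 10) {
  baseline <- rmse(y, predict(fit, data))
  shuffled_loss <- replicate(n_perm, {
    permuted <- data
    permuted[[predictor]] <- sample(permuted[[predictor]])  # break the link with the outcome
    rmse(y, predict(fit, permuted))
  })
  mean(shuffled_loss) - baseline  # larger increase in loss = more important predictor
}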
Fig. 2. Output results from the create_METData() function. a) Cluster analysis using the K-means algorithm (K = 4) to identify groups of similar environments based on environmental data. b) Total within-cluster sum of squares as a function of the number of clusters. c) Average silhouette score as a function of the number of clusters. These methods can help users decide on the optimal number of clusters. Data used here are a subset of the Genomes to Fields maize dataset (AlKhalifah et al. 2018; McFarland et al. 2020). Weather data were retrieved from the NASA POWER database via the package nasapower (Sparks 2018). Plots are saved in the directory provided in the path_to_save argument.
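Diagnostics such as those in Fig. 2, b and c can be reproduced generically with base R and the cluster package; the sketch below uses simulated environmental covariates and is not the package's internal code:

library(cluster)  # for silhouette()

# Simulated matrix of environmental covariates (rows = environments).
set.seed(2)
env_data <- matrix(rnorm(30 * 6), nrow = 30, ncol = 6)

ks <- 2:8
wss <- sapply(ks, function(k) kmeans(env_data, centers = k, nstart = 25)$tot.withinss)
sil <- sapply(ks, function(k) {
  km <- kmeans(env_data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(env_data))[, "sil_width"])
})

# The elbow of the within-cluster sum of squares and the maximum average
# silhouette width both help in choosing the number of clusters K.
plot(ks, wss, type = "b", xlab = "Number of clusters", ylab = "Total within-cluster SS")
plot(ks, sil, type = "b", xlab = "Number of clusters", ylab = "Average silhouette width")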
Fig. 3. Correlations between predicted and observed values for a forward prediction scenario using two machine learning models and a linear reaction
norm approach. a) Three traits predicted for two rice populations. Each year is predicted based on at least two past years of phenotypic data (one single
location). b) Grain yield predicted for the G2F dataset. GC (rice data), percentage of chalky kernels; GY (rice data), grain yield (kg/ha); PHR (rice data),
percentage of head rice recovery; GY (G2F), bushels per acre.
[...] environmental variables (sum of photothermal time and frequency of rainfall) influence the average prediction of maize grain yield using ALE plots. We should stress that the size of the dataset employed here is likely too small to make real inferences about the relationship between the predictor variables and the outcome (sharp drops observed at some feature values). Our goal here is essentially to illustrate how these functions can be used to gain insights into a model's predictions using the package.
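Accumulated local effect (ALE) curves of this kind can also be computed outside the package, for example with the iml R package; the small simulated example below only demonstrates the mechanics and makes no claim about learnMET's internal implementation:

library(iml)

# Fit any regression model, then compute the ALE of one predictor.
set.seed(3)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- sin(2 * pi * dat$x1) + 0.5 * dat$x2 + rnorm(200, sd = 0.1)
mod <- lm(y ~ x1 + x2, data = dat)

predictor <- Predictor$new(mod, data = dat[, c("x1", "x2")], y = dat$y)
ale_x1 <- FeatureEffect$new(predictor, feature = "x1", method = "ale")
plot(ale_x1)  # how x1 locally shifts the average prediction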
Fig. 4. Root mean square error between predicted and observed values for a forward prediction scenario using two machine learning models and a
linear reaction norm approach. a) Three traits predicted for two rice populations. Each year is predicted based on at least two past years of phenotypic
data (one single location). b) Grain yield predicted for the G2F dataset. GC (rice data), percentage of chalky kernels; GY (rice data), grain yield (kg/ha);
PHR (rice data), percentage of head rice recovery; GY (G2F), bushels per acre.
Concluding remarks and future developments

learnMET was developed to make the integration of complex datasets, originating from various data sources, user-friendly. The package provides flexibility at various levels: (1) regarding the use of weather data, with the possibility to provide on-site weather station data, or to retrieve external weather data, or a mix of both if on-site data are only partially available; (2) regarding how time intervals for aggregation of daily weather data are defined; (3) regarding the diversity of nonlinear machine learning models proposed; (4) regarding options to provide manually specified subsets of predictor variables (for instance, for environmental features via the argument list_env_predictors in predict_trait_MET_cv()).

To allow analyses on larger datasets, future developments of the package should include parallel processing to improve the scalability of the package and to best harness high performance computing resources.
Fig. 5. Model interpretation methods applied to the model fitted to a subset of the G2F dataset from years 2014 to 2016 (17 environments included) with
xgb_reg_1 for the trait grain yield. a) Model-agnostic variable importance using 10 permutations. The top 40 most important predictor variables are
displayed, and the table containing results across all permutations for all variables is returned. ALE plots for (b) sum of photothermal time during the
1st day-interval of the growing season, and (c) the frequency of days with an amount of precipitation above 10 mm during the 2nd day-interval of the
growing season. Tick marks indicate the unique values observed for the given covariate in the training set.
Improvements and extensions of stacked models and deep learning models are also intended, as we did not investigate in-depth the network architecture (e.g. number of nodes per layer, type of activation function, type of optimizer), nor other types of deep learning models that might perform better (e.g. convolutional neural networks). Finally, the package could be extended to allow genotype-specific ECs, because the timing of developmental stages differs across genotypes (e.g. due to variability in earliness) and should ideally be taken into account.
Data availability

The software is available on GitHub at https://github.com/cjubin/learnMET. Documentation and vignettes are provided at https://cjubin.github.io/learnMET/.
Literature cited

Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. 2016. p. 785–794.
Chollet F, et al. Keras. 2015. https://keras.io.
Costa-Neto G, Fritsche-Neto R, Crossa J. Nonlinear kernels, dominance, and envirotyping data increase the accuracy of genome-based prediction in multi-environment trials. Heredity. 2021;126(1):92–106.
Costa-Neto G, Galli G, Carvalho HF, Crossa J, Fritsche-Neto R. EnvRtype: a software to interplay enviromics and quantitative genomics in agriculture. G3 (Bethesda). 2021;11:jkab040.
Kuhn M, Wickham H. Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. 2020. https://CRAN.R-project.org/package=tidymodels.
McFarland BA, AlKhalifah N, Bohn M, Bubert J, Buckler ES, Ciampitti I, Edwards J, Ertl D, Gage JL, Falcon CM, et al. Maize genomes to fields (G2F): 2014–2017 field seasons: genotype, phenotype, climatic, soil, and inbred ear image datasets. BMC Res Notes. 2020;13(1):1–6.
McKinney BA, Reif DM, Ritchie MD, Moore JH. Machine learning for detecting gene-gene interactions. Appl Bioinformatics. 2006;5(2):77–88.
Molnar C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2022.
Rincent R, Malosetti M, Ababaei B, Touzy G, Mini A, Bogard M, Martre P, Le Gouis J, van Eeuwijk F. Using crop growth model stress covariates and AMMI decomposition to better predict genotype-by-environment interactions. Theor Appl Genet. 2019;132(12):3399–3411.
Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003;4(1):28.
Runcie DE, Qu J, Cheng H, Crawford L. MegaLMM: mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biol. 2021;22(1):1–25.