
G3, 2022, 12(11), jkac226
https://doi.org/10.1093/g3journal/jkac226
Advance Access Publication Date: 19 September 2022
Software and Data Resources

learnMET: an R package to apply machine learning methods for genomic prediction using multi-environment trial data

Cathy C. Westhues,1,2,* Henner Simianer,2,3 Timothy M. Beissinger1,2,*

1Division of Plant Breeding Methodology, Department of Crop Sciences, University of Goettingen, 37075 Goettingen, Germany,
2Center for Integrated Breeding Research, University of Goettingen, 37075 Goettingen, Germany,
3Animal Breeding and Genetics Group, Department of Animal Sciences, University of Goettingen, 37075 Goettingen, Germany

*Corresponding author: Division of Plant Breeding Methodology, Department of Crop Sciences, University of Goettingen, Carl-Sprengel-Weg 1, 37075 Goettingen, Germany. Email: [email protected]

Abstract
We introduce the R-package learnMET, developed as a flexible framework to enable a collection of analyses on multi-environment trial breeding data with machine learning-based models. learnMET allows the combination of genomic information with environmental data such as climate and/or soil characteristics. Notably, the package offers the possibility of incorporating weather data from field weather stations, or of retrieving global meteorological datasets from a NASA database. Daily weather data can be aggregated over specific periods of time based on naive (for instance, nonoverlapping 10-day windows) or phenological approaches. Different machine learning methods for genomic prediction are implemented, including gradient-boosted decision trees, random forests, stacked ensemble models, and multilayer perceptrons. These prediction models can be evaluated via a collection of cross-validation schemes that mimic typical scenarios encountered by plant breeders working with multi-environment trial experimental data in a user-friendly way. The package is published under an MIT license and accessible on GitHub.

Keywords: multienvironment trials; machine learning; genotype × environment interaction; genomic prediction; R software

Received: April 26, 2022. Accepted: July 29, 2022
© The Author(s) 2022. Published by Oxford University Press on behalf of Genetics Society of America.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
Large amounts of data from various sources (phenotypic records from field trials, genomic or omics data, environmental information) are regularly gathered as part of multi-environment trials (MET). The efficient exploitation of these extensive datasets has become of utmost interest for breeders to address essentially two objectives: (1) accurately predicting genotype performance in future environments; and (2) untangling complex relationships between genetic markers, environmental covariables (ECs), and phenotypes to better understand the pervasive phenomenon of genotype-by-environment (G × E) interaction.

Many R packages have recently been developed that allow implementing genomic prediction models accounting for G × E effects using mixed models: BGLR (Pérez and de Los Campos 2014), sommer (Covarrubias-Pazaran 2016), Bayesian Genomic Genotype × Environment Interaction (BGGE) (Granato et al. 2018), Bayesian Multi-Trait Multi-Environment for Genomic Selection (BMTME) (Montesinos-López et al. 2019), bWGR (Xavier et al. 2019), EnvRtype (Costa-Neto, Galli, et al. 2021), and MegaLMM (Runcie et al. 2021). BGGE presents a speed advantage over BGLR, which is explained by the use of an optimization procedure for sparse covariance matrices, while BMTME additionally exploits the genetic correlation among traits and environments to build linear G × E models. EnvRtype further widens the range of opportunities in Bayesian kernel models with the possibility to use nonlinear arc-cosine kernels aiming at reproducing a deep learning approach (Cuevas et al. 2019; Costa-Neto, Fritsche-Neto, et al. 2021), and to harness environmental data retrieved by the package.

While Bayesian approaches have been successful at dramatically improving predictive ability in multi-environment breeding experiments (Cuevas et al. 2017, 2019; Costa-Neto, Fritsche-Neto, et al. 2021), data-driven machine learning algorithms represent alternative predictive modeling techniques with increased flexibility with respect to the form of the mapping function between input and output variables. In particular, nonlinear effects including gene × gene and genotype × environment (G × E) interactions can be captured with machine learning models (Ritchie et al. 2003; McKinney et al. 2006; Crossa et al. 2019; Westhues et al. 2021). G × E interactions are of utmost interest for plant breeders, especially when they present a crossover type, because the latter implies a change in the relative ranking of genotypes across different environments. Breeders generally cope with G × E by either (1) focusing their program on wide adaptation of cultivars over a target population of environments, from which it follows that the developed varieties are not the best ones for any given environment and positive G × E interactions are not exploited, or (2) identifying varieties that are best adapted to specific environments (Bernardo 2002). Enhancing the modeling of genotype-by-environment interactions by the inclusion of environmental covariates related to critical developmental stages has also resulted in an increase of predictive ability in many studies using MET datasets (Heslot et al. 2012; Monteverde et al. 2019; Rincent et al. 2019; Costa-Neto, Fritsche-Neto, et al. 2021).
In this article, we describe the R-package learnMET and its principal functionalities. learnMET provides a pipeline to (1) facilitate environmental characterization and (2) evaluate and compare different types of machine learning approaches to predict quantitative traits based on relevant cross-validation (CV) schemes for MET datasets. The package offers flexibility by allowing the user to specify the sets of predictors to be used in predictions, and different methods to process genomic information to model genetic effects.

To validate the predictive performance of the models, different CV schemes are covered by the package that aim at addressing concrete plant breeding prediction problems with multi-environment field experiments. We borrow the same terminology as in previous related studies (see Burgueño et al. 2012; Jarquín et al. 2014, 2017), as follows: (1) CV1: predicting the performance of newly developed genotypes (never tested in any of the environments included in the MET); (2) CV2: predicting the performance of genotypes that have been tested in some environments but not in others (also referred to as field sparse testing); (3) CV0: predicting the performance of genotypes in new environments, i.e. the environment has not been tested; and (4) CV00: predicting the performance of newly developed genotypes in new environments, i.e. both environment and genotypes have not been observed in the training set. For CV0 and CV00, four configurations are implemented: leave-one-environment-out, leave-one-site-out, leave-one-year-out, and forward prediction.
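To make these scenarios concrete, the following base-R sketch (our own illustration, not part of the package) builds forward-prediction CV0 splits and a CV1 split for a toy MET table:

# Illustrative sketch of CV partition logic (not learnMET internals).
set.seed(1)
met <- expand.grid(genotype = paste0("g", 1:50), year = 2014:2017)
met$yield <- rnorm(nrow(met))
# CV0, forward prediction: train only on years preceding the target year.
cv0_forward <- lapply(2015:2017, function(yr) {
  list(train = met[met$year < yr, ], test = met[met$year == yr, ])
})
# CV1: hold out whole genotypes, so test genotypes are unseen during training.
test_geno <- sample(levels(met$genotype), 10)
cv1 <- list(train = met[!met$genotype %in% test_geno, ],
            test = met[met$genotype %in% test_geno, ])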
Methods

Installation and dependencies
Using the devtools package (Wickham et al. 2021), learnMET can be easily installed from GitHub and loaded (Box 1).

Box 1. Install learnMET
> devtools::install_github("cjubin/learnMET")
> library(learnMET)

Dependencies are automatically installed or updated when executing the command above.

Real multi-environment trial datasets
Three toy datasets are included with the learnMET package to illustrate how input data should be provided by the user and how the different functionalities of the package can be utilized.

Rice datasets
The datasets were obtained from the INIA's Rice Breeding Program (Uruguay) and were used in previous studies (Monteverde et al. 2018, 2019). We used phenotypic data for three traits from two breeding populations of rice (indica, composed of 327 elite breeding lines; and japonica, composed of 320 elite breeding lines). The two populations were evaluated at a single location (Treinta y Tres, Uruguay) across multiple years (2010–2012 for indica and 2009–2013 for japonica) and were genotyped using genotyping-by-sequencing (GBS) (Monteverde et al. 2019). ECs, characterizing three developmental stages throughout the growing season, were directly available. More details about the dataset are given in Monteverde et al. (2018).

Maize datasets
A subset of phenotypic and genotypic datasets, collected and made available by the G2F initiative (www.genomes2fields.org), were integrated into learnMET. Hybrid genotypic data were computed in silico based on the GBS data from inbred parental lines. For more information about the original datasets, please refer to AlKhalifah et al. (2018) and McFarland et al. (2020). In total, phenotypic data collected from 22 environments, covering 4 years (2014–2017) and 6 different locations in American states and Canadian provinces, are included in the package.

Running learnMET
learnMET can be implemented as a three-step pipeline. These steps are described next.

Step 1: specifying input data and processing parameters
The first function in the learnMET pipeline is create_METData() (Box 2).

Box 2. Integration of input data in a METData list object
Case 1: ECs directly provided by the user
> library(learnMET)
> data(geno_indica)
> data(map_indica)
> data(pheno_indica)
> data(info_environments_indica)
> data(env_data_indica)
> METdata_indica <- create_METData(
    geno = geno_indica,
    map = map_indica,
    pheno = pheno_indica,
    climate_variables = climate_variables_indica,
    info_environments = info_environments_indica,
    compute_climatic_ECs = FALSE,
    path_to_save = "/learnMET_analyses/indica")

Case 2: daily climate data automatically retrieved and ECs calculated via the package
> data(geno_G2F)
> data(pheno_G2F)
> data(map_G2F)
> data(info_environments_G2F)
> data(soil_G2F)
> METdata_g2f <- create_METData(
    geno = geno_G2F,
    pheno = pheno_G2F,
    map = map_G2F,
    climate_variables = NULL,
    raw_weather_data = NULL,
    compute_climatic_ECs = TRUE,
    info_environments = info_environments_G2F,
    soil_variables = soil_G2F,
    path_to_save = "/learnMET_analyses/G2F")
Note: a code example using in-field daily weather data is provided at https://cjubin.github.io/learnMET/articles/vignette_getweatherdata.html

The user must provide genotypic and phenotypic data, as well as basic information about the field experiments (e.g. longitude, latitude, planting, and harvest date). Missing genotypic data should be imputed beforehand. Climate covariables can be directly provided as day-interval-aggregated variables, using the argument climate_variables.
Alternatively, in order to compute weather-based covariables based on daily weather data, the user can set the compute_climatic_ECs argument to TRUE, and two possibilities are given. The first one is to provide raw daily weather data (with the raw_weather_data argument), which will undergo a quality control with the generation of an output file with flagged values. The second possibility, if the user does not have weather data available from measurements (e.g. from an in-field weather station), is the retrieval of daily weather records from NASA's Prediction of Worldwide Energy Resources (NASA POWER) database (https://power.larc.nasa.gov/), using the package nasapower (Sparks 2018). Spatiotemporal information contained in the info_environments argument is required. Note that the function also checks which environments are characterized by in-field weather data in the raw_weather_data argument, in order to retrieve satellite-based weather data only for the remaining environments without in-field weather stations. An overview of the pipeline is provided in Fig. 1.

Some covariates are additionally computed based on the daily weather data, such as the vapor pressure deficit or the reference evapotranspiration using the Penman-Monteith (FAO-56) equation. The aggregation of daily information into day-interval-based values is also carried out within this function. Four methods are available and should be specified with the argument method_ECs_intervals: (1) default: use of a definite number of intervals across all environments (i.e. the window length varies according to the duration of the growing season); (2) use of day-windows of fixed length (i.e. each window spans a given number of days, which remains identical across environments), which can be adjusted by the user; (3) use of specific day intervals for each environment, provided by the user, which should correspond to observed or assumed relevant phenological intervals; and (4) intervals based on the estimated crop growth stage within each environment, using accumulated growing degree-days in degrees Celsius.
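As a minimal illustration of the day-window logic (here the fixed-length option with 10-day windows; the data and variable names are ours, not the package's internals), daily records can be collapsed into interval-based ECs with base R:

# Illustrative sketch: aggregating daily weather into nonoverlapping 10-day
# windows for one environment (not the package's internal implementation).
set.seed(2)
daily <- data.frame(day = 1:120, tmax = rnorm(120, mean = 28, sd = 4))
window <- ceiling(daily$day / 10)            # window index: 1 to 12
ec_tmax <- tapply(daily$tmax, window, mean)  # one aggregated EC per window
names(ec_tmax) <- paste0("mean_tmax_w", seq_along(ec_tmax))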
Besides weather-based information, soil characterization for each environment can also be provided via the soil_variables argument. The output of create_METData() is a list object of class METData, required as input for all other functionalities of the package.

Fig. 1. Overview of the pipeline regarding integration of weather data using the function create_METData() within the learnMET package. The blue circle signals the first step of the process, when the function is initially called. The blue boxes indicate how the arguments of the function should be given, according to the type of datasets available to the user. The green boxes indicate a task which is run in the pipeline via internal functions of the package. The red circle signals the final step, when the METData object is created and contains environmental covariates. Details on the quality control tests implemented on daily weather data are provided at https://cjubin.github.io/learnMET/reference/qc_raw_weather_data.html, and on the methods to build ECs based on aggregation of daily data at https://cjubin.github.io/learnMET/reference/get_ECs.html.
Machine learning-based models implemented
Different machine learning-based regression methods are provided as S3 classes in an object-oriented programming style. These methods are called within the pipeline of the predict_trait_MET_cv() function, which is presented in the following section. In particular, the XGBoost gradient boosting library (Chen and Guestrin 2016), the Random Forest algorithm (Breiman 2001), stacked ensemble models with Lasso regularization as meta-learners (Van der Laan et al. 2007), and multilayer perceptrons (MLP) using Keras (Chollet et al. 2015) are implemented as prediction methods. In this section, we briefly present how these machine learning algorithms work.

Gradient-boosted decision trees (GBDT) can be seen as an additive regression model, where the final model is an ensemble of weak learners (i.e. regression trees in this case), in which each base learner is fitted in a forward sequential manner (Friedman 2001). Considering a certain loss function (e.g. mean-squared error for regression), a new tree is fitted to the residuals of the prior model (i.e. an ensemble of trees) to minimize this loss function. Then, the previous model is subsequently updated with the current model. From this definition, it becomes clear that GBDT and Random Forest models strongly differ from each other, since for GBDT, trees are built conditional on past trees, and the trees contribute unequally to the final model (Kuhn and Johnson 2013).

In contrast, in Random Forest algorithms, trees are created independently from each other, and results from each tree are only combined at the end of the process. The concept of GBDT was originally developed by Friedman (2001). In learnMET, a set of prediction models, denoted xgb_reg and rf_reg, is proposed that use the XGBoost algorithm or the Random Forest algorithm, respectively, with different input variables.

An MLP consists of one input layer, one or more hidden layers, and one output layer. Each layer, with the exception of the final output layer, includes a bias neuron (i.e. a constant value that acts like the intercept in a linear equation and is used to adjust the output) and is fully connected to the next layer. Here, the first hidden layer receives the marker genotypes and the ECs as input, computes a weighted linear summation of these inputs (i.e. z = W⊤X + b, where X represents the input features, W⊤ the vector of weights, and b the bias), and transforms the latter with a nonlinear activation function f(z), yielding the output of the given neuron. In the next hidden layers, each neuron (also named node) in one layer connects with a given weight to each neuron in the consecutive layer. The last hidden layer is generally connected with a linear function to the output layer, which consists of a single node. In MLP, learning is done via backpropagation: the network makes a prediction for each training instance, calculates the error associated with this prediction, estimates the error contribution from each connection at each hidden layer by iterating backward from the last layer (reverse pass), and finally changes the connection weights to decrease this error, usually using a gradient descent step (Géron 2019). For more details about deep learning methods in genomic prediction, we refer to the review written by Pérez-Enciso and Zingaretti (2019). In learnMET, a set of prediction models named DL_reg is proposed that apply MLP models with different input variables.
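The forward computation described above can be made explicit in a few lines; this sketch is our own (arbitrary dimensions, ReLU activation) and is not the package's Keras-based implementation:

# Illustrative forward pass of a small MLP with one hidden layer.
set.seed(3)
x <- rnorm(10)                                      # input features (markers, ECs)
W1 <- matrix(rnorm(5 * 10), 5, 10); b1 <- rnorm(5)  # hidden layer weights and bias
W2 <- matrix(rnorm(5), 1, 5); b2 <- rnorm(1)        # linear output layer
relu <- function(z) pmax(z, 0)                      # nonlinear activation f(z)
h <- relu(W1 %*% x + b1)                            # z = Wx + b, then f(z)
y_hat <- as.numeric(W2 %*% h + b2)                  # single output node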
Stacked models can be understood as an ensemble method that exploits the capabilities of many well-working models (called base learners) on a classification or regression task. The theoretical background of this method was originally proposed by Breiman (1996), and further developed by Van der Laan et al. (2007). In the first step, different individual base learners are fitted to the same training set resamples (typically generated via CV), potentially using different sets of predictor variables or different hyperparameter settings. Then, the predictions of the base learners are used as input to predict the output by fitting a regularization method, such as Lasso, on the cross-validated predictions. Hence, the final model has learned how to combine the first-level predictions of the base learners, and this stacked ensemble is expected to achieve similar or better results than any of the base learners (Van der Laan et al. 2007). This also implies that some weak learners, trained in the first stage, are generally excluded by variable selection from the resulting ensemble model if their predictions are highly correlated with other models, or irrelevant for predicting the trait of interest. In learnMET, prediction models named stacking_reg apply stacked ensemble models with different base learners and input variables. For instance, stacking_reg_3 combines a support vector machine regression model fitted to the ECs, an elastic net model fitted to the SNP data, and an XGBoost model using as features the 40 genomic-based PCs and the ECs. The stacked model was designed to embrace individual learners as diverse as possible, in order to improve the likelihood that the predictions of the different models are different from each other, and that the meta-learning algorithm really benefits from combining these first-level predictions. Regularized regression methods are widely used for genomic selection (Zou and Hastie 2005; de los Campos et al. 2013), hence our choice to incorporate Elastic Net as an individual learner to estimate the SNP effects.
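The two stages (cross-validated base-learner predictions, then a Lasso meta-learner) can be sketched compactly. This toy example is ours, uses lm() base learners in place of the richer learners of the stacking_reg models, and assumes the glmnet package is installed:

# Illustrative two-stage stacking sketch (not the package's stacks-based code).
library(glmnet)
set.seed(4)
n <- 200
X1 <- matrix(rnorm(n * 5), n, 5)   # e.g. ECs for learner 1
X2 <- matrix(rnorm(n * 5), n, 5)   # e.g. marker-derived features for learner 2
y <- X1[, 1] + 0.5 * X2[, 2] + rnorm(n)
folds <- sample(rep(1:5, length.out = n))
Z <- matrix(NA, n, 2)              # cross-validated first-level predictions
for (k in 1:5) {
  tr <- folds != k
  m1 <- lm(y[tr] ~ X1[tr, ]); m2 <- lm(y[tr] ~ X2[tr, ])
  Z[!tr, 1] <- cbind(1, X1[!tr, ]) %*% coef(m1)
  Z[!tr, 2] <- cbind(1, X2[!tr, ]) %*% coef(m2)
}
meta <- cv.glmnet(Z, y, alpha = 1) # Lasso meta-learner on stacked predictions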
Step 2: model evaluation through cross-validation
The second function in a typical workflow is predict_trait_MET_cv() (Box 3). The goal of this function is to assess a given prediction method with a specific CV scenario that mimics concrete plant breeding situations.

Box 3. Evaluation of a prediction method using a CV scheme (i.e. METData object with phenotypic data)
> res_cv0_indica <- predict_trait_MET_cv(
    METData = METdata_indica,
    trait = "GC",
    prediction_method = "xgb_reg_1",
    cv_type = "cv0",
    cv0_type = "leave-one-year-out",
    seed = 100,
    path_folder = "/project1/indica_cv_res/cv0")

When predict_trait_MET_cv() is executed, a list of training/test splits is constructed according to the CV scheme chosen by the user. Each training set in each sub-element of this list is processed (e.g. standardization and removal of predictors with null variance, feature extraction based on principal component analysis), and the corresponding test set is processed using the same transformations. Performance metrics are computed on the test set, such as the Pearson correlation between predicted and observed phenotypic values (always calculated within the same environment, regardless of how the test sets are defined according to the different CV schemes), and the root mean square error. Analyses are fully reproducible given that the seed and tuned hyperparameters are stored with the output of predict_trait_MET_cv(). Note that, if one wants to compare models using the same CV partitions, specifying the seed and modifying only the model would be sufficient.

The function applies a nested CV to obtain an unbiased generalization performance estimate. After splitting the complete dataset using an outer CV partition (based on either CV1, CV2, CV0, or CV00 prediction problems), an inner CV scheme is applied to the outer training dataset for optimization of hyperparameters. Subsequently, the best hyperparameters are selected and used to train the model using all training data. Model performance is then evaluated based on the predictions of the unseen test data using this trained model. This procedure is repeated for each training-test partition of the outer CV assignments. Table 1 shows the different arguments that can be adjusted when executing the CV evaluation.

Note that the classes we developed for preprocessing data and for fitting machine learning-based methods use functions from the tidymodels collection of R packages for machine learning (Kuhn and Wickham 2020), such as Bayesian optimization to tune hyperparameters (function tune_bayes()) or the package stacks. For models based on XGBoost, the number of boosting iterations, the learning rate, and the depth of trees represent important hyperparameters that are automatically tuned. Ranges of hyperparameter values are predefined based on expert knowledge. Bayesian optimization techniques use a surrogate model of the objective function in order to select better hyperparameter combinations based on past results (Shahriari et al. 2016). As more combinations are assessed, more data become available from which this surrogate model can learn to sample new combinations from the hyperparameter space that are more likely to yield an improvement. This technique allows a reduction of the number of model settings tested during the hyperparameter tuning.
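For orientation, these three hyperparameters correspond to the nrounds, eta, and max_depth arguments of the xgboost R package. The exhaustive grid below is a simplified stand-in (our own sketch, on simulated data, assuming xgboost is installed) for the Bayesian search that learnMET performs internally:

# Illustrative grid evaluation of XGBoost hyperparameters (learnMET itself
# explores such ranges with tune_bayes(), not a full grid).
library(xgboost)
set.seed(5)
X <- matrix(rnorm(200 * 20), 200, 20); y <- rnorm(200)
grid <- expand.grid(eta = c(0.05, 0.3), max_depth = c(3, 6), nrounds = c(50, 200))
cv_rmse <- apply(grid, 1, function(g) {
  res <- xgb.cv(params = list(eta = g["eta"], max_depth = g["max_depth"],
                              objective = "reg:squarederror"),
                data = xgb.DMatrix(X, label = y), nrounds = g["nrounds"],
                nfold = 5, verbose = FALSE)
  min(res$evaluation_log$test_rmse_mean)
})
best <- grid[which.min(cv_rmse), ]  # best learning rate, depth, and iterations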
Table 1. Description of the main arguments used with the function predict_trait_MET_cv().

METData: An object created by the initial function of the package, create_METData().
trait: Name of the trait to predict.
prediction_method: String indicating the prediction method to apply.
lat_lon_included: Logical to use longitude and latitude as predictor variables. FALSE by default.
yr_included: Logical to use the year effect as a dummy variable. FALSE by default.
cv_type: String indicating the CV scheme to use among "cv0" (prediction of genotypes in new environments), "cv00" (prediction of new genotypes in new environments), "cv1" (prediction of new genotypes), or "cv2" (prediction of incomplete field trials). Default is "cv0".
cv0_type: String indicating the type of cv0 scenario, among "leave-one-environment-out", "leave-one-site-out", "leave-one-year-out", and "forward-prediction". Default is "leave-one-environment-out".
nb_folds_cv1: Integer for the number of folds to use in the cv1 scheme, if selected.
repeats_cv1: Integer for the number of repeats in the cv1 scheme, if selected.
nb_folds_cv2: Integer for the number of folds to use in the cv2 scheme, if selected.
repeats_cv2: Integer for the number of repeats in the cv2 scheme, if selected.
include_env_predictors: Logical to indicate if ECs should be used in predictions. TRUE by default.
list_env_predictors: Vector of character strings with the names of the environmental predictors which should be used in predictions. NULL by default, which means that all environmental predictor variables are used.
seed: Integer with the seed value. Default is NULL, which implies that a random seed is generated, used in the other stages of the pipeline, and given as output for reproducibility.
save_processing: Logical to save the processing steps used to build the model in an RDS file. Default is FALSE.
path_folder: String to indicate the full path where the RDS file with results and plots generated during the analysis should be saved.
num_pcs: Optional argument. Integer to indicate the number of PCs to derive from the genotype matrix or from the genomic relationship matrix (encouraged to speed up CV with large datasets).
save_model: Logical indicating whether the fitted model for each training-test partition should be saved. Default is FALSE.

Extracting evaluation metrics from the output
Once a model has been evaluated with a CV scheme, various results can be extracted from the returned object, as shown in Box 4, and plots for visualization of results are also saved in the path_folder.

Box 4. Extraction of results from returned object of class met_cv
# Extract predictions for each test set in the CV scheme:
> pred_2010 <- res_cv0_indica$list_results_cv[[1]]$prediction_df
> pred_2011 <- res_cv0_indica$list_results_cv[[2]]$prediction_df
> pred_2012 <- res_cv0_indica$list_results_cv[[3]]$prediction_df
# The length of the list_results_cv sub-element is equal to the number of train/test set partitions.
# Extract Pearson correlation between predicted and observed values for 2010:
> cor_2010 <- res_cv0_indica$list_results_cv[[1]]$cor_pred_obs
# Extract root mean square error between predicted and observed values for 2011:
> rmse_2011 <- res_cv0_indica$list_results_cv[[2]]$rmse_pred_obs
# Get the seed used:
> seed <- res_cv0_indica$seed_used

Step 3: prediction of performance for a new test set
The third module in the package aims at implementing predictions for unobserved configurations of genotypic and environmental predictors using the function predict_trait_MET() (Box 5). The user needs to provide a table of genotype IDs (e.g. names of new varieties) with their growing environments (i.e. year and location) using the argument pheno in the function create_METData(). Genotypic data of the selection candidates to test within this test set should all be provided using the geno argument. Regarding characterization of new environments, the user can either provide a table of environments with longitude, latitude, and growing season dates, or can directly provide a table of ECs that should be consistent with the ECs provided for the training set. Environmental variables for the unobserved test set should be provided or computed with the same aggregation method (i.e. same method_ECs_intervals) as for the training set. To build an appropriate model with learning parameters able to generalize well on new data, a hyperparameter optimization with CV is conducted on the entire training dataset when using the function predict_trait_MET().

This function can potentially be applied to harness historical weather data and to obtain predictions across multiple years at a set of given locations (de Los Campos et al. 2020), or to conjecture about the best selection candidates to assess in field trials at specific locations. However, we emphasize the importance of both environmental and genetic similarity between training and test sets. If the selection candidates within the test set are not strongly genetically related to the genotypes included in the training set, or if the climatic conditions experienced in the test set differ too much from the feature space covered within the training set, the prediction results might not be trustworthy for decision making.

The function analysis_predictions_best_genotypes() takes directly the output of predict_trait_MET() and can be used to visualize the predicted yield of the best performing genotypes at each of the locations across the years included in the test set.
Box 5. Prediction of new observations using a training set and a test set (i.e. phenotypic data not required)
# Create a training set composed of years 2014, 2015 and 2016:
> METdata_G2F_training <- create_METData(
    geno = geno_G2F,
    pheno = pheno_G2F[pheno_G2F$year %in% c(2014, 2015, 2016), ],
    map = map_G2F,
    climate_variables = NULL,
    compute_climatic_ECs = TRUE,
    et0 = T, # Possibility to calculate reference evapotranspiration with the package (if TRUE, elevation data should preferably be added as a column in info_environments)
    info_environments = info_environments_G2F[info_environments_G2F$year %in% c(2014, 2015, 2016), ],
    soil_variables = soil_G2F[soil_G2F$year %in% c(2014, 2015, 2016), ],
    path_to_save = "/project1/g2f_trainingset") # path where daily weather data and plots are saved

# Create a prediction set (same default method to compute ECs as above):
> METdata_G2F_new <- create_METData(
    geno = geno_G2F,
    pheno = as.data.frame(pheno_G2F[pheno_G2F$year %in% 2017, ] %>% dplyr::select(-pltht, -yld_bu_ac, -earht)),
    map = map_G2F,
    et0 = T,
    climate_variables = NULL,
    compute_climatic_ECs = TRUE,
    info_environments = info_environments_G2F[info_environments_G2F$year %in% 2017, ],
    soil_variables = soil_G2F[soil_G2F$year %in% 2017, ],
    path_to_save = "/project1/g2f_testset",
    as_test_set = T) # in order to provide only predictor variables (no phenotypic data available for the test set) in the pheno argument

# Fitting the model to the training set and predicting the test set:
> results_list <- predict_trait_MET(
    METData_training = METdata_G2F_training,
    METData_new = METdata_G2F_new,
    trait = "yld_bu_ac",
    prediction_method = "xgb_reg_1",
    use_selected_markers = F,
    save_model = TRUE, # save_model set to TRUE in order to retrieve variable importance subsequently
    lat_lon_included = F,
    year_included = F,
    num_pcs = 200,
    include_env_predictors = T,
    seed = 100,
    path_folder = "/project1/g2f_results_year_2017")

Interpreting ML models
Compared to parametric models, ML techniques are often considered as black-box implementations that complicate the task of understanding the importance of different factors (genetic, environmental, management, or their respective interactions) driving the phenotypic response. Therefore, various methods have recently been proposed to aid the understanding and interpretation of the output of ML models. Among these techniques, some are model-specific (Molnar 2022), in the sense that they are only appropriate for certain types of algorithms. For instance, the Gini importance or the gain-based feature importance measures can only be applied to tree-based methods (e.g. decision trees, Random Forests, gradient-boosted trees), since they calculate how much a predictor variable can reduce the sum of squared errors in the child nodes, compared to the parent node, across all splits for which this given predictor was used. Feature importances are in this case scaled between 0 and 100.

Other model-agnostic interpretation techniques have been developed that provide the advantage of being independent of the original machine learning algorithm applied, thereby allowing straightforward comparisons across models (Molnar 2022). After shuffling the values of a given predictor variable, the value of the loss function (e.g. root mean square error in regression problems), estimated using the predictions on the shuffled data and the observed values, can be used to obtain an estimate of the permutation-based variable importance. Fisher et al. (2019) formally defined the permutation importance for a variable j as follows: vip_diff^j = L(y, f̂(X_permuted)) − L(y, f̂(X_original)), where L(y, f̂(X)) is the loss function evaluating the performance of the model, X_original is the original matrix of predictor variables, and X_permuted is the matrix obtained after permuting the variable j in X_original. The reasoning behind this approach is that, if a predictor contributes strongly to a model's predictions, shuffling its values will result in increased error estimates. On the other hand, if the variable is irrelevant for the fitted model, it should not affect the prediction error. It is recommended to repeat the permutation process to obtain a more reliable average estimate of the variable importance (Fisher et al. 2019; Molnar 2022). Another interesting aspect of permutation-based variable importance is the possibility to calculate it using either the training set or the unused test set. Computing variable importance using unseen data is useful to evaluate whether the explanatory variables, identified as relevant for prediction during model training, are truly important to deliver accurate predictions, and whether the model does not overfit. However, in the latter case, one needs to ensure that the training and test set are sufficiently related. New data might behave very differently from the data used for training without implying that the trained model is fundamentally wrong. The function variable_importance_split() enables retrieving variable importance, either with a model-specific method (via the package vip proposed by Greenwell et al. 2020), when available, or based on a permutation-based method (argument type, see Box 6); the calculation is made by default using the training set, but can be done for the test set by setting the argument unseen_data to TRUE.

Accumulated local effects (ALE) plots, also model-agnostic, allow examining the influence of a given predictor variable on the model prediction, conditional on the predictor value (Apley and Zhu 2020). Compared to partial dependence (PD) plots, they provide the advantage of addressing the bias that emerges when features are correlated. While predictions are computed over the marginal distribution of predictor variables in the case of PD plots (i.e. meaning that predictions of unrealistic instances are considered), ALE plots offer a solution to this issue by considering the conditional distribution, thus avoiding the use of predictions of unrealistic training observations. To build an ALE plot, the range of the explanatory variable is first split into equally sized small windows, such as quantiles. For each window, the ALE method only considers observations that show for this feature a value falling within the interval. Then, it computes model predictions for the upper and lower limits of the interval for these data instances, and calculates the difference in predictions. The changes in predictions are averaged within each interval, which allows blocking the impact of other features. These average effects are then accumulated across all intervals and centered at 0. The function ALE_plot_split() yields the ALE plot for a given predictor variable. An example is provided in Box 6.
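Before the package-level calls shown in Box 6, it may help to see the permutation formula itself in code. This standalone base-R sketch (our own, with a linear model standing in for f̂) computes vip_diff for each predictor:

# Illustrative base-R implementation of permutation-based importance
# (within learnMET this is handled by variable_importance_split()).
set.seed(6)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
y <- 2 * X$x1 + rnorm(100)               # only x1 is informative
fit <- lm(y ~ ., data = X)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
base_loss <- rmse(y, predict(fit, X))
vip_diff <- sapply(names(X), function(j) {
  Xp <- X
  Xp[[j]] <- sample(Xp[[j]])             # permute variable j
  rmse(y, predict(fit, Xp)) - base_loss  # loss increase = importance
})
# vip_diff is large for x1 and near zero for the irrelevant x2.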
Box 6. Retrieving variable importance using the fitted model and the training data
> fitted_split <- results_list$list_results[[1]]

# Model-specific: variable importance based on the gain as importance metric from the XGBoost model (via the vip package)
> variable_importance <- variable_importance_split(
    object = fitted_split,
    path_plot = "/project1/variable_imp_trset",
    type = "model_specific")

# Model-agnostic: variable importance based on 10 permutations
> variable_importance <- variable_importance_split(
    object = fitted_split,
    path_plot = "/project1/variable_imp_trset",
    type = "model_agnostic",
    permutations = 10)

# Model-agnostic: accumulated local effects plot
> ALE_plot_split(fitted_split,
    path_plot = "/project1/ale_plots",
    variable = "freq_P_sup10_2")

Results and discussion
To illustrate the use of learnMET with MET datasets, we provide here two example pipelines, both of which are available in the official package documentation. The first one demonstrates an implementation that requires no user-provided weather data, while the second pipeline shows prediction results obtained based on user-provided environmental data.

Retrieving meteorological data from the NASA POWER database for each environment
When running the commands for step 1 (Box 2, Case 2) on the maize dataset, a set of weather-based variables (see the documentation of the package) is automatically calculated using weather data retrieved from the NASA POWER database. By default, the method used to compute ECs uses a fixed number of day-windows (10) that span the complete growing season within each environment. This optional argument can be modified via the argument method_ECs_intervals (detailed information about the different methods can be found at https://cjubin.github.io/learnMET/reference/get_ECs.html). The function summary() provides a quick overview of the elements stored and collected in this first step of the pipeline (Box 7).

Box 7. Summary method for class METData
> summary(METdata_g2f)

Clustering analyses, which can help to identify groups of environments with similar climatic conditions and to identify outliers, were generated based on (a) only climate data; (b) only soil data (if available); and (c) all environmental variables together, for a range of values from K = 2 to 10 clusters (Fig. 2).
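A minimal standalone sketch of such a clustering (our own, on simulated standardized ECs; the package produces comparable diagnostics automatically) could look as follows:

# Illustrative K-means clustering of environments from scaled ECs (base R).
set.seed(7)
ecs <- matrix(rnorm(22 * 15), nrow = 22)   # toy matrix: 22 environments x 15 ECs
ecs_scaled <- scale(ecs)
wss <- sapply(2:10, function(k)            # within-cluster sum of squares per K
  kmeans(ecs_scaled, centers = k, nstart = 25)$tot.withinss)
clusters <- kmeans(ecs_scaled, centers = 4, nstart = 25)$cluster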

Benchmarking two prediction methods from learnMET and a linear reaction norm model
Phenotypic traits were predicted by the reaction norm model proposed by Jarquín et al. (2014), thereafter denoted G-W-G×W, which accounts for the random linear effects of the molecular markers (G), of the environmental covariates (W), and of the interaction term (G×W), under the following assumptions:

y_ij = μ + g_i + w_j + gw_ij + e_ij,

with g ~ N(0, G σ²_g), where G = XX′/p (with p being the number of SNPs and X the scaled and centered marker matrix); w ~ N(0, Ω σ²_w), where Ω = WW′/q (with q being the number of ECs and W the scaled and centered matrix that contains the ECs); gw ~ N(0, [Z_g G Z′_g] ∘ Ω σ²_gw), where ∘ denotes the Hadamard product (cell-by-cell product) and Z_g is the incidence matrix linking phenotypic records to genotypes; and e_ij ~ iid N(0, σ²_e).

For additional details about the benchmark model, we refer to the original publication of Jarquín et al. (2014). We implemented this model using BGLR (Pérez and de Los Campos 2014), for which the MCMC algorithm was run for 20,000 iterations, with the first 2,000 iterations removed as burn-in, using a thinning of 5.

Two prediction models proposed in learnMET were tested: (1) xgb_reg_1, which is an XGBoost model that uses a certain number of principal components (PCs) derived from the marker matrix, together with ECs, as features; and (2) stacking_reg_3. Although computationally more expensive than parametric methods, we paid attention to reasonable computational time (e.g. a maximum of 13.3 hours to fit the stacking_reg_3 model to n = 4,587 training instances with 10 CPUs).

We conducted a forward CV0 scheme, meaning that future years were predicted using only past years as the training set. For the rice datasets, at least two years of data were used to introduce variation in the EC matrix characterizing the training set (only one location was tested each year). Year, location, or year-location effects were not incorporated in any of the linear and machine learning models, because we focused our evaluation on how the different models could efficiently capture the effects of SNPs and ECs, and of SNP × EC interaction effects.

Results from the benchmarking approach are presented in Figs. 3 and 4. We have observed that the machine learning models are competitive with the linear reaction norm approach and tend to outperform it, albeit not consistently, as the training set size increases. Applied to small training set sizes, sophisticated prediction models are likely not able to capture informative patterns related to SNP × EC interactions, and linear models perform better. Similarly, the root mean square error was generally reduced with the machine learning methods as the training set increased (Fig. 4). Machine learning also performed better with the G2F data, which integrated multiple locations per year and was therefore larger and probably more relevant to learn G × E patterns than the rice dataset. Therefore, we encourage users to first evaluate whether their datasets are sufficiently large to leverage the potential of the advanced techniques proposed in this package and whether the latter provide satisfying predictive abilities in CV settings.
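For readers who want to reproduce the benchmark's structure, a minimal sketch of the G-W-G×W model in BGLR is given below; the data are simulated placeholders and the kernel construction follows the definitions above (our own sketch, not the authors' published benchmarking script):

# Illustrative reaction norm benchmark with BGLR (assumes BGLR is installed).
library(BGLR)
set.seed(8)
n_geno <- 100; n_env <- 5; p <- 500; q <- 20
X <- scale(matrix(rbinom(n_geno * p, 2, 0.3), n_geno, p))  # marker matrix
W <- scale(matrix(rnorm(n_env * q), n_env, q))             # EC matrix
G <- tcrossprod(X) / p                                     # G = XX'/p
Omega <- tcrossprod(W) / q                                 # Omega = WW'/q
ids <- rep(1:n_geno, n_env); envs <- rep(1:n_env, each = n_geno)
Zg <- model.matrix(~ factor(ids) - 1)                      # genotype incidence
Ze <- model.matrix(~ factor(envs) - 1)                     # environment incidence
K_G <- Zg %*% G %*% t(Zg)
K_W <- Ze %*% Omega %*% t(Ze)
K_GW <- K_G * K_W                                          # Hadamard product
y <- rnorm(n_geno * n_env)                                 # placeholder phenotypes
fit <- BGLR(y = y,
            ETA = list(list(K = K_G, model = "RKHS"),
                       list(K = K_W, model = "RKHS"),
                       list(K = K_GW, model = "RKHS")),
            nIter = 20000, burnIn = 2000, thin = 5, verbose = FALSE)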
Model interpretation from a gradient-boosted model fitted to the maize dataset
Figure 5a illustrates the permutation-based approach on the maize dataset, and Fig. 5, b and c describe how two environmental variables (sum of photothermal time and frequency of rainfall) influence the average prediction of maize grain yield using ALE plots. We should stress that the size of the dataset employed here is likely too small to make real inferences about the relationship between the predictor variables and the outcome (sharp drops are observed at some feature values). Our goal here is essentially to illustrate how these functions can be used to gain insights into a model's predictions using the package.

Fig. 2. Output results from the create_METData() function. a) Cluster analysis using the K-means algorithm (K = 4) to identify groups of similar environments based on environmental data. b) Total within-cluster sum of squares as a function of the number of clusters. c) Average Silhouette score as a function of the number of clusters. These methods can help users decide on the optimal number of clusters. Data used here are a subset of the Genomes to Fields maize dataset (AlKhalifah et al. 2018; McFarland et al. 2020). Weather data were retrieved from the NASA POWER database via the package nasapower (Sparks 2018). Plots are saved in the directory provided in the path_to_save argument.

Fig. 3. Correlations between predicted and observed values for a forward prediction scenario using two machine learning models and a linear reaction norm approach. a) Three traits predicted for two rice populations. Each year is predicted based on at least two past years of phenotypic data (one single location). b) Grain yield predicted for the G2F dataset. GC (rice data), percentage of chalky kernels; GY (rice data), grain yield (kg/ha); PHR (rice data), percentage of head rice recovery; GY (G2F), bushels per acre.
Fig. 4. Root mean square error between predicted and observed values for a forward prediction scenario using two machine learning models and a linear reaction norm approach. a) Three traits predicted for two rice populations. Each year is predicted based on at least two past years of phenotypic data (one single location). b) Grain yield predicted for the G2F dataset. GC (rice data), percentage of chalky kernels; GY (rice data), grain yield (kg/ha); PHR (rice data), percentage of head rice recovery; GY (G2F), bushels per acre.

Concluding remarks and future developments
learnMET was developed to make the integration of complex datasets, originating from various data sources, user-friendly. The package provides flexibility at various levels: (1) regarding the use of weather data, with the possibility to provide on-site weather station data, to retrieve external weather data, or a mix of both if on-site data are only partially available; (2) regarding how time intervals for aggregation of daily weather data are defined; (3) regarding the diversity of nonlinear machine learning models proposed; and (4) regarding options to provide manually specified subsets of predictor variables (for instance, environmental features via the argument list_env_predictors in predict_trait_MET_cv()).

To allow analyses on larger datasets, future developments of the package should include parallel processing to improve the scalability of the package and to best harness high-performance computing resources. Improvements and extensions of stacked models and deep learning models are also intended, as we did not investigate in depth the network architecture (e.g. number of nodes per layer, type of activation function, type of optimizer), nor other types of deep learning models that might perform better (e.g. convolutional neural networks). Finally, the package could be extended to allow genotype-specific ECs, because the timing of developmental stages differs across genotypes (e.g. due to variability in earliness) and should ideally be taken into account.
Fig. 5. Model interpretation methods applied on the model fitted to a subset of the G2F dataset from years 2014 to 2016 (17 environments included) with xgb_reg_1 for the trait grain yield. a) Model-agnostic variable importance using 10 permutations. The top 40 most important predictor variables are displayed, and the table containing results across all permutations for all variables is returned. ALE plots for b) the sum of photothermal time during the 1st day-interval of the growing season, and c) the frequency of days with an amount of precipitation above 10 mm during the 2nd day-interval of the growing season. Tick marks indicate the unique values observed for the given covariate in the training set.
Data availability
The software is available on GitHub at https://github.com/cjubin/learnMET. Documentation and vignettes are provided at https://cjubin.github.io/learnMET/. All scripts used to obtain the results presented in this article can be found on GitHub at https://github.com/cjubin/learnMET/tree/main/scripts_publication.

Acknowledgments
This work used the Scientific Compute Cluster at GWDG, the joint data center of the Max Planck Society for the Advancement of Science (MPG) and the University of Göttingen. We acknowledge support by the Open Access Publication Funds of the Göttingen University. The authors thank the G2F Consortium for collecting data and making these publicly available. The authors are grateful to Eliana Monteverde for her useful input regarding the rice dataset, and also thank the National Institute of Agricultural Research (INIA-Uruguay) and the technical staff from the experimental station of Treinta y Tres (Uruguay) for collecting the data. In this work, data from the NASA POWER database were used. These data were obtained from the NASA Langley Research Center POWER Project, funded through the NASA Earth Science Directorate Applied Science Program.

Funding
Financial support for CCW was provided by KWS SAAT SE by means of a PhD fellowship. Additional financial support was provided by the University of Göttingen and by the Center for Integrated Breeding Research.

Conflicts of interest
None declared.

Literature cited
AlKhalifah N, Campbell DA, Falcon CM, Gardiner JM, Miller ND, Romay MC, Walls R, Walton R, Yeh C-T, Bohn M, et al. Maize genomes to fields: 2014 and 2015 field season genotype, phenotype, environment, and inbred ear image datasets. BMC Res Notes. 2018;11(1):1–5.
Apley DW, Zhu J. Visualizing the effects of predictor variables in black box supervised learning models. J R Stat Soc Series B Stat Methodol. 2020;82(4):1059–1086.
Bernardo R. Breeding for Quantitative Traits in Plants, Vol. 1. Woodbury (MN): Stemma Press; 2002.
Breiman L. Stacked regressions. Mach Learn. 1996;24(1):49–64.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Burgueño J, de los Campos G, Weigel K, Crossa J. Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 2012;52(2):707–719.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. 2016. p. 785–794.
Chollet F, et al. Keras; 2015. https://keras.io.
Costa-Neto G, Fritsche-Neto R, Crossa J. Nonlinear kernels, dominance, and envirotyping data increase the accuracy of genome-based prediction in multi-environment trials. Heredity. 2021;126(1):92–106.
Costa-Neto G, Galli G, Carvalho HF, Crossa J, Fritsche-Neto R. EnvRtype: a software to interplay enviromics and quantitative genomics in agriculture. G3 (Bethesda). 2021;11:jkab040.
Covarrubias-Pazaran G. Genome-assisted prediction of quantitative traits using the R package sommer. PLoS One. 2016;11(6):e0156744.
Crossa J, Martini JWR, Gianola D, Pérez-Rodríguez P, Jarquin D, Juliana P, Montesinos-López O, Cuevas J. Deep kernel and deep learning for genome-based prediction of single traits in multienvironment breeding trials. Front Genet. 2019;10:1168.
Cuevas J, Crossa J, Montesinos-López OA, Burgueño J, Pérez-Rodríguez P, de Los Campos G. Bayesian genomic prediction with genotype × environment interaction kernel models. G3 (Bethesda). 2017;7(1):41–53.
Cuevas J, Montesinos-López O, Juliana P, Guzmán C, Pérez-Rodríguez P, González-Bucio J, Burgueño J, Montesinos-López A, Crossa J. Deep kernel for genomic and near infrared predictions in multi-environment breeding trials. G3 (Bethesda). 2019;9(9):2913–2924.
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics. 2013;193(2):327–345.
de Los Campos G, Pérez-Rodríguez P, Bogard M, Gouache D, Crossa J. A data-driven simulation platform to predict cultivars' performances under uncertain weather conditions. Nat Commun. 2020;11(1):1–10.
Fisher A, Rudin C, Dominici F. All models are wrong, but many are useful: learning a variable's importance by studying an entire class of prediction models simultaneously. J Mach Learn Res. 2019;20:1–81.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232.
Géron A. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, UK Ltd; 2019.
Granato I, Cuevas J, Luna-Vázquez F, Crossa J, Montesinos-López O, Burgueño J, Fritsche-Neto R. BGGE: a new package for genomic-enabled prediction incorporating genotype × environment interaction models. G3 (Bethesda). 2018;8(9):3039–3047.
Greenwell BM, Boehmke BC, Gray B. Variable importance plots: an introduction to the vip package. R J. 2020;12(1):343.
Heslot N, Yang H-P, Sorrells ME, Jannink J-L. Genomic selection in plant breeding: a comparison of models. Crop Sci. 2012;52(1):146–160.
Jarquín D, Crossa J, Lacaze X, Du Cheyron P, Daucourt J, Lorgeou J, Piraux F, Guerreiro L, Pérez P, Calus M, et al. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor Appl Genet. 2014;127(3):595–607.
Jarquín D, da Silva CL, Gaynor RC, Poland J, Fritz A, Howard R, Battenfield S, Crossa J. Increasing genomic-enabled prediction accuracy by modeling genotype x environment interactions in Kansas wheat. Plant Genome. 2017;10:1–15.
Kuhn M, Johnson K. Applied Predictive Modeling, Vol. 26. Springer; 2013.
Kuhn M, Wickham H. Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles; 2020. https://CRAN.R-project.org/package=tidymodels.
McFarland BA, AlKhalifah N, Bohn M, Bubert J, Buckler ES, Ciampitti I, Edwards J, Ertl D, Gage JL, Falcon CM, et al. Maize genomes to fields (G2F): 2014–2017 field seasons: genotype, phenotype, climatic, soil, and inbred ear image datasets. BMC Res Notes. 2020;13(1):1–6.
McKinney BA, Reif DM, Ritchie MD, Moore JH. Machine learning for detecting gene-gene interactions. Appl Bioinformatics. 2006;5(2):77–88.
Molnar C. Interpretable Machine Learning - A Guide for Making Black Box Models Explainable, 2nd ed.; 2022. https://christophm.github.io/interpretable-ml-book.
Montesinos-López OA, Montesinos-López A, Luna-Vázquez FJ, Toledo FH, Pérez-Rodríguez P, Lillemo M, Crossa J. An R package for Bayesian analysis of multi-environment and multi-trait multi-environment data for genome-based prediction. G3 (Bethesda). 2019;9(5):1355–1369.
Monteverde E, Gutierrez L, Blanco P, Pérez de Vida F, Rosas JE, Bonnecarrère V, Quero G, McCouch S. Integrating molecular markers and environmental covariates to interpret genotype by environment interaction in rice (Oryza sativa L.) grown in subtropical areas. G3 (Bethesda). 2019;9(5):1519–1531.
Monteverde E, Rosas JE, Blanco P, Pérez de Vida F, Bonnecarrère V, Quero G, Gutierrez L, McCouch S. Multienvironment models increase prediction accuracy of complex traits in advanced breeding lines of rice. Crop Sci. 2018;58(4):1519–1530.
Pérez P, de Los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 2014;198(2):483–495.
Pérez-Enciso M, Zingaretti LM. A guide on deep learning for complex trait genomic prediction. Genes. 2019;10(7):553.
Rincent R, Malosetti M, Ababaei B, Touzy G, Mini A, Bogard M, Martre P, Le Gouis J, van Eeuwijk F. Using crop growth model stress covariates and AMMI decomposition to better predict genotype-by-environment interactions. Theor Appl Genet. 2019;132(12):3399–3411.
Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003;4(1):28.
Runcie DE, Qu J, Cheng H, Crawford L. MegaLMM: mega-scale linear mixed models for genomic predictions with thousands of traits. Genome Biol. 2021;22(1):1–25.
Shahriari B, Swersky K, Wang Z, Adams RP, de Freitas N. Taking the human out of the loop: a review of Bayesian optimization. Proc IEEE. 2016;104(1):148–175.
Sparks AH. nasapower: a NASA POWER global meteorology, surface solar energy and climatology data client for R; 2018.
Van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6(1).
Westhues CC, Mahone GS, da Silva S, Thorwarth P, Schmidt M, Richter J-C, Simianer H, Beissinger TM. Prediction of maize phenotypic traits with genomic and environmental predictors using gradient boosting frameworks. Front Plant Sci. 2021;12:699589.
Wickham H, Hester J, Chang W, Hester MJ. Package 'devtools'; 2021. https://cran.r-project.org/web/packages/devtools/index.html.
Xavier A, Muir WM, Rainey KM. bWGR: Bayesian whole-genome regression. Bioinformatics. 2019;36:1957–1959.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005;67(2):301–320.

Communicating editor: G. de los Campos
