 

The Plant Genome - Original Research

Genomic Prediction of Genotype × Environment Interaction Kernel Regression Models

 

This article in TPG: Vol. 9 No. 3 (Open Access)

    Received: Mar 07, 2016
    Accepted: July 22, 2016
    Published: September 22, 2016


    * Corresponding author(s): j.crossa@cgiar.org

doi:10.3835/plantgenome2016.03.0024
  1. Jaime Cuevas (a)
  2. José Crossa (b)*
  3. Víctor Soberanis (a)
  4. Sergio Pérez-Elizalde (c)
  5. Paulino Pérez-Rodríguez (c)
  6. Gustavo de los Campos (d)
  7. O. A. Montesinos-López (b)
  8. Juan Burgueño (b)

  (a) Universidad de Quintana Roo, Chetumal, Quintana Roo, México
  (b) Biometrics and Statistics Unit, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600 México DF, México
  (c) Colegio de Postgraduados, CP 56230, Montecillos, Edo. de México, México
  (d) Dep. of Epidemiology & Biostatistics, Michigan State Univ., 909 Fee Road, Room B601, East Lansing, MI 48824, USA

Abstract

In genomic selection (GS), genotype × environment interaction (G × E) can be modeled by a marker × environment interaction (M × E). The G × E may be modeled through a linear kernel or a nonlinear (Gaussian) kernel. In this study, we propose using two nonlinear Gaussian kernels: the reproducing kernel Hilbert space with kernel averaging (RKHS KA) and the Gaussian kernel with the bandwidth estimated through an empirical Bayesian method (RKHS EB). We performed single-environment analyses and extended them to account for G × E interaction (GBLUP-G × E, RKHS KA-G × E, and RKHS EB-G × E) in wheat (Triticum aestivum L.) and maize (Zea mays L.) data sets. In the single-environment analyses of the wheat and maize data sets, RKHS EB and RKHS KA had higher prediction accuracy than GBLUP for all environments. For the wheat data, the RKHS KA-G × E and RKHS EB-G × E models showed up to 60 to 68% higher prediction accuracy than the corresponding single-environment models for pairs of environments with positive correlations. For the wheat data set, the models with Gaussian kernels had accuracies up to 17% higher than that of GBLUP-G × E. For the maize data set, the prediction accuracy of RKHS EB-G × E and RKHS KA-G × E was, on average, 5 to 6% higher than that of GBLUP-G × E. The superiority of the Gaussian kernel models over the linear kernel is due to their more flexible kernels, which account for small, more complex marker main effects and marker-specific interaction effects.


Abbreviations

    BGLR, Bayesian generalized linear regression; CV2, cross-validation 2 design; DH, double-haploid; EB, empirical Bayesian; G × E, genotype × environment interaction; GBLUP, genomic best linear unbiased predictor; GBS, genotyping-by-sequencing; GS, genomic selection; KA, kernel averaging; RK, reproducing kernel; RKHS, reproducing kernel Hilbert space; SNP, single-nucleotide polymorphism; TRN, training; TST, testing

The seminal work of Meuwissen et al. (2001) showed that the prediction ability of complex traits was better when using whole-genome marker prediction than when using few markers. Results from simulation and empirical studies have shown that GS can improve prediction accuracy compared with conventional and pedigree selection (de los Campos et al., 2009, 2010; Crossa et al., 2010, 2011; Heslot et al., 2012; Pérez-Rodríguez et al., 2012). Several genomic Bayesian and non-Bayesian linear regression models have been developed and published, but most Bayesian regression alternatives differ only in their prior distributions (de los Campos et al., 2012). In genomic prediction, departures from linearity are the rule rather than the exception because complex cryptic interactions among genes (i.e., epistasis) and their interactions with the environment are part of the genetic composition of complex traits. These deviations from linearity can be addressed by semiparametric approaches such as RKHS regressions (Gianola et al., 2006, 2014); RKHS regression reduces the dimension of the parametric space and also captures small, complex interactions among markers. Morota and Gianola (2014) mentioned that genome prediction coupled with combinations of kernels might capture nonadditive variation.

The RKHS approach to GS uses markers to build a covariance structure matrix that is symmetric positive semidefinite of order n × n (where n is the number of lines), known as the reproducing kernel (RK) matrix, which depends on the markers and on the bandwidth parameter (h). This general approach requires choosing a RK, for example, the Gaussian kernel function $K(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp(-h\, d_{ii'}^2)$, where $\mathbf{x}_i$ and $\mathbf{x}_{i'}$ are the marker vectors of the ith and i′th individuals and $d_{ii'}^2 = \sum_{k=1}^{p}(x_{ik} - x_{i'k})^2$ is the squared Euclidean distance between their marker genotypes (Pérez-Rodríguez et al., 2012). Several studies have shown that Gaussian kernel methods improve the prediction of genetic values (Pérez-Elizalde et al., 2015) if the bandwidth parameter is correctly chosen. The bandwidth parameter can be selected based on (i) a cross-validation procedure, (ii) restricted maximum likelihood (Endelman, 2011), or (iii) an empirical Bayesian method such as the one proposed by Pérez-Elizalde et al. (2015). An efficient Bayesian method that minimizes the effect of an inadequate value of the bandwidth parameter when using a multikernel model is the RKHS KA proposed by de los Campos et al. (2010), where each kernel uses some predefined value of the bandwidth parameter and is weighted by its associated variance component.

The previous models are commonly used for data collected in one environment. However, in plant breeding, G × E plays an important confounding role when selecting candidates because the phenotypic performance of a genotype varies across environments. A common way to assess the extent of G × E in plant breeding and agricultural experiments is to estimate genetic correlations of line performance across environments; these correlations summarize the joint action of genes and environmental conditions. The magnitude and sign of the interaction between genes and environmental conditions are expected to vary across the genome. In the last decade, the use of molecular markers has facilitated the mapping of chromosome regions associated with phenotypic variability and has revealed the differential response of these chromosome regions across environments.

In genomic selection, most analyses using the GBLUP (VanRaden, 2007, 2008) model are single-environment analyses. Several studies have proposed GS models that accommodate G × E (Burgueño et al., 2012; Heslot et al., 2014; Jarquín et al., 2014) and show that modeling G × E can give substantial gains in genomic prediction accuracy (Zhang et al., 2014; Montesinos-López et al., 2015). Recently, López-Cruz et al. (2015) proposed a GBLUP prediction model that explicitly models interactions between all available markers and environments (M × E). In genomic prediction, linear models consider genetic values as linear combinations of marker effects, such that GBLUP with G × E is equivalent to GBLUP with M × E (López-Cruz et al., 2015). However, other nonlinear kernels (for example, Gaussian kernels) do not have this dual representation; thus, nonlinear kernels can only be represented as G × E models.

The main feature of the GBLUP-G × E (or GBLUP-M × E) interaction model is that it decomposes the marker effects into components that are common across environments (stability) and environment-specific deviations. In principle, the M × E model could help researchers detect which markers have effects that are stable across environments and which ones are responsible for G × E. In a recent study, Crossa et al. (2015) showed that the M × E model could be used as a shrinkage method and as a variable selection method assuming heterogeneity of within-environment variance. Results have shown that when the correlation between environments is positive, the GBLUP-G × E interaction model produces a significant increase in prediction accuracy as compared with the single-environment model or the across-environments models (López-Cruz et al., 2015).

In this study, we propose incorporating the Gaussian kernel method with the single-environment model and the G × E interaction model of López-Cruz et al. (2015). We compare the prediction accuracy of three kernel regression methods: (i) the linear kernel method of López-Cruz et al. (2015) (GBLUP for single environments and for GBLUP-G × E), (ii) the RKHS KA for single environments and for RKHS KA-G × E (de los Campos et al., 2010), and (iii) the Gaussian kernel with the bandwidth parameter (h) estimated by the empirical Bayesian method of Pérez-Elizalde et al. (2015) (RKHS EB for single environments and for RKHS EB-G × E). We illustrate the use of these three kernel methods in the single-environment and G × E interaction models in two data sets (maize and wheat) for the problem of predicting the performance of lines in environments where they have not been evaluated.


MATERIALS AND METHODS

Experimental Data

The experimental data used in this study comprise two multienvironment trials: one maize and one wheat. The maize data set includes 504 CIMMYT double-haploid (DH) maize lines evaluated in three environments and genotyped with 681,257 genotyping-by-sequencing (GBS) single-nucleotide polymorphisms (SNPs) (Crossa et al., 2013); the trait under study was grain yield. The wheat data set had 599 historical CIMMYT wheat lines evaluated in four environments and genotyped with 1717 diversity arrays technology markers (Crossa et al., 2010).

Maize Data Set

Phenotypic Data

A total of 504 CIMMYT DH maize lines were obtained by crossing and backcrossing eight parents to form 10 full-sib families of different sizes; some of these 10 full-sib families are related and form sets of related half-sib families. Each DH line from the 10 full-sib families was crossed to an elite single-cross hybrid of the opposite heterotic group to produce 504 testcrosses that were genotyped with GBS. The trait used in this study was grain yield (kg ha−1) evaluated in three optimum rainfed trials. The experimental field design in each of the three environments was an α-lattice incomplete block design with two replicates. Data were preadjusted using estimates of block and environment effects derived from a linear model that accounted for the incomplete block design within environment and for environmental effects.

Genotypic Data (Genotyping-by-Sequencing)

Initially, SNPs were filtered based on homozygosity in inbred lines, error rates were estimated based on biparental mapping populations, and linkage disequilibrium was based on mapping populations. Overall, this pipeline uncovered 681,257 SNPs in maize. The SNPs that segregated in this mapping population were used in this study. Source code and software for the pipeline are available at www.maizegenetics.net and sourceforge (http://sourceforge.net). Approximately 66% of the reads obtained could be mapped to a unique position in the genome and ∼80% of them were polymorphic. After filtering markers for minor allele frequency and missing values, a total of 158,281 SNPs were used.

Wheat Data Set

Phenotypic Data

This data set includes 599 wheat lines developed by the CIMMYT Global Wheat Program. Environments were grouped into four target sets of environments (E1–E4). The trait was grain yield (GY) and data were pre-adjusted by the experimental design used in each location.

Genotypic Data

Wheat lines were genotyped using 1717 diversity arrays technology markers (hereinafter generically referred to as markers) generated by Triticarte Pty. Ltd. (http://www.triticarte.com.au). These markers take on two values, denoting their presence (1) or absence (0). In this data set, the overall mean frequency of the allele coded as 1 was 0.561, with a minimum of 0.008 and a maximum of 0.987. Markers with an allele frequency lower than 0.05 or higher than 0.95 were removed. Missing genotypes were imputed using samples from the marginal distribution of marker genotypes. After marker editing, 1279 markers were retained.

Data Availability

Genotypic and phenotypic data of the two data sets used in this study can be found at the following link: http://hdl.handle.net/11529/10580.

Statistical Models

While the kernel is linear for the GBLUP, for the RKHS EB, it is Gaussian with bandwidth estimated by its posterior mode; for the RKHS KA, it is calculated as the weighted average of various kernels. For the G × E multienvironment model, the definition of the kernel matrix has two components: one for the marker main effect for all environments and the other for the marker effect in each environment (López-Cruz et al., 2015). Therefore, for the GBLUP G × E model, the linear kernel for the marker main effect (across all environments) and the linear kernel for the environment-specific marker effect are similar to those of López-Cruz et al. (2015) (GBLUP), whereas for the RKHS KA G × E and RKHS EB G × E models, different kernel matrices are described below.

Single-Environment Genomic Best Linear Unbiased Predictor Model

The parametric regression model for a single environment (j = 1, …, m environments) is defined as

$$\mathbf{y}_j = \mathbf{1}_{n_j}\mu_j + \mathbf{X}_j\boldsymbol{\beta}_j + \boldsymbol{\varepsilon}_j \quad [1]$$

where the vector $\mathbf{y}_j$ represents $n_j$ independent observations of the response variable (genotypic means) in the jth environment, $\mathbf{1}_{n_j}$ is a vector of ones of order $n_j$, $\mu_j$ is the overall mean of the jth environment, $\mathbf{X}_j$ is the matrix of the p centered and standardized molecular markers in the jth environment, $\boldsymbol{\beta}_j$ is the vector of random effects of the p markers in the jth environment, and $\boldsymbol{\varepsilon}_j$ is the vector of random errors in the jth environment with normal distribution and common variance $\sigma^2_{\varepsilon_j}$. We assume that the random effects of the markers are jointly distributed as a multivariate normal random variable, that is, $\boldsymbol{\beta}_j \sim N(\mathbf{0}, \sigma^2_{\beta_j}\mathbf{I})$, and that the effects of markers $\boldsymbol{\beta}_j$ and $\boldsymbol{\varepsilon}_j$ are independent. Also, if $\mathbf{u}_j = \mathbf{X}_j\boldsymbol{\beta}_j$, then model [1] for the jth environment can be written as

$$\mathbf{y}_j = \mathbf{1}_{n_j}\mu_j + \mathbf{u}_j + \boldsymbol{\varepsilon}_j \quad [2]$$

where $\mathbf{u}_j$ and $\boldsymbol{\varepsilon}_j$ are independent random variables with $\mathbf{u}_j \sim N(\mathbf{0}, \sigma^2_{u_j}\mathbf{K}_j)$ and $\boldsymbol{\varepsilon}_j \sim N(\mathbf{0}, \sigma^2_{\varepsilon_j}\mathbf{I})$, respectively, $\sigma^2_{u_j}$ is the variance of $\mathbf{u}_j$, and $\mathbf{K}_j$ is a symmetric matrix representing the covariance of the genetic values; when $\mathbf{K}_j = \mathbf{X}_j\mathbf{X}_j'/p$, model [2] is known as GBLUP (VanRaden, 2007, 2008). Note that under the conditions given above, models [1] and [2] are equivalent, with model [1] estimating the marker effects and model [2] estimating the genomic relationship by means of its linear kernel ($\mathbf{X}\mathbf{X}'$).
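
As an illustration of how model [2] can be fitted in practice, the following minimal R sketch builds the linear kernel and fits it with the BGLR package. This is not the Appendix A code; the objects `X_raw` (raw marker matrix) and `y` (adjusted genotypic means for one environment) are placeholder names, and the iteration settings simply match the 30,000 retained samples after a 5000-sample burn-in used in this study.

```r
library(BGLR)

# X_raw: n x p matrix of marker genotypes for one environment (placeholder name)
# y    : vector of n adjusted genotypic means for the same environment (placeholder name)
X <- scale(X_raw)                 # center and standardize the markers
G <- tcrossprod(X) / ncol(X)      # linear kernel K_j = XX'/p (genomic relationship matrix)

# Fit model [2] (GBLUP) as an RKHS regression with the linear kernel
fm <- BGLR(y = y,
           ETA = list(list(K = G, model = "RKHS")),
           nIter = 35000, burnIn = 5000, verbose = FALSE)

fm$yHat            # predicted values (intercept plus genomic effect)
fm$ETA[[1]]$varU   # posterior mean of the genomic variance component
fm$varE            # posterior mean of the residual variance
```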

Gaussian Kernel with Bayesian Estimation of the Bandwidth Parameter for Single Environments

The Gaussian kernel commonly used in genomic prediction is $K_j(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp(-h_j\, d_{ii'}^2/q_{0.05})$ (Pérez-Rodríguez et al., 2012), where $d_{ii'}^2$ is the squared marker-based distance between individuals i and i′ (i = 1, …, $n_j$) for the jth environment; we scaled the squared distance by its fifth percentile ($q_{0.05}$). This symmetric and positive semidefinite matrix can be interpreted as a correlation matrix between genotypes, where the rate of decay is controlled by the bandwidth parameter in the jth environment ($h_j$). Scaling the squared distances by the 0.05 quantile ($q_{0.05}$) constrains the values of $\mathbf{K}_j$, such that small values of $h_j$ (say 0.1) tend to generate a kernel matrix $\mathbf{K}_j$ of ones, whereas large values of $h_j$ (say 10) tend to produce diagonal matrices. These two extreme cases do not contribute information that can be used for prediction: for small values of $h_j$, the information provided is redundant with the intercept, and large values of $h_j$ lead to a random effect with a variance–covariance structure similar to that of the residuals (de los Campos et al., 2010). Therefore, intermediate values of $h_j$, such as those estimated by the method of Pérez-Elizalde et al. (2015), are desirable. With this method, $h_j$ is estimated by the mode of the posterior distribution $p(h_j|\mathbf{y}_j)$, which must be computed numerically. Below we give details of the posterior distribution $p(h_j|\mathbf{y}_j)$.

A Bayesian method to estimate the bandwidth parameter $h_j$ is through its posterior distribution $p(h_j|\mathbf{y}_j)$; however, this is time-consuming if the posterior of $h_j$ and the other parameters of model [2] is simulated simultaneously using Markov chain Monte Carlo (Gianola and van Kaam, 2008), or any other simulation method for the joint posterior, because of the need to recompute the kernel in each iteration. An alternative approach proposed by Pérez-Elizalde et al. (2015) is based on the integrated likelihood, in which nuisance parameters are eliminated by integration (Berger et al., 1999). The objective is to estimate the mode of the posterior distribution $p(h_j, \varphi_j|\mathbf{y}_j)$. This method therefore estimates the parameter $h_j$ with the mode of the distribution $p(h_j|\mathbf{y}_j, \varphi_j)$, whereas the other parameters are treated as nuisance parameters and eliminated through integration (the R [R Development Core Team, 2015] code in Box A1 [Appendix A] shows how to estimate $h_j$). The Gaussian kernel is then fully described, and model [2] can be fitted with the Bayesian generalized linear regression (BGLR) package (de los Campos and Pérez-Rodríguez, 2015).
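
The numerical estimation of the posterior mode of $h_j$ itself is implemented in Box A1 (Appendix A) and is not reproduced here; the sketch below only illustrates how, once a bandwidth value is available, the Gaussian kernel scaled by the 0.05 quantile of the squared distances can be built and passed to BGLR. Object names and the bandwidth value are placeholders.

```r
library(BGLR)

X   <- scale(X_raw)                          # centered, standardized markers (placeholder name)
D2  <- as.matrix(dist(X))^2                  # squared Euclidean distances between lines
q05 <- quantile(D2[upper.tri(D2)], 0.05)     # 0.05 quantile used to scale the distances

h <- 2.5                                     # bandwidth, e.g., the posterior mode from Box A1 (placeholder value)
K <- exp(-h * D2 / q05)                      # Gaussian kernel K[i, i'] = exp(-h * d^2 / q0.05)

fm <- BGLR(y = y,
           ETA = list(list(K = K, model = "RKHS")),
           nIter = 35000, burnIn = 5000, verbose = FALSE)
```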

Reproducing Kernel Hilbert Space with Kernel Averaging for Single Environments

The RKHS KA method was proposed by de los Campos et al. (2010), who used an extension of Eq. [2] such that the random effect $\mathbf{u}_j$ for the jth environment is modeled as the sum of independent random effects, that is, $\mathbf{u}_j = \mathbf{u}_{j1} + \dots + \mathbf{u}_{jr}$, such that each random effect $\mathbf{u}_{jl}$ (l = 1, …, r) follows a multivariate normal distribution with a vector of means equal to zero and a variance–covariance matrix $\sigma^2_{u_{jl}}\mathbf{K}_{jl}$, that is, $\mathbf{u}_{jl} \sim N(\mathbf{0}, \sigma^2_{u_{jl}}\mathbf{K}_{jl})$. It is assumed that the matrix $\mathbf{K}_{jl}$ is Gaussian, $K_{jl}(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp(-h_{jl}\, d_{ii'}^2/q_{0.05})$, where $d_{ii'}^2$ is the squared marker-based distance between individuals i and i′ for the jth environment (the squared distance is standardized by $q_{0.05}$, its 0.05 quantile). Each $\mathbf{K}_{jl}$ has a different bandwidth parameter $h_{jl}$. The set of values of $h_{jl}$ is selected by first defining a range of values that avoids small and large values of $h_{jl}$ that could generate unit or diagonal matrices. The strategy for specifying the values of $h_{jl}$ using the BGLR package (de los Campos and Pérez-Rodríguez, 2015) is also described in Pérez-Rodríguez and de los Campos (2014). One strategy consists of scaling the squared distances by their median (Crossa et al., 2010) with the aim of reducing the interval of possible values of $h_{jl}$ and therefore avoiding a $\mathbf{K}_{jl}$ matrix of ones (global bandwidth) or, at the other extreme, a diagonal matrix (local bandwidth). Values of $h_{jl}$ are then selected by exploring the structure of matrix $\mathbf{K}_{jl}$. In the present study, similar to Pérez-Rodríguez et al. (2012), we considered three kernels for RKHS KA with $h_{j1} = 0.2$, $h_{j2} = 1$, and $h_{j3} = 5$, with the squared distances divided by their 0.05 quantile.

Kernel averaging uses a prior distribution that assumes independence among the random effects $\mathbf{u}_{j1}, \dots, \mathbf{u}_{jr}$; de los Campos et al. (2010) also show that RKHS KA is equivalent to model [2] with $\mathbf{K}_j = \sum_{l=1}^{r} w_{jl}\mathbf{K}_{jl}$, where $w_{jl} = \sigma^2_{u_{jl}}/\sum_{l'=1}^{r}\sigma^2_{u_{jl'}}$, and argue that inferring the weights leads to a kernel $\mathbf{K}_j$ that is optimal. In this manner, each random effect is weighted by its variance component in a process termed kernel averaging (KA) or multikernel fitting.
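
A minimal sketch of kernel averaging with BGLR, using the same placeholder objects as in the previous sketches: each of the three Gaussian kernels (h = 0.2, 1, 5) enters as its own random effect, and the estimated variance components play the role of the kernel weights.

```r
library(BGLR)

X   <- scale(X_raw)
D2  <- as.matrix(dist(X))^2
q05 <- quantile(D2[upper.tri(D2)], 0.05)

hs  <- c(0.2, 1, 5)                                   # the three bandwidths used in this study
ETA <- lapply(hs, function(h) list(K = exp(-h * D2 / q05), model = "RKHS"))

fm <- BGLR(y = y, ETA = ETA, nIter = 35000, burnIn = 5000, verbose = FALSE)

# The estimated variance components act as the (unnormalized) kernel weights
sapply(fm$ETA, function(e) e$varU)
```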

Multi-Environment Marker × Environment Interaction Model

The parametric G × E interaction model considers the effects of m environments, and the effects of the markers are separated into two components: (i) the main effect of the markers across all environments, $\boldsymbol{\beta}_0$, and (ii) the effect of the markers in each environment, $\boldsymbol{\beta}_j$ (López-Cruz et al., 2015), that is,

$$y_{ij} = \mu_j + \sum_{k=1}^{p} x_{ijk}\beta_{0k} + \sum_{k=1}^{p} x_{ijk}\beta_{jk} + \varepsilon_{ij}$$

(i = 1, …, $n_j$ observations; j = 1, …, m environments; k = 1, …, p markers). In matrix notation, this is expressed as

$$\mathbf{y} = \boldsymbol{\mu} + \mathbf{X}_0\boldsymbol{\beta}_0 + \mathbf{X}_E\boldsymbol{\beta}_E + \boldsymbol{\varepsilon} \quad [3]$$

In model [3], all m environments are considered simultaneously; the main effects of the markers across all environments are given by $\boldsymbol{\beta}_0$, whereas the specific effect of the markers in each environment is contained in $\boldsymbol{\beta}_j$. It is assumed that the random errors in the different environments are normally distributed with mean zero and homogeneous variance $\sigma^2_\varepsilon$.

Model [3] can be reparameterized as

$$\mathbf{y} = \boldsymbol{\mu} + \mathbf{u}_0 + \mathbf{u}_E + \boldsymbol{\varepsilon} \quad [4]$$

for $\mathbf{u}_0 = \mathbf{X}_0\boldsymbol{\beta}_0$ and $\mathbf{u}_E = \mathbf{X}_E\boldsymbol{\beta}_E$, such that $\mathbf{u}_0$ captures the marker information common to all environments and $\mathbf{u}_E$ accounts for the marker information specific to each environment. The magnitude of their variance components determines the relative importance of $\mathbf{u}_0$ and $\mathbf{u}_E$ in the model and, therefore, in the prediction contribution of each type of marker effect.

Model [4] assumes that the random effects $\mathbf{u}_0$ follow a multivariate normal distribution with mean zero and variance–covariance matrix $\sigma^2_{u_0}\mathbf{K}_0$, that is, $\mathbf{u}_0 \sim N(\mathbf{0}, \sigma^2_{u_0}\mathbf{K}_0)$, where the variance component $\sigma^2_{u_0}$ is common to all m environments and the borrowing of information among environments is generated through the kernel matrix $\mathbf{K}_0$. It also assumes that $\mathbf{u}_E \sim N(\mathbf{0}, \mathbf{K}_E)$, where

$$\mathbf{K}_E = \begin{bmatrix} \sigma^2_{u_{E1}}\mathbf{K}_1 & \cdots & \mathbf{0}\\ \vdots & \ddots & \vdots\\ \mathbf{0} & \cdots & \sigma^2_{u_{Em}}\mathbf{K}_m \end{bmatrix} \quad [5]$$

Finally, it is assumed that $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2_\varepsilon\mathbf{I})$, that is, homoscedasticity of the errors. Note that matrix $\mathbf{K}_E$ can be expressed as a sum of m matrices, so that model [4] can be written as

$$\mathbf{y} = \boldsymbol{\mu} + \mathbf{u}_0 + \mathbf{u}_{E1} + \dots + \mathbf{u}_{Em} + \boldsymbol{\varepsilon} \quad [6]$$

The effects given by $\mathbf{u}_{Ej}$ are specific to the jth environment; thus $\mathbf{u}_{Ej}$ has a normal distribution with mean and variance equal to zero outside the jth environment, whose block has variance–covariance matrix $\sigma^2_{u_{Ej}}\mathbf{K}_j$.

Model [6] is a multikernel (KA) model, such that the effects are weighted by their corresponding variance components. Therefore, the difference among the three methods (GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E) lies in the kernel matrices used.

Kernel Genomic Best Linear Unbiased Predictor–Genotype × Environment Interaction Model

For model [4], as described in López-Cruz et al. (2015), $\mathbf{K}_0 = \mathbf{X}_0\mathbf{X}_0'/p$, and in Eq. [5] $\mathbf{K}_E$ is such that each $\mathbf{K}_j = \mathbf{X}_j\mathbf{X}_j'/p$ is specific to its environment. For models [4] or [6], the GBLUP uses a linear kernel to represent G × E; it is equivalent to model [3], which represents the M × E interaction, as discussed in López-Cruz et al. (2015).
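
The following sketch illustrates one way the kernels of models [4]–[6] with linear (GBLUP) kernels could be assembled and fitted with BGLR; it is a simplified illustration rather than the Appendix A code. The objects `y` (phenotypes stacked over environments), `X0` (stacked, centered and standardized marker matrix, one row per line–environment record), and `env` (environment factor) are placeholder names.

```r
library(BGLR)

p  <- ncol(X0)
K0 <- tcrossprod(X0) / p                      # linear kernel for the marker main effects (u0)

# Block kernel for the environment-specific effects uEj of Eq. [6]: the linear kernel of
# environment j sits in its own block and the rest of the matrix is zero.
makeKEj <- function(j) {
  KE  <- matrix(0, nrow(X0), nrow(X0))
  idx <- which(env == j)
  KE[idx, idx] <- tcrossprod(X0[idx, , drop = FALSE]) / p
  KE
}

ETA <- c(list(list(K = K0, model = "RKHS")),
         lapply(levels(env), function(j) list(K = makeKEj(j), model = "RKHS")))

fm <- BGLR(y = y, ETA = ETA, nIter = 35000, burnIn = 5000, verbose = FALSE)
```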

Kernel Reproducing Kernel Hilbert Space Empirical Bayesian–Genotype × Environment Interaction Model

In this case, $\mathbf{K}_0$ is constructed with the Gaussian kernel from the marker matrix $\mathbf{X}_0$, such that $K_0(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp(-h_0\, d_{ii'}^2/q_{0.05})$ for all pairs of lines i, i′ [i = 1, …, (n)(m)]. Also, $\mathbf{u}_E \sim N(\mathbf{0}, \mathbf{K}_E)$, with the kernel matrix for each environment as previously defined by $\mathbf{K}_E$ in Eq. [5], where $\mathbf{K}_j$ is specific to each environment and $K_j(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp(-h_j\, d_{ii'}^2/q_{0.05})$; in this case, i = 1, …, $n_j$, and the kernel is calculated for each environment using markers $\mathbf{X}_j$ and bandwidth $h_j$.

There are several methods for selecting $h_0$ and the $h_j$. One option is to perform a joint estimation of the bandwidth parameters; however, this process is complicated. In this study, we used a simpler method in which bandwidth parameters are assigned without adjusting the full model, using the empirical Bayesian estimation method already described. We assumed two scenarios: in the first, the marker-specific effects are zero (positive and high correlations between environments), so $h_0$ is estimated from $\mathbf{K}_0$ in the model $\mathbf{y} = \boldsymbol{\mu} + \mathbf{u}_0 + \boldsymbol{\varepsilon}$ (according to model [6]); in the second, the marker main effect is zero (correlations between environments are negligible), so each specific $h_j$ is estimated from $\mathbf{K}_j$ in the model $\mathbf{y}_j = \boldsymbol{\mu}_j + \mathbf{u}_j + \boldsymbol{\varepsilon}_j$ (according to models [2] and [6]). The function margh.fun (see Box A1 in Appendix A) estimates each of these parameters. Kernels $\mathbf{K}_0$ and $\mathbf{K}_E$ can thus be defined, although an optimal selection of their bandwidth parameters is not guaranteed. However, the structure of the G × E model allows dealing with situations where the correlations between environments are intermediate by weighting the effects with their corresponding variance components. In a second stage, model [6] is fitted with BGLR (Boxes A3 and A4 in Appendix A show the R codes and an example of how this method is applied).

Reproducing Kernel Hilbert Space Kernel Averaging–Genotype × Environment Interaction Model

The main effects of the markers $\mathbf{u}_0$ can be represented as the sum of r random effects, $\mathbf{u}_0 = \mathbf{u}_{01} + \dots + \mathbf{u}_{0l} + \dots + \mathbf{u}_{0r}$, whose distributions are multivariate normal with mean zero and variances depending on the Gaussian kernels, with the bandwidth parameters as previously defined and the distances calculated from matrix $\mathbf{X}_0$. Therefore, for the three kernels used ($h_1 = 0.2$, $h_2 = 1$, and $h_3 = 5$), $\mathbf{K}_0$ in $\mathbf{u}_0 \sim N(\mathbf{0}, \sigma^2_{u_0}\mathbf{K}_0)$ is a weighted average of the three kernels, $\mathbf{K}_0 = \sum_{l=1}^{3} w_{0l}\mathbf{K}_{0l}$.

Similarly, each $\mathbf{K}_j$ required in matrix $\mathbf{K}_E$ for $\mathbf{u}_E \sim N(\mathbf{0}, \mathbf{K}_E)$ (Eq. [5] and [6]) is also a weighted average of three Gaussian kernels. Boxes A5 and A6 (Appendix A) show the R codes and an example of how this method is applied.
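
A compact sketch, under the same placeholder objects as above (`y`, `X0`, `env`), of how the RKHS KA-G × E kernels could be assembled: three Gaussian kernels (bandwidths 0.2, 1, and 5) for the marker main effects and three per environment for the specific effects, each embedded in a zero block as in Eq. [5]. This is an illustration, not the code of Boxes A5 and A6.

```r
library(BGLR)

# Gaussian kernel helper: squared Euclidean distances scaled by their 0.05 quantile
gaussK <- function(X, h) {
  D2 <- as.matrix(dist(X))^2
  exp(-h * D2 / quantile(D2[upper.tri(D2)], 0.05))
}
hs <- c(0.2, 1, 5)

# Three Gaussian kernels for the marker main effects, built from the stacked markers X0
ETA <- lapply(hs, function(h) list(K = gaussK(X0, h), model = "RKHS"))

# Three Gaussian kernels per environment for the specific effects, each in a zero block
for (j in levels(env)) {
  idx <- which(env == j)
  for (h in hs) {
    KE <- matrix(0, nrow(X0), nrow(X0))
    KE[idx, idx] <- gaussK(X0[idx, , drop = FALSE], h)
    ETA <- c(ETA, list(list(K = KE, model = "RKHS")))
  }
}

fm <- BGLR(y = y, ETA = ETA, nIter = 35000, burnIn = 5000, verbose = FALSE)
```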

Statistical Analysis

The models described above were fitted with the GBLUP, RKHS EB, and RKHS KA methods for the wheat and maize data sets. For each data set, we performed (i) single-environment analyses (model [2]), (ii) analyses of pairs of environments, and (iii) an analysis of all environments. The last two approaches were implemented using the G × E model [4]. For inference and prediction, random training (TRN) and testing (TST) partitions (Burgueño et al., 2012) were used. In each case, inferences and predictions were based on 30,000 samples collected from the posterior and predictive distributions after discarding 5000 samples as burn-in.

Prediction Assessment

Prediction accuracy was assessed using 50 TRN-TST random partitions; we used this approach because with a replicated TRN-TST design, one can obtain as many partitions as one needs. This allows estimating standard errors of estimates of prediction accuracy more precisely than with a cross-validation approach. Following Burgueño et al. (2012), we considered the problem of predicting the performance of lines evaluated in some but not all target environments (cross-validation 2 design [CV2]). Supporting information in López-Cruz et al. (2015) provides the R code used to generate TRN-TST partitions in CV2. We assigned 70% of the lines (bivariate or all environments) to TRN and the remaining 30% to TST.

For each TRN-TST partition, models were fitted to the TRN data set, and prediction accuracy was assessed by computing Pearson’s product–moment correlation between predictions and phenotypes in the TST data set within each environment. Variance components were estimated in each of the TRN-TST partitions. The same TRN-TST partitions were used to assess the prediction accuracy of each of the methods (GBLUP, RKHS KA, and RKHS EB), which yielded 50 correlation estimates per method. For each data set, we report the average and standard deviation of the Pearson correlation estimates across partitions, as well as the average variance component estimates.
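
The exact partitioning code is provided in the supporting information of López-Cruz et al. (2015); the sketch below is only a simplified CV2-type illustration in which 30% of the lines are masked in one randomly chosen environment each, the model is refit, and within-environment Pearson correlations are computed. It assumes the placeholder objects `y`, `env`, and `ETA` from the previous sketches plus a `line` vector of genotype identifiers (one per record).

```r
set.seed(123)
ids     <- unique(line)
tst_ids <- sample(ids, size = round(0.30 * length(ids)))   # 30% of the lines form the testing set

# Mask each testing line in one randomly chosen environment (simplified CV2-type masking)
yNA <- y
for (id in tst_ids) {
  recs <- which(line == id)
  yNA[recs[sample(length(recs), 1)]] <- NA
}

fm  <- BGLR(y = yNA, ETA = ETA, nIter = 35000, burnIn = 5000, verbose = FALSE)
tst <- which(is.na(yNA))

# Pearson correlation between predicted and observed values, computed within environment
sapply(levels(env), function(j) {
  k <- tst[env[tst] == j]
  if (length(k) > 1) cor(fm$yHat[k], y[k]) else NA
})
```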

Variance Components Estimation across Random Partitions

For the single-environment model and the three methods (GBLUP, RKHS EB, and RKHS KA), the means of the variance components for the main effects and the error, estimated from the 50 random partitions, are reported. For the RKHS KA method, the mean of the variance components of the three kernels was computed. Similarly, for the G × E models GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E, we calculated the mean of the estimated variance components for the marker main effect, the marker-specific effects, and the error across all 50 random partitions. For RKHS KA-G × E, we calculated three kernels for the marker main effects and three for the marker-specific effects.

From Eq. [4], the phenotypic response variables for the ith line in environments j and j′ under the G × E model are $y_{ji} = \mu_j + u_{0ji} + u_{Eji} + \varepsilon_{ji}$ and $y_{j'i} = \mu_{j'} + u_{0j'i} + u_{Ej'i} + \varepsilon_{j'i}$, respectively; the estimated phenotypic correlation for this pair of environments can thus be expressed as a function of the variance components (López-Cruz et al., 2015):

$$\rho_{jj'} = \frac{\sigma^2_{u_0}}{\sqrt{\left(\sigma^2_{u_0} + \sigma^2_{u_{Ej}} + \sigma^2_\varepsilon\right)\left(\sigma^2_{u_0} + \sigma^2_{u_{Ej'}} + \sigma^2_\varepsilon\right)}}$$

This forces the correlation between environments to be positive or zero, as the variance components are, by definition, positive or zero. Therefore, in the G × E model, the covariance between any two environments is linearly related to the marker main effect, and this determines that their relationship should be either zero or positive. In this study, we report the estimated correlations based on the mean of the estimated variance components as well as the sample phenotypic correlation.
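
As a small numerical illustration of this expression, the function below computes the estimated between-environment correlation from mean variance component estimates; the input values are arbitrary and for illustration only.

```r
# Estimated correlation between environments j and j' from the variance components,
# following the expression above (all numeric values below are illustrative only).
corr_env <- function(varU0, varEj, varEjp, varErr) {
  varU0 / sqrt((varU0 + varEj + varErr) * (varU0 + varEjp + varErr))
}
corr_env(varU0 = 0.40, varEj = 0.15, varEjp = 0.25, varErr = 0.35)
```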

Software Used and Definition of the Priors

The models described above were fitted using the BGLR software. Hyperparameters were defined using the default implementation in BGLR (see Pérez-Rodríguez and de los Campos [2014] for further details). Specific details on the R codes developed to fit the RKHS EB-G × E models are similar to the R codes presented in Pérez-Elizalde et al. (2015) and López-Cruz et al. (2015) and are given in Appendix A.


RESULTS

Prediction Accuracy of the Models with the Three Kernels

Wheat Data

The G × E structure of the wheat data shows Environment 1 to be clearly different from the other three environments (2, 3, and 4) (zero and negative correlations), whereas the other three environments have relatively high to intermediate pair-wise phenotypic correlations. Table 1 shows the average correlation between predicted and observed values of 50 random partitions for kernel GBLUP, RKHS KA, and RKHS EB methods in the single-environment and G × E models (GBLUP-G × E, RKHS EB-G × E and RKHS KA-G × E) in the multienvironment analyses. These predictions were performed with the mode of predictive distributions constructed from the training set. For the RKHS EB single-environment model and the RKHS EB-G × E model, the bandwidth parameter (hj) was previously estimated for each partition in the training set.


Table 1. Wheat data set. Mean correlation (over 50 random partitions) for single-environment and genotype × environment interaction (G × E) models with genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) empirical Bayes (EB), and RKHS kernel averaging (KA) methods, with standard deviations in parentheses.

Single-environment model

| Environment | GBLUP | RKHS EB | RKHS KA | RKHS EB vs. GBLUP (%) | RKHS KA vs. GBLUP (%) | RKHS EB vs. RKHS KA (%) |
| 1 | 0.500 (0.056) | 0.577 (0.043) | 0.556 (0.052) | 15 | 11 | 4 |
| 2 | 0.474 (0.048) | 0.477 (0.056) | 0.482 (0.050) | 1 | 2 | −1 |
| 3 | 0.370 (0.056) | 0.422 (0.053) | 0.401 (0.055) | 14 | 8 | 5 |
| 4 | 0.447 (0.047) | 0.511 (0.044) | 0.522 (0.040) | 14 | 17 | −2 |

G × E model

| Pair | Sample correlation | Environment | GBLUP-G × E | % change† | RKHS EB-G × E | % change‡ | RKHS KA-G × E | % change§ | RKHS EB-G × E vs. GBLUP-G × E (%) | RKHS KA-G × E vs. GBLUP-G × E (%) | RKHS EB-G × E vs. RKHS KA-G × E (%) |
| 1–2 | −0.019 | 1 | 0.509 (0.045) | 2 | 0.582 (0.040) | 1 | 0.581 (0.055) | 4 | 14 | 14 | 0 |
|  |  | 2 | 0.476 (0.055) | 0 | 0.475 (0.048) | 0 | 0.480 (0.052) | 0 | 0 | 1 | −1 |
| 1–3 | −0.1934 | 1 | 0.501 (0.045) | 0 | 0.585 (0.041) | 1 | 0.566 (0.057) | 2 | 17 | 13 | 3 |
|  |  | 3 | 0.361 (0.056) | −3 | 0.415 (0.053) | −2 | 0.399 (0.053) | −1 | 15 | 11 | 4 |
| 1–4 | −0.123 | 1 | 0.505 (0.045) | 1 | 0.585 (0.042) | 1 | 0.577 (0.054) | 4 | 16 | 14 | 1 |
|  |  | 4 | 0.431 (0.055) | −4 | 0.502 (0.045) | −2 | 0.511 (0.044) | −2 | 16 | 19 | −2 |
| 2–3 | 0.661 | 2 | 0.650 (0.033) | 37 | 0.681 (0.032) | 43 | 0.689 (0.032) | 43 | 5 | 6 | −1 |
|  |  | 3 | 0.576 (0.035) | 36 | 0.674 (0.036) | 60 | 0.673 (0.035) | 68 | 17 | 17 | 0 |
| 2–4 | 0.441 | 2 | 0.542 (0.039) | 14 | 0.523 (0.055) | 10 | 0.527 (0.042) | 9 | −3 | −3 | −1 |
|  |  | 4 | 0.493 (0.049) | 10 | 0.551 (0.040) | 8 | 0.567 (0.037) | 9 | 12 | 15 | −3 |
| 3–4 | 0.387 | 3 | 0.461 (0.046) | 25 | 0.498 (0.051) | 18 | 0.503 (0.043) | 25 | 8 | 9 | −1 |
|  |  | 4 | 0.498 (0.050) | 11 | 0.563 (0.042) | 10 | 0.558 (0.034) | 7 | 13 | 12 | 1 |
| 1–2–3–4 |  | 1 | 0.445 (0.054) | −11 | 0.458 (0.060) | −21 | 0.438 (0.063) | −21 | 3 | −2 | 5 |
|  |  | 2 | 0.600 (0.041) | 27 | 0.644 (0.047) | 35 | 0.661 (0.043) | 37 | 7 | 10 | −3 |
|  |  | 3 | 0.523 (0.043) | 41 | 0.586 (0.046) | 39 | 0.565 (0.039) | 41 | 12 | 8 | 4 |
|  |  | 4 | 0.489 (0.043) | 9 | 0.543 (0.044) | 6 | 0.573 (0.042) | 10 | 11 | 17 | −5 |
| 2–3–4 |  | 2 | 0.639 (0.040) | 35 | 0.678 (0.040) | 42 | 0.678 (0.036) | 41 | 6 | 6 | 0 |
|  |  | 3 | 0.590 (0.04) | 59 | 0.654 (0.038) | 55 | 0.655 (0.040) | 3 | 11 | 11 | 0 |
|  |  | 4 | 0.537 (0.05) | 20 | 0.550 (0.055) | 8 | 0.555 (0.043) | 6 | 2 | 33 | −1 |

†Percentage change between GBLUP-G × E vs. GBLUP single environment.
‡Percentage change between RKHS EB-G × E vs. RKHS EB single environment.
§Percentage change between RKHS KA-G × E vs. RKHS KA single environment.

Several results regarding the models’ prediction accuracy and the correlation between environments are depicted in Table 1. For the single-environment model (first section of Table 1), the prediction accuracy of RKHS EB and RKHS KA is clearly better than that of GBLUP in Environments 1, 3, and 4 but not in Environment 2. The percentage change in accuracy for Environments 1 and 3 of RKHS EB vs. GBLUP (15 and 14%) was superior to that of RKHS KA vs. GBLUP (11 and 8%), whereas for Environment 4, the opposite occurred (14 vs. 17%, respectively). As expected, the accuracies of RKHS EB and RKHS KA for the single-environment model are very similar, with an increase in prediction accuracy of RKHS EB over RKHS KA in Environments 1 and 3 (4 and 5%, respectively).

Results of the prediction accuracy of the G × E models GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E for pairs of environments and for all four environments are given in the second part of Table 1. Environment 1 is negatively correlated with Environments 3 and 4 and has near-zero correlation with Environment 2, whereas the rest of the pair-wise correlations between environments (2–3, 2–4, 3–4) are positive and relatively high, especially for Environments 2–3 (0.661). When the pair-wise correlation between environments is negligible or negative (Environments 1–2, 1–3, 1–4), the accuracy in predicting the performance of the wheat lines in each environment under each of the three G × E models is similar to that of their single-environment counterparts. When the correlation between environments is positive (Environments 2–3, 2–4, 3–4), the prediction accuracy of the lines in the environments under each of the three G × E models is higher than that of their single-environment counterparts, reaching gains for Environment 3 (in pair 2–3) of 60% for RKHS EB-G × E vs. RKHS EB single environment and 68% for RKHS KA-G × E vs. RKHS KA single environment. When the four environments are analyzed together using the three G × E models, all environments (except Environment 1) are predicted with higher accuracy than when the single-environment GBLUP, RKHS EB, and RKHS KA methods were fitted (Table 1). When only the three environments with high correlation are combined (2–3–4), the prediction accuracy of the G × E models surpasses the single-environment models by up to 55 to 59%. In summary, for positively correlated environments, the three methods GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E had higher prediction accuracy than their single-environment GBLUP, RKHS EB, and RKHS KA counterparts.

Regarding the pair-wise environment combinations, the differences in accuracy between the three G × E methods (GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E) are similar to those observed among the single-environment methods. The percentage change in accuracy between the three G × E models shows that RKHS EB-G × E and RKHS KA-G × E had a greater increase in prediction accuracy than GBLUP-G × E an approximately equal number of times across the different pair-wise environments and when all four environments are analyzed together. These results indicate a similar superiority in prediction accuracy of the RKHS EB-G × E and RKHS KA-G × E methods over GBLUP-G × E. Also, when comparing the change in prediction accuracy of RKHS EB-G × E and RKHS KA-G × E, results indicated very small differences, as expected.

In summary, results show that for positively correlated environments (little G × E), models RKHS EB-G × E and RKHS KA-G × E have prediction accuracies that are clearly superior to those of their single-environment counterparts. Regardless of the sign of the correlation between pairs of environments, the RKHS EB and RKHS KA methods, in both the single-environment model and the G × E models (RKHS EB-G × E and RKHS KA-G × E), had higher prediction accuracy than GBLUP in the single-environment model or GBLUP-G × E. Thus, the advantage of modeling G × E using kernel averaging or the Gaussian kernel with an estimated bandwidth parameter over the GBLUP, for single environments or for G × E, was clear in the wheat data.

Maize Data

The G × E structure of the maize data indicates that all three pair-wise correlations between environments are positive but relatively low, indicating intermediate to high G × E. Table 2 shows the average correlation between predicted and observed values over 50 random partitions for the GBLUP, RKHS KA, and RKHS EB single-environment methods and the GBLUP-G × E, RKHS KA-G × E, and RKHS EB-G × E methods. For the single-environment models, the advantages in prediction accuracy of the RKHS EB and RKHS KA models over the GBLUP are clear for the three environments; the increases in prediction accuracy ranged from 3 to 12%. Small differences in prediction accuracy between the RKHS EB and RKHS KA models were detected, with increases in prediction accuracy of 1, 1, and 3% of the RKHS EB model over the RKHS KA model for Environments 1, 2, and 3, respectively. The percentage change in accuracy of RKHS EB vs. GBLUP was superior (4, 7, and 12%) to that of RKHS KA vs. GBLUP (3, 6, and 9%) for the three environments.


Table 2. Maize data set. Average correlation (over 50 random partitions) for single-environment and G × E models with genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) empirical Bayes (EB), and RKHS kernel averaging (KA) methods, with standard deviations in parentheses.

Single-environment model

| Environment | GBLUP | RKHS EB | RKHS KA | RKHS EB vs. GBLUP (%) | RKHS KA vs. GBLUP (%) | RKHS EB vs. RKHS KA (%) |
| 1 | 0.558 (0.038) | 0.583 (0.042) | 0.577 (0.052) | 4 | 3 | 1 |
| 2 | 0.507 (0.049) | 0.542 (0.056) | 0.535 (0.057) | 7 | 6 | 1 |
| 3 | 0.508 (0.051) | 0.568 (0.044) | 0.552 (0.050) | 12 | 9 | 3 |

G × E model

| Pair | Sample correlation | Environment | GBLUP-G × E | % change† | RKHS EB-G × E | % change‡ | RKHS KA-G × E | % change§ | RKHS EB-G × E vs. GBLUP-G × E (%) | RKHS KA-G × E vs. GBLUP-G × E (%) | RKHS EB-G × E vs. RKHS KA-G × E (%) |
| 1–2 | 0.388 | 1 | 0.600 (0.042) | 8 | 0.623 (0.046) | 7 | 0.626 (0.043) | 8 | 4 | 4 | 0 |
|  |  | 2 | 0.572 (0.053) | 13 | 0.590 (0.049) | 5 | 0.583 (0.052) | 9 | 3 | 2 | 1 |
| 1–3 | 0.262 | 1 | 0.572 (0.046) | 3 | 0.594 (0.043) | 2 | 0.603 (0.044) | 5 | 4 | 6 | −1 |
|  |  | 3 | 0.518 (0.050) | 2 | 0.579 (0.048) | 2 | 0.577 (0.046) | 5 | 12 | 11 | 0 |
| 2–3 | 0.153 | 2 | 0.520 (0.064) | 3 | 0.536 (0.048) | −1 | 0.535 (0.057) | 0 | 3 | 3 | 1 |
|  |  | 3 | 0.502 (0.046) | −1 | 0.564 (0.050) | −1 | 0.560 (0.0471) | 1 | 12 | 9 | 1 |
| 1–2–3 |  | 1 | 0.618 (0.042) | 11 | 0.630 (0.043) | 8 | 0.648 (0.037) | 12 | 2 | 5 | −3 |
|  |  | 2 | 0.547 (0.053) | 8 | 0.566 (0.043) | 4 | 0.564 (0.067) | 5 | 3 | 3 | 0 |
|  |  | 3 | 0.519 (0.044) | 2 | 0.556 (0.055) | −2 | 0.576 (0.047) | 4 | 7 | 11 | −3 |

†Percentage change between GBLUP-G × E vs. GBLUP single environment.
‡Percentage change between RKHS EB-G × E vs. RKHS EB single environment.
§Percentage change between RKHS KA-G × E vs. RKHS KA single environment.

Prediction accuracies of models GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E for pairs of environments and for all three environments were higher than those of their single-environment counterparts. When the three environments were analyzed together using the three G × E models, all environments were predicted with much more accuracy than with the single-environment GBLUP, RKHS EB, and RKHS KA methods (Table 2). The three methods, GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E, are more accurate for predicting the observed lines in each individual environment than the single-environment GBLUP, RKHS EB, and RKHS KA methods.

Regarding the pair-wise combinations between environments, results are similar to those found in the wheat data, that is, the differences in accuracy among the three G × E methods are similar to those observed in the single-environment method. The percentage change in accuracy when comparing the three G × E models shows that RKHS EB-G × E and RKHS KA-G × E had a higher increase in prediction accuracy than GBLUP-G × E. Very small differences were found when comparing the change in prediction accuracy of RKHS EB-G × E and RKHS KA-G × E.

In general, results show that since all three environments are positively correlated, the G × E interaction models RKHS EB-G × E and RKHS KA-G × E have prediction accuracies that are clearly superior to those of their single-environment counterparts. The single-environment methods RKHS EB and RKHS KA, and the methods RKHS EB-G × E and RKHS KA-G × E, had greater prediction accuracy than the GBLUP single-environment model or GBLUP-G × E.

Variance Component Estimation of the Models with the Three Kernels

Wheat Data

Figure 1 depicts the patterns of the average marker main effect and residual variance components of the single-environment models for GBLUP, RKHS EB, and RKHS KA. The mean residual variance for the RKHS KA method is the smallest in all four environments, followed by that for the RKHS EB method; the residual variance for GBLUP is the highest of the three models in all four environments. The marker main effect variance component for each environment increased approximately three- to fourfold from the GBLUP to the RKHS EB and RKHS KA methods (Table B1, Appendix B).

Fig. 1.

Mean of the variance components (and their standard deviation) of the marker main effect and the residual for each of the four environments (1–4) in the wheat data set for the genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) empirical Bayes (EB), and RKHS kernel averaging (KA) models.

 

Figure 2a–c (and Table B1, Appendix B) give the mean variance components (residual, marker main effect, and marker-specific effects) for 50 random partitions for the M × E models and the three kernel methods, GBLUP, RKHS EB, and RKHS KA, for each pair of environments. For RKHS KA single environment, the variance component of the marker main effects comprises the sum of the variance components for the three kernels as previously described. Similarly, RKHS KA-G × E contains the sum of the variance components associated with three of the marker main effects and the sum of the variance components associated with the three kernels of the marker-specific effects. Table B1 (Appendix B) also shows the sample correlation and the estimated correlation between pairs of environments obtained from the mean of the variance component for the marker main effect and the marker-specific effect as previously described and as explained in López-Cruz et al. (2015).

Fig. 2.

For the six pairs of environments in the wheat data set (1–2, 1–3, 1–4, 2–3, 2–4, and 3–4), the plot shows the mean of the variance components for the three models: (A) residual variance component, (B) marker main effect, and (C) marker environment-specific effect. Marker-specific effect 1 denotes the variance component of the marker effect associated with the first environment in the pair (e.g., 3 in pair 3–4), whereas marker-specific effect 2 refers to the variance component of the second environment in that pair (e.g., 4 in pair 3–4).

 

For the M × E models, results can be separated between pairs of environments with negligible or negative correlations (1–2, 1–3, and 1–4) and pairs of environments with positive correlations (2–3, 2–4, and 3–4). In general, the residual variances for RKHS KA-G × E are the smallest, followed by the residual variances for RKHS EB-G × E and GBLUP-G × E, for all pair-wise environments, including the analyses combining all four environments (Table B1, Appendix B). Concerning the marker main effects, the variance components are negligible for pairs of environments with zero or negative correlation (increased G × E) and higher for pairs of environments with positive correlation (decreased M × E) (Fig. 2b). The opposite trend occurred for the marker-specific effects, which tended to have lower values for pairs of environments with positive correlations (decreased G × E) and high values for pairs of environments with negligible or negative correlations (increased M × E) (Fig. 2c). Note that for the pairs of environments with zero or negative correlation, the marker main effect and specific-environment variance components are smaller for GBLUP-G × E than those estimated from the RKHS EB-G × E and RKHS KA-G × E models. Models RKHS EB-G × E and RKHS KA-G × E also showed higher values for the estimated correlation than model GBLUP-G × E.

Maize Data

Figure 3 shows the marker main effect and residual variance components of the single-environment models. Similar to the trends found in the wheat data, the residual variances for the RKHS KA and RKHS EB methods are the smallest in all three environments, whereas the residual variances for GBLUP are the highest. The marker main effect variance component for each environment increased from the GBLUP to the RKHS EB and RKHS KA methods (Table B2, Appendix B).

Fig. 3.

Mean of the variance components and their standard deviation of the marker main effect and the residual for each of the three environments (1–3) in the maize data set for the genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space (RKHS) empirical Bayes (EB), and RKHS kernel averaging (KA) models.

 

Figure 4a–c and Table B2 (Appendix B) give the mean variance components (residual, marker main effect, and marker-specific effects) over 50 random partitions for the G × E models and the three kernel methods, GBLUP, RKHS EB, and RKHS KA, for each pair of environments (all with positive correlations). The residual variances of RKHS KA-G × E are the smallest, followed by the residual variances of RKHS EB-G × E and GBLUP-G × E, for all pair-wise environments, including the analysis combining all three environments (Fig. 4a; Table B2, Appendix B). Models fit the data better (lower residual variances) for pairs of environments with higher correlations. Also, the nonlinear kernel models fit better than GBLUP.

Fig. 4.

For the three pairs of environments in the maize data set (1–2, 1–3, and 2–3), the plot shows the mean of the variance components (and their standard deviation) for the three models (genomic best linear unbiased prediction [GBLUP]–genotype × environment interaction [G × E], reproducing kernel Hilbert space [RKHS] empirical Bayes [EB]–G × E, and RKHS kernel averaging [KA]–G × E): (A) residual variance component, (B) marker main effect, and (C) marker environment-specific effect. Marker-specific effect 1 denotes the variance component of the marker effect associated with the first environment in the pair (e.g., 2 in pair 2–3), whereas marker-specific effect 2 refers to the variance component of the second environment in that pair (e.g., 3 in pair 2–3).

 

For the marker main effect, the variance component increases for the pair of environments with higher correlation (Environments 1–2; decreased G × E) and decreases for the pairs of environments with lower correlations (Environments 1–3 and 2–3; increased G × E) (Fig. 4b). On the other hand, the marker-specific effect variance component decreases for the pair of environments with higher correlation (Environments 1–2) and increases for the pairs of environments with lower correlations (Environments 1–3 and 2–3) (Fig. 4c). In summary, the patterns of the variance components of the residual, marker main effect, and marker-specific effects for pairs of environments are clearly delineated between the G × E models with a linear kernel (GBLUP-G × E) and those with a nonlinear kernel (RKHS KA-G × E and RKHS EB-G × E) (Fig. 4a–c).


DISCUSSION

For the G × E models, the borrowing of information is implemented through the estimation of the marker main effects (u0, Eq. [4]), while the marker-specific effects contribute information from each specific environment (uE, Eq. [4]) (López-Cruz et al., 2015). These additive models allow weighting the predictions of the marker main and specific effects through their variance components, whose estimates indicate the level of influence of each type of marker effect. This occurs in the models with all three methods (GBLUP, RKHS KA, and RKHS EB).

Prediction of the Linear and Nonlinear Kernel Methods and Models

The prediction assessment in this study was performed by applying the cross-validation strategy called CV2 in Burgueño et al. (2012), Jarquín et al. (2014), and López-Cruz et al. (2015), where some lines are represented in some environments but not in others. The prediction accuracies of the G × E models are directly related to the correlations between environments. Comparisons can be made between the same method under two models (i.e., RKHS EB single environment vs. RKHS EB-G × E) or between two methods under the same model (i.e., RKHS EB single environment vs. RKHS KA single environment). For the wheat data set, the Gaussian kernel models RKHS EB-G × E and RKHS KA-G × E show increases in prediction accuracy over their single-environment counterparts that varied depending on the correlation between environments, with gains reaching up to 60 to 68%. For pair-wise environments with negligible or negative correlations, no gains in prediction accuracy were achieved by RKHS EB-G × E and RKHS KA-G × E as compared with the RKHS EB and RKHS KA single-environment models because, as previously mentioned, the interaction model forces the covariance between environments to be nonnegative. Burgueño et al. (2012) also found that prediction accuracy for Environment 1, which has low or even negative correlations with the other environments, was lower than that for Environment 2, which has positive correlations with Environments 3 and 4.

The superiority of the G × E model over the single-environment model is because the G × E model allows borrowing of information of one line across environments, that is, the G × E model allows using information from the same line tested in correlated environments. This is an important feature of the G × E model that can be used in sparse schemes for testing lines in multienvironment trials.

For comparisons between methods under the G × E model, there is a clear trend of increasing prediction accuracy of RKHS EB-G × E and RKHS KA-G × E over GBLUP-G × E for all pair-wise comparisons between environments (including those with negative, negligible, or positive correlations), with increases in prediction accuracy of up to 16 to 19% for the wheat data set. Differences found in the prediction accuracies of the two Gaussian kernel models, RKHS EB-G × E and RKHS KA-G × E, were very small.

For the maize data set, the increase in prediction accuracy did not reach values as high as those found in the wheat data set, but the gains in prediction accuracy over the GBLUP single environment and over the GBLUP G × E were consistent for the single-environment analyses, as well as for each pair of environments, and for all three environments analyzed together. The average increase in prediction accuracy of RKHS EB-G × E and RKHS KA-G × E over GBLUP-G × E was 5 to 6% across pair-wise environments and all three environments.

These results can be explained because the single-environment model uses the rigid GBLUP linear kernel as compared with the Gaussian kernel, which has better predictive ability and a more flexible structure (Gianola et al., 2014). This flexibility depends on the estimation of the bandwidth parameter (h). The two Gaussian kernel methods used in this study gave similar results in terms of prediction accuracy; the RKHS EB method optimizes the bandwidth parameter based on the integrated likelihood (Pérez-Elizalde et al., 2015), whereas RKHS KA calculates the kernel based on the weighted mean of various kernels such that during the simulation process at convergence, an optimal weighted kernel is generated.

Results of the prediction of the genetic values for a single environment show that the RKHS EB method is slightly superior to RKHS KA, in agreement with the results of Pérez-Elizalde et al. (2015). For the G × E interaction, the prediction accuracies of RKHS EB-G × E and RKHS KA-G × E for pairs of environments are very similar, although RKHS KA seems to be slightly better than RKHS EB when combining all the environments. This may be due to the different ways in which the two methods optimize the bandwidth parameter. Although the estimation of h in RKHS EB-G × E is done independently for the main effects and for each of the specific marker effects, with the remaining parameters eliminated as nuisance parameters in the integration, empirical results pointed to a slight decrease in prediction accuracy when more than two environments are fitted. In contrast, the RKHS KA-G × E method estimates the three kernels within the framework of the model but is computationally slower than the RKHS EB-G × E method because it needs to fit three times as many variance components for the random effects (three kernels). The joint estimation of the bandwidth parameters of RKHS EB-G × E without intensive computation time is a challenging problem that needs further research. However, the numerical method used to estimate the bandwidth parameters in this study provided good predictive performance, possibly favored by the flexible structure of the G × E model.

Variance Components

The RKHS EB and RKHS KA methods fitted the data much better than the GBLUP model for the single-environment and G × E models (the residual error variance component was always lower). When two environments are positively correlated, the marker main effects are the most influential component when predicting genetic values; its variance component is high, and this produces better predictive accuracy than the single-environment model. When the correlation between any two environments is positive intermediate or high (indicating small G × E interaction), the model reacts by reducing the marker-specific effect; thus, the prediction depends solely on the marker main effect. On the other hand, when the correlation between any two environments is near zero or negative, implying strong G × E interaction, the fact that the model cannot borrow information from one environment to predict other environments (López-Cruz et al., 2015) makes the model increase the marker-specific effects and reduce the marker main effect. Clearly, the proportion of genomic variance explained by the main effect of markers and the marker-specific effects is directly related to the sample phenotypic correlation between environments.

The variance of the observations is the same as the sum of the variance of the marker main effects, marker-specific effects, and the error. Therefore, the positive correlation of the observations among environments is explained by the variance component of the model; when the variance of the marker main effect is high, the correlation between environments is high (less G × E); otherwise, the correlation between environments is low. When the correlation between environments is negligible or negative, the G × E model cannot explain the interaction because the sample correlation is estimated through variance components that are, by definition, all positive or zero. Therefore, the G × E model’s prediction accuracy is similar to that of the single-environment model because the variance of the marker main effects tends to be small and positive. Then, negative correlations between environments are not used to improve the predictions, as can be observed in Table 1 and Table B1 (Appendix B) for wheat.

Practical results do not allow us to establish a threshold value for the importance of the marker main effect. However, the results of this study, as well as those of López-Cruz et al. (2015) and Crossa et al. (2015), show that the higher the correlation between environments, the better the predictions obtained with the G × E model for lines in environments where they were never tested. Therefore, in practical situations, the Gaussian kernel methods can be used to improve prediction when little or no G × E is present. The variance component structure of the G × E model explains the previous behavior of the models and determines that when the percentage of variances explained by the marker main effects is small (e.g., in Environments 1 and 4 in the wheat data set), the prediction accuracy and the variance component of the G × E model are similar to those found in the single-environment model. On the other hand, López-Cruz et al. (2015) found that when the marker main effect variance is large, the interaction model performs similarly to the across-environment model (not fitted in this study).

Advantages and Limitations of the Genomic Best Linear Unbiased Predictor–Genotype × Environment Interaction and Gaussian Kernel Genotype × Environment Interaction Models

The RKHS EB-G × E and RKHS KA-G × E models decompose the marker effects into a component that is common to all the environments and a specific marker component for each environment using a nonlinear Gaussian kernel. The GBLUP-G × E estimates the common marker effect for all environments and the specific-environment marker effect by means of a linear kernel. A marker component that is common to all environments allows exchanging information among environments; therefore, the main advantage is that the information between environments with a positive correlation can be used to predict other environments. In this way, an increase in genomic prediction accuracy is achieved. When the correlation between environments is positive and high, the G × E is small or null and these models increase prediction accuracy.

However, when the correlation between environments is close to zero or negative, these models do not increase prediction accuracy. For example, Environment 1 in the wheat data produced most of the G × E; it showed negative correlations with Environments 2, 3, and 4, so prediction accuracies did not improve much over the single-environment model (Table 1). In a real situation, correlations between environments can be negative, around zero, or positive, indicating different levels of G × E. In this regard, other models have advantages over the GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E models because they allow incorporating environments with zero or negative association by using variance–covariance matrices among environments for the genetic values as well as the residuals, such that the off-diagonal elements reflect these correlations. For example, unstructured and factor analytic mixed linear models have shown good prediction accuracy for the four environments of the wheat data set used in this study (Burgueño et al., 2012). However, these multienvironment mixed linear models cannot deal with common and specific marker effects; for example, for the wheat data set, Burgueño et al. (2012) fitted the unstructured model only to Environments 2, 3, and 4 and obtained prediction accuracies of 0.64, 0.58, and 0.50, respectively; the GBLUP-G × E gave prediction accuracies of 0.64, 0.59, and 0.54; the RKHS EB-G × E model gave 0.68, 0.65, and 0.55; and the RKHS KA-G × E model gave 0.68, 0.65, and 0.56 for Environments 2, 3, and 4, respectively.

In terms of prediction accuracy, the Gaussian kernel methods (RKHS EB-G × E and RKHS KA-G × E) are as good as or better than the GBLUP-G × E. For a complex trait such as grain yield, with a large number of small cryptic epistatic effects among markers, the Gaussian kernel methods are more flexible than the GBLUP for capturing such small, subtle interactions. For example, in the wheat data set, the absolute increase of the Gaussian kernel methods over the GBLUP when combining environments with positive correlations (Environments 2, 3, and 4) is ∼0.05, which, although smaller than the effect of removing Environment 1, represents a reasonable improvement in predictive ability. For the maize data set, the absolute increase in prediction accuracy of the Gaussian kernel methods over GBLUP is ∼0.03.


CONCLUSIONS

The phenomenon of G × E in plant breeding makes it difficult to select candidates for the next cycle, thus negatively affecting the response to selection. Recently, several genomic models that incorporate G × E have been proposed. The GBLUP-G × E model uses a linear kernel computed from the markers and decomposes the total marker effect into a marker main effect and a marker-specific environment effect. The two Gaussian kernel methods implemented in the G × E model (RKHS KA and RKHS EB), applied to two data sets, showed consistent increases in genomic prediction accuracy compared with the linear kernel regression used in the GBLUP. The advantage in prediction accuracy of the G × E model over the single-environment model for the kernel methods was clear for pairs of environments with positive correlations (low G × E) but not for pairs of environments with negligible or negative correlations. Variance components of the marker main effects were usually larger than those of the marker-specific environment effects when the sample correlation between environments was positive; the opposite occurred when the environments were uncorrelated or negatively correlated.

The higher prediction accuracy of the Gaussian kernel models (RKHS EB and RKHS KA) over the linear kernel (GBLUP) coupled with the G × E model arises because nonlinear kernels are more flexible in accounting for complex and usually cryptic, small marker main effects and marker-specific interaction effects. The results of this study are promising for applying Bayesian Gaussian kernel regression methods with G × E and for performing prediction in real genomic selection situations where G × E is expected to be present but low.

APPENDIX A

Details of the R code for fitting the different methods and models are shown in Boxes A1 through A6.

Box A1.

Estimation of the bandwidth (h) parameter (adapted from Pérez-Elizalde et al., 2015).
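Below is a minimal sketch of one way to obtain the bandwidth: a simple grid search that keeps the h minimizing the deviance information criterion (DIC) of a single-kernel RKHS fit in BGLR. It is a stand-in for, not a reproduction of, the empirical Bayes procedure of Pérez-Elizalde et al. (2015); object names (X, y, h_grid, h_hat) are illustrative.

library(BGLR)

## X: centered and scaled marker matrix (n lines x p markers); y: phenotype vector of length n
## Squared Euclidean distances among lines, scaled by their mean
D <- as.matrix(dist(X))^2
D <- D / mean(D)

## Grid search: fit a single-kernel RKHS model for each candidate h and keep
## the bandwidth with the lowest DIC
h_grid <- c(0.1, 0.25, 0.5, 1, 2, 5)
dic <- sapply(h_grid, function(h) {
  K  <- exp(-h * D)                                    # Gaussian kernel with bandwidth h
  fm <- BGLR(y = y, ETA = list(list(K = K, model = "RKHS")),
             nIter = 6000, burnIn = 1000, verbose = FALSE)
  fm$fit$DIC
})
h_hat <- h_grid[which.min(dic)]                        # selected bandwidth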

 
Box A2.

Function to generate the missing values for cross-validation 2 design (CV2) scheme (López-Cruz et al., 2015).
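A minimal sketch of such a function, assuming phenotypes in long format (one record per line-environment cell). It illustrates the idea of masking cells so that lines remain observed in some environments but not others, rather than reproducing the exact partitioning code of López-Cruz et al. (2015); the function name make_cv2_na and its arguments are illustrative.

## y_long: data frame with columns line, env, and y (one record per line-environment cell)
## prop:   proportion of cells assigned to the testing (masked) set
make_cv2_na <- function(y_long, prop = 0.2, seed = 1) {
  set.seed(seed)
  y_na <- y_long$y
  tst  <- sample(seq_len(nrow(y_long)), size = round(prop * nrow(y_long)))
  ## CV2: every line keeps at least one observed environment, so information can be
  ## borrowed across environments; here the first record of each line is always kept
  first_rec <- tapply(seq_len(nrow(y_long)), y_long$line, function(idx) idx[1])
  tst  <- setdiff(tst, unlist(first_rec))
  y_na[tst] <- NA
  list(y_na = y_na, tst = tst)                         # masked phenotypes and testing indices
}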

 
Box A3.

Function to fit genotype × environment interaction model (adapted from López-Cruz et al., 2015).
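A minimal sketch of how the G × E linear predictor might be assembled for BGLR: a main-effect kernel shared by all records plus one environment-specific kernel per environment, obtained by zeroing the shared kernel outside that environment's block. The helper build_gxe_eta and its argument names are illustrative.

library(BGLR)

## env:  factor giving the environment of each record (records stacked over environments)
## line: factor giving the line of each record; dimnames of K must match levels(line)
## K:    line x line kernel (XX'/p for GBLUP, or exp(-h * D) for a Gaussian kernel)
build_gxe_eta <- function(env, line, K) {
  Z  <- model.matrix(~ line - 1)                       # incidence matrix of lines
  colnames(Z) <- levels(line)
  K0 <- Z %*% K[levels(line), levels(line)] %*% t(Z)   # main-effect kernel over records
  ETA <- list(ENV  = list(X = model.matrix(~ env - 1), model = "FIXED"),  # environment means
              MAIN = list(K = K0, model = "RKHS"))     # marker main effects
  for (e in levels(env)) {                             # one specific term per environment
    d <- as.numeric(env == e)
    ETA[[paste0("GxE_", e)]] <- list(K = K0 * tcrossprod(d), model = "RKHS")
  }
  ETA
}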

 
Box A4.

An example of fitting the reproducing kernel Hilbert space (RKHS) empirical Bayes (EB)–genotype × environment interaction model.
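A minimal sketch combining the pieces above: the Gaussian kernel built with the bandwidth from Box A1, the CV2 masking from Box A2, the linear predictor from Box A3, and a BGLR fit. Chain lengths and object names are illustrative.

library(BGLR)

## Gaussian kernel with the bandwidth h_hat selected in Box A1
## (assumes the rows of D are ordered as levels(line))
K_gauss <- exp(-h_hat * D)
dimnames(K_gauss) <- list(levels(line), levels(line))

cv  <- make_cv2_na(data.frame(line = line, env = env, y = y), prop = 0.2)   # Box A2
ETA <- build_gxe_eta(env, line, K_gauss)                                    # Box A3

fm <- BGLR(y = cv$y_na, ETA = ETA, nIter = 12000, burnIn = 2000, verbose = FALSE)

## Predictive correlation in the CV2 testing cells
cor(y[cv$tst], fm$yHat[cv$tst])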

 
Box A5.

Fitting the reproducing kernel Hilbert space (RKHS) kernel averaging (KA)–genotype × environment interaction model.
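A minimal sketch of the kernel-averaging version: instead of a single estimated bandwidth, three Gaussian kernels with different bandwidths enter the linear predictor as separate RKHS terms, for the main effect and for each environment-specific effect, and BGLR weighs them through their variance components. The bandwidth values and names are illustrative.

library(BGLR)

h_set <- c(0.2, 1, 5)                                  # illustrative small/medium/large bandwidths
Z <- model.matrix(~ line - 1)                          # assumes rows of D ordered as levels(line)

ETA_ka <- list(ENV = list(X = model.matrix(~ env - 1), model = "FIXED"))
for (h in h_set) {
  K0 <- Z %*% exp(-h * D) %*% t(Z)                     # main-effect kernel at this bandwidth
  ETA_ka[[paste0("MAIN_h", h)]] <- list(K = K0, model = "RKHS")
  for (e in levels(env)) {                             # environment-specific kernels at this bandwidth
    d <- as.numeric(env == e)
    ETA_ka[[paste0("GxE_", e, "_h", h)]] <- list(K = K0 * tcrossprod(d), model = "RKHS")
  }
}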

 
Box A6.

An example of reproducing kernel Hilbert space (RKHS) kernel averaging (KA)–genotype × environment interaction model.
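A minimal sketch of fitting and summarizing the kernel-averaging G × E model assembled in Box A5, reusing the CV2 partition from Box A4.

fm_ka <- BGLR(y = cv$y_na, ETA = ETA_ka, nIter = 12000, burnIn = 2000, verbose = FALSE)

## Predictive correlation in the CV2 testing cells
cor(y[cv$tst], fm_ka$yHat[cv$tst])

## Posterior mean variance of each main-effect kernel; their relative sizes act as kernel weights
sapply(fm_ka$ETA[grep("^MAIN", names(fm_ka$ETA))], function(term) term$varU)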

 

APPENDIX B

Tables B1 and B2 follow.


Table B1.

Wheat data set with four environments (1–4). Average variance components across partitions (standard deviations in parentheses) for the residual, marker main effects, and marker-specific effects under the single-environment models (genomic best linear unbiased prediction [GBLUP], reproducing kernel Hilbert space [RKHS] empirical Bayes [EB], and RKHS kernel averaging [KA]) and the genotype × environment interaction (G × E) models (GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E). The variance components associated with the three kernels of RKHS KA-G × E are averaged. Sample and estimated correlations are given for each pair of environments.

 
Single-environment models (rows for Environments 1–4). Columns: environment; GBLUP residual and main effect; RKHS EB residual and main effect; RKHS KA residual and main effect.

G × E models (rows for environment combinations). Columns: combination; sample correlation; environment; then, for each of GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E: residual, main effect, specific effect, and estimated correlation (correlations are reported only for pairwise combinations). Subsequent lines of a combination give the specific-effect variances of the remaining environments under each model.
1 0.525 (0.0405) 0.567 (0.0599) 0.309 (0.0546) 0.856 (0.0770) 0.277 (0.0615) 1.121 (0.0966)
2 0.578 (0.0420) 0.483 (0.0449) 0.571 (0.0557) 2.090 (0.6776) 0.411 (0.0358) 1.250 (0.1252)
3 0.601 (0.0511) 0.504 (0.0716) 0.267 (0.0526) 0.801 (0.0913) 0.302 (0.0366) 1.220 (0.1371)
4 0.600 (0.0419) 0.469 (0.0474) 0.377 (0.0601) 0.792 (0.0687) 0.338 (0.0395) 1.091 (0.0820)
1–2 −0.019 1 0.570 (0.0297) 0.030 (0.0022) 0.205 (0.0235) 0.037 0.464 (0.0418) 0.103 (0.0184) 0.614 (0.077) 0.050 0.380 (0.0464) 0.118 (0.0049) 0.784 (0.0592) 0.082
2 0.191 (0.0232) 2.620 (0.9138) 1.110 (0.1547)
1–3 −0.1934 1 0.582 (0.0294) 0.027 (0.0022) 0.202 (0.0229) 0.033 0.333 (0.0466) 0.088 (0.0338) 0.780 (0.0669) 0.070 0.304 (0.0411) 0.095 (0.0023) 0.851 (0.0538) 0.070
3 0.201 (0.0232) 0.942 (0.1424) 1.000 (0.1012)
1–4 −0.123 1 0.572 (0.0286) 0.025 (0.0023) 0.212 (0.0221) 0.03 0.344 (0.0482) 0.076 (0.0244) 0.750 (0.0653) 0.060 0.330 (0.0390) 0.106 (0.0038) 0.828 (0.0567) 0.082
4 0.193 (0.0236) 0.800 (0.08950) 0.857 (0.0607)
2–3 0.661 2 0.495 (0.0224) 0.282 (0.0254) 0.040 (0.0033) 0.345 0.368 (0.0196) 0.115 (0.1550) 0.181 (0.0422) 0.680 0.291 (0.0201) 1.060 (0.2008) 0.192 (0.0082) 0.685
3 0.043 (0.0037) 0.116 (0.0156) 0.210 (0.0089)
2–4 0.441 2 0.575 (0.0263) 0.140 (0.0219) 0.076 (0.0114) 0.18 0.481 (0.0389) 0.611 (0.1606) 0.859 (0.8322) 0.377 0.290 (0.0294) 1.100 (0.1734) 0.538 (0.0717) 0.577
4 0.076 (0.0121) 0.256 (0.0960) 0.511 (0.0367)
3–4 0.387 3 0.596 (0.0261) 0.150 (0.0249) 0.071 (0.0096) 0.185 0.418 (0.0359) 0.520 (0.0577) 0.317 (0.0815) 0.424 0.395 (0.0343) 0.605 (0.0892) 0.344 (0.0236) 0.450
4 0.066 (0.0101) 0.260 (0.0566) 0.344 (0.0207)
1–2–3–4 1 0.571 (0.0177) 0.139 (0.0148) 0.324 (0.0291) 0.445 (0.0320) 0.814 (0.2997) 0.895 (0.0992) 0.394 (0.0284) 0.477 (0.1229) 1.300 (0.2541)
2 0.049 (0.0071) 0.529 (0.2920) 0.35 (0.0492)
3 0.046 (0.0074) 0.247 (0.0704) 0.341 (0.0345)
4 0.080 (0.0162) 0.353 (0.0534) 0.426 (0.0332)

Table B2.

Maize data set with three environments (1–3). Average variance components (standard deviations in parentheses) for the residual, marker main effects, and marker-specific effects under the single-environment models (GBLUP, RKHS EB, and RKHS KA) and the G × E models (GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E). The variance components associated with the three kernels of RKHS KA are averaged. Sample and estimated correlations are given for each pair of environments.

 
Single-environment models (rows for Environments 1–3). Columns: environment; GBLUP residual and main effect; RKHS EB residual and main effect; RKHS KA residual and main effect.

G × E models (rows for environment combinations). Columns: combination; sample correlation; environment; then, for each of GBLUP-G × E, RKHS EB-G × E, and RKHS KA-G × E: residual, main effect, specific effect, and estimated correlation (correlations are reported only for pairwise combinations). Subsequent lines of a combination give the specific-effect variances of the remaining environments under each model.
1 0.443 (0.0396) 0.522 (0.0536) 0.326 (0.0388) 0.873 (0.0915) 0.232 (0.0162) 1.080 (0.1090)
2 0.390 (0.0401) 0.723 (0.0913) 0.269 (0.0377) 1.100 (0.0635) 0.214 (0.0168) 1.430 (0.1399)
3 0.500 (0.0325) 0.526 (0.0615) 0.267 (0.0269) 0.801 (0.0573) 0.235 (0.0176) 0.990 (0.0677)
1-2 0.388 1 0.415 (0.0305) 0.377 (0.0639) 0.180 (0.0292) 0.371 0.290 (0.0296) 0.533 (0.0444) 0.380 (0.1059) 0.423 0.234 (0.0230) 0.599 (0.0907) 0.443 (0.0275) 0.423
2 0.271 (0.0509) 0.488 (0.0578) 0.668 (0.0649)
1-3 0.262 1 0.489 (0.0275) 0.200 (0.0457) 0.248 (0.0411) 0.212 0.306 (0.0330) 0.290 (0.0608) 0.571 (0.1067) 0.263 0.248 (0.0261) 0.271 (0.0239) 0.706 (0.0662) 0.232
3 0.285 (0.0538) 0.448 (0.0610) 0.593 (0.0324)
2-3 0.153 2 0.491 (0.0366) 0.105 (0.0213) 0.437 (0.0617) 0.104 0.252 (0.0263) 0.124 (0.0280) 0.960 (0.1052) 0.105 0.231 (0.0223) 0.135 (0.077) 1.334 (0.1637) 0.100
3 0.392 (0.0698) 0.664 (0.0547) 0.736 (0.0406)
1-2-3 1 0.472 (0.0177) 0.230 (0.0148) 0.193 (0.0291) 0.279 (0.0320) 0.274 (0.02097) 0.576 (0.0992) 0.242 (0.0284) 0.300 (0.1229) 0.592 (0.2541)
2 0.313 (0.0071) 0.721 (0.2920) 0.720 (0.0492)
3 0.343 (0.0074) 0.539 (0.0704) 0.539 (0.0345)
 

References

