An important component of any MET is the experimental design at each trial location. Effective experimental designs control plot-to-plot within-location variability so that data reflect the true genetic potential of each cultivar at the location (Oehlert, 2000). Randomized complete block designs (RCBDs) have the advantage that they are simple and work well when environmental conditions within a block are uniform, as is often the case in studies with small numbers of genotypes (<10) and optimal field conditions (Bos, 2008). Randomized complete block designs are not recommended, however, for experiments that include >10 genotypes or for variable field conditions such as those encountered under drought and low N. For these situations, incomplete block designs (e.g., lattice designs) break the field into smaller and more homogeneous sub-blocks for analysis, which reduces within-environment variation further so that differences among genotypes are measured more precisely.
In addition to adjusting for location effects, it is sometimes necessary to adjust a trait for a correlated trait. The correlated trait is included as a covariate in the model. The META program allows the user to adjust one user-specified variable, called the main response variable (MRV), by a covariate. For example, when breeding for drought tolerance in maize (Zea mays L.), it is useful to adjust grain yield for anthesis date. Anthesis date can be strongly positively or negatively correlated with grain yield, depending on the environment; if the target and selection environments do not match perfectly, selection will be ineffective unless yield is adjusted (Banziger et al., 2004; Campos et al., 2004).
Unbalanced data also complicate the analysis of METs. Lattice designs inherently contain unbalanced data, and RCBDs frequently do as well because of adverse field conditions, seed shortages, or other errors (Spilke et al., 2005). In general, METs must be analyzed using a mixed model because they contain a mixture of fixed and random effects. Replicates, incomplete blocks, and sites are considered random effects, while the covariate (if any) is a fixed effect. Genotypes can be considered either a fixed or a random effect, depending on the goals of the analysis and the way the genotypes were selected (Smith et al., 2005). Unbalanced data and mixed effects preclude the estimation of variances using the standard fixed effects model; instead, variances are estimated by restricted maximum likelihood (REML) (Holland, 2006). With unbalanced data and mixed effects, a simple mean does not adequately describe the data. Instead, BLUEs or best linear unbiased predictors (BLUPs) must be used (McLean et al., 1991; Shaw and Mitchell-Olds, 1993). For models with all fixed effects, BLUEs are the appropriate statistic because they estimate the mean performance of a response variable using ordinary least squares. When data are unbalanced, the minimization of deviation from the multivariate regression results in deviation from the simple mean but more accurately represents the true performance of the response variable. Best linear unbiased predictors allow random effects to be included in the model and again minimize deviation from the multivariate regression.
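To illustrate why a simple mean can mislead when data are unbalanced, the following minimal sketch fits an additive location + genotype model by alternating least squares and compares the adjusted genotype means with the raw means. It is a fixed-effects toy example, not the mixed model that META fits in PROC Mixed; the function names and the three data points are hypothetical.

```python
# Toy illustration (hypothetical data): genotype B was not observed at the
# high-yielding location L2, so its simple mean is biased downward.

def simple_means(records):
    """Raw mean of the observed values for each genotype."""
    by_gen = {}
    for _loc, gen, y in records:
        by_gen.setdefault(gen, []).append(y)
    return {gen: sum(ys) / len(ys) for gen, ys in by_gen.items()}

def adjusted_genotype_means(records, n_iter=200):
    """Fit y = location effect + genotype effect by alternating least squares
    and return each genotype's mean averaged over ALL locations."""
    locs = sorted({loc for loc, _, _ in records})
    gens = sorted({gen for _, gen, _ in records})
    loc_eff = {loc: 0.0 for loc in locs}
    gen_eff = {gen: 0.0 for gen in gens}
    for _ in range(n_iter):
        # Update genotype effects given the current location effects.
        for gen in gens:
            resid = [y - loc_eff[l] for l, g, y in records if g == gen]
            gen_eff[gen] = sum(resid) / len(resid)
        # Update location effects given the current genotype effects.
        for loc in locs:
            resid = [y - gen_eff[g] for l, g, y in records if l == loc]
            loc_eff[loc] = sum(resid) / len(resid)
    mean_loc = sum(loc_eff.values()) / len(locs)
    return {gen: gen_eff[gen] + mean_loc for gen in gens}

records = [("L1", "A", 4.0), ("L1", "B", 6.0), ("L2", "A", 8.0)]
raw = simple_means(records)
adjusted = adjusted_genotype_means(records)
print(raw)       # both genotypes have the same raw mean
print(adjusted)  # B ranks above A once the missing cell is accounted for
```

In this example both genotypes have a raw mean of 6, yet B outyielded A by 2 units at the only location where both were grown; the adjusted means recover that difference.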
There are multiple ways to use MET data when making selections: (i) selections can be made based on a combined analysis across all locations; (ii) when multiple management conditions are tested, one management condition may be weighted more heavily than another when making selections (this is often the case with stressed locations); and (iii) weights can be assigned such that certain locations are weighted more heavily than others, possibly because one type of location is more prevalent in the target breeding area or it may be due to past experience (Bos, 2008). Because breeders may wish to use the data from a MET in many ways, the program suite can output results per individual location, per location identified by management, or per location combined by management levels, or the overall results may be combined across all locations depending on the user’s preference.
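As a minimal sketch of option (ii) or (iii), genotype means from different management conditions (or locations) can be combined with user-chosen weights. This is not part of META's output; the function name, the weights, and the yield values below are hypothetical and only illustrate the arithmetic.

```python
def weighted_genotype_scores(means_by_group, weights):
    """Weighted average of genotype means across groups (e.g., managements).

    means_by_group: {group: {genotype: mean}}
    weights:        {group: non-negative weight}
    """
    total = sum(weights.values())
    genotypes = next(iter(means_by_group.values())).keys()
    return {
        gen: sum(weights[grp] * means_by_group[grp][gen] for grp in weights) / total
        for gen in genotypes
    }

# Hypothetical yields (Mg/ha) under two managements; stress is weighted 3:1.
means_by_group = {
    "optimal": {"G1": 10.0, "G2": 9.0},
    "stressed": {"G1": 3.0, "G2": 4.0},
}
weights = {"optimal": 1.0, "stressed": 3.0}
scores = weighted_genotype_scores(means_by_group, weights)
print(scores)
```

With these weights, G2 ranks first even though G1 yields more under optimal conditions.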
Here we present META, a suite of SAS programs that easily and rapidly analyze data from RCBD and lattice designs, including one or more covariates; META also computes BLUEs, BLUPs, variance components, least significant difference (LSD), coefficient of variation (CV), and broad-sense heritability, among other statistics, of the genotypes evaluated in METs. We illustrate the use of this program with a case study of two data sets. The first is a CIMMYT MET, which is part of the Drought Tolerant Maize for Africa Project of CIMMYT’s Global Maize Breeding Program. Because this MET has two management conditions and adjusts grain yield by anthesis date, it is ideal for demonstrating the power of this program. We also present the analysis of an RCBD data set, where 16 genotypes were evaluated at 36 locations across four countries in Africa. In this case, we have grouped the genotypes by country instead of management conditions to show the flexibility of the program.
MATERIALS AND METHODS
The drought tolerance data set consisted of 100 genotypes that were evaluated in five environments using an α-lattice design with two replicates per location. Three environments were drought stressed and two were managed under optimal water conditions. When applicable, grain yield was adjusted by anthesis date. Data on grain yield, anthesis date, anthesis to silking interval, and number of plants at harvest per plot were collected from all five locations and were included in the analysis. Ear height and plant height were collected at only four locations and were also analyzed to show that the program will not include locations with all values missing in its analyses.
The RCBD data set looked at 16 genotypes grown with four replicates at 36 locations in four countries. Data on the traits of days to tasselling, plant height, moisture, and grain yield were taken at all locations, while data on stalk lodging were taken at 28 locations. To show that analyses can be grouped in other ways besides management, this analysis was grouped by country. We did not adjust by a covariate because these trials were grown under well-watered conditions.
The META Suite
The META suite consists of 33 associated SAS programs. It has been optimized for use with SAS version 9.2, but is compatible with version 9.1 (it was not tested with earlier versions). A flow diagram of the main options offered by META is depicted in Fig. 1. The user can run the entire suite of programs using the program META Menus, which has a menu-driven user interface. The user does not need to modify the SAS code; all options are chosen and the data read in through a graphical user interface. Although no changes are required, advanced users may wish to make some modifications; full details of common changes are provided in the user’s manual (provided in the supplemental materials) and as comments in the program’s code. Instructions on formatting data for META are also provided in the user’s manual.
Running the META Suite
To run META, the driver program called META Menus must be opened in SAS and run. This will launch a series of menus where the user tells the programs what type of analysis to run (Fig. 1). First the Select Design menu will launch; this allows the user to select the type of experimental design used: lattice or RCBD (Fig. 2a). Next, the Select Covariate menu launches; it allows the user to choose analysis with or without a covariate (Fig. 2b). The Data Entry menu then has the user input details about the data set, such as where the SAS programs are saved and what the MRV is called (Fig. 2c). Errors can occur when entering data-related information in (i) the input path that specifies where the programs are located, (ii) the output path for storing the results, (iii) the input file name, or (iv) any of the names of factors and variables requested; in these cases, the program will send an error message (Fig. 2d). When the user presses Enter, the Data Entry menu reappears and errors can be corrected before moving to the next menu. Once the data are entered correctly, a menu launches that allows the user to choose between data visualization and data analysis (Fig. 2e). The data visualization submenu has three options: boxplots, frequency histograms, and return to the previous menu (Fig. 2f). The data analysis submenu has seven options. The submenus for the four combinations of experimental design and covariate are slightly different; the first digit of each option number changes, with 11 to 17 indicating a lattice design with a covariate, 21 to 27 an RCBD with a covariate, 31 to 37 a lattice design without a covariate, and 41 to 47 an RCBD without a covariate.
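The two-digit numbering of the data analysis options can be read as a design/covariate digit followed by a suboption digit. The sketch below decodes an option number under that reading; it is only an illustration of the numbering scheme, and the function and dictionary names are hypothetical, not part of the META code.

```python
# First digit -> (experimental design, covariate used), as described in the text.
DESIGNS = {
    1: ("lattice", True),
    2: ("RCBD", True),
    3: ("lattice", False),
    4: ("RCBD", False),
}

def decode_option(code):
    """Split a data analysis option number (e.g., 13) into its parts."""
    first, second = divmod(code, 10)
    if first not in DESIGNS or not 1 <= second <= 7:
        raise ValueError(f"not a valid option number: {code}")
    design, covariate = DESIGNS[first]
    return {"design": design, "covariate": covariate, "suboption": second}

print(decode_option(13))
print(decode_option(47))
```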
The suboptions are the same for each design type and include: 1, genetic correlations among locations; 2, BLUE and BLUP analysis by location without identifying the management type; 3, BLUE and BLUP analysis by location but sorted by management; 4, BLUEs and BLUPs combined by management type; 5, BLUEs and BLUPs combined across all locations; 6, all suboptions 1 to 5 run in a single step; and 7, exit META (Fig. 2g). Details of the data visualization and data analysis options are provided below.
Boxplots or histograms can be printed for data visualization. Boxplots are printed for all locations by trait, while frequency histograms are printed by location, trait, and replicate using the SAS procedures PROC Boxplot and PROC Univariate, respectively (Fig. 2f). The histogram option also prints basic statistics (mean, median, mode, standard deviation, variance, range, and interquartile range) and the extreme observations (the five lowest and five highest), which help detect mistakes or outliers.
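The kind of summary that accompanies the histograms can be sketched in a few lines. The example below uses Python's standard `statistics` module rather than PROC Univariate, reports only a subset of the statistics listed above, and runs on hypothetical plot values that include one obvious recording error.

```python
import statistics

def describe(values, n_extremes=5):
    """Basic statistics plus the lowest and highest observations."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std": statistics.stdev(ordered),
        "variance": statistics.variance(ordered),
        "range": ordered[-1] - ordered[0],
        "lowest": ordered[:n_extremes],
        "highest": ordered[-n_extremes:],
    }

# Hypothetical plot values for one trait at one location; 50.0 is a likely typo.
plot_values = [5.1, 4.8, 5.5, 6.0, 4.9, 5.2, 5.3, 50.0, 5.0, 5.4, 4.7, 5.6]
summary = describe(plot_values)
print(summary["lowest"], summary["highest"])
```

Listing the extremes next to the median makes a value such as 50.0 stand out immediately, which is the purpose of the extreme-observations output.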
Five types of analyses can be performed for data analysis. Using the options shown in Fig. 2g as an example, when locations are analyzed individually (Options 12 and 13), the results for every genotype in every location are printed. When the analysis is combined across locations (Options 11, 14, and 15), any location with a heritability less than the user-defined threshold (default = 0.05) will not be used in the combined analysis across locations. The default threshold (0.05) was used for the sample analyses described here. Options 11, 14, and 15 group together different locations, so the first step in these programs is to calculate the heritability for each location and delete any locations that fall below the threshold.
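The screening step that precedes the combined analyses amounts to a simple filter on the per-location heritabilities. A minimal sketch, with hypothetical location names and heritability values:

```python
def screen_locations(h2_by_location, threshold=0.05):
    """Split locations into those kept for the combined analysis and those
    dropped because their heritability is below the threshold."""
    kept = [loc for loc, h2 in h2_by_location.items() if h2 >= threshold]
    dropped = [loc for loc, h2 in h2_by_location.items() if h2 < threshold]
    return kept, dropped

# Hypothetical per-location heritabilities for one trait.
h2_by_location = {"Loc1": 0.62, "Loc2": 0.02, "Loc3": 0.35}
kept, dropped = screen_locations(h2_by_location)
print(kept, dropped)
```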
The corresponding linear models are implemented in PROC Mixed of SAS using REML to estimate the variance components. For analyses of individual locations using a lattice design and adjusting by a covariate, using the same notation as in the SAS programs, the model is
$$Y_{ijk} = \mu + \mathrm{Rep}_i + \mathrm{Block}_j(\mathrm{Rep}_i) + \mathrm{Gen}_k + \mathrm{Cov} + \varepsilon_{ijk}$$
where $Y_{ijk}$ is the trait of interest, $\mu$ is the mean effect, $\mathrm{Rep}_i$ is the effect of the $i$th replicate, $\mathrm{Block}_j(\mathrm{Rep}_i)$ is the effect of the $j$th incomplete block within the $i$th replicate, $\mathrm{Gen}_k$ is the effect of the $k$th genotype, $\mathrm{Cov}$ is the effect of the covariate, and $\varepsilon_{ijk}$ is the error associated with the $i$th replication, $j$th incomplete block, and $k$th genotype, which is assumed to be normally and independently distributed, with mean zero and homoscedastic variance $\sigma^2$. When calculating the BLUEs, both the genotypes and the covariate are considered fixed effects, whereas all other terms are declared random effects; for calculating the BLUPs and broad-sense heritability, all effects are considered random except the covariate.
For individual analyses using an RCBD and adjusting by a covariate, the corresponding model becomes
$$Y_{ik} = \mu + \mathrm{Rep}_i + \mathrm{Gen}_k + \mathrm{Cov} + \varepsilon_{ik}$$
where the replicates now correspond to the complete blocks and all other terms are as above. For individual analyses without adjusting by a covariate, the models are the same as above, except that the covariate term is deleted.
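For the analyses combined across management conditions or across all locations, new terms are added to the above models. For the lattice design adjusted by a covariate, the model is
$$Y_{ijkl} = \mu + \mathrm{Loc}_i + \mathrm{Rep}_j(\mathrm{Loc}_i) + \mathrm{Block}_k(\mathrm{Loc}_i\,\mathrm{Rep}_j) + \mathrm{Gen}_l + \mathrm{Loc}_i \times \mathrm{Gen}_l + \mathrm{Cov} + \varepsilon_{ijkl}$$
where the new terms $\mathrm{Loc}_i$ and $\mathrm{Loc}_i \times \mathrm{Gen}_l$ are the effects of the $i$th location and the location × genotype interaction, respectively, and replicates and incomplete blocks are now nested within locations. Again, for a combined analysis of an RCBD, the above model becomes
$$Y_{ijl} = \mu + \mathrm{Loc}_i + \mathrm{Rep}_j(\mathrm{Loc}_i) + \mathrm{Gen}_l + \mathrm{Loc}_i \times \mathrm{Gen}_l + \mathrm{Cov} + \varepsilon_{ijl}$$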
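Similarly, for the analyses without a covariate, the models for the lattice design and the RCBD, respectively, are
$$Y_{ijkl} = \mu + \mathrm{Loc}_i + \mathrm{Rep}_j(\mathrm{Loc}_i) + \mathrm{Block}_k(\mathrm{Loc}_i\,\mathrm{Rep}_j) + \mathrm{Gen}_l + \mathrm{Loc}_i \times \mathrm{Gen}_l + \varepsilon_{ijkl}$$
$$Y_{ijl} = \mu + \mathrm{Loc}_i + \mathrm{Rep}_j(\mathrm{Loc}_i) + \mathrm{Gen}_l + \mathrm{Loc}_i \times \mathrm{Gen}_l + \varepsilon_{ijl}$$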
Also, in these last four combined models, all the effects are considered random, with two exceptions. Genotype is a fixed effect when calculating BLUEs, and the covariate is always a fixed effect.
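The rules for which terms are fixed and which are random can be summarized in a small helper. This is a sketch of the bookkeeping only, not of the SAS code: the function name is hypothetical, and the labels for the nested terms (for example `Rep(Loc)`) are an assumed notation.

```python
def model_terms(design, covariate, combined, estimator):
    """Return (fixed, random) term lists for a given analysis.

    design:    "lattice" or "rcbd"
    covariate: True if the MRV is adjusted by a covariate
    combined:  True for analyses across locations, False for one location
    estimator: "BLUE" (genotype fixed) or "BLUP" (genotype random)
    """
    if design not in ("lattice", "rcbd"):
        raise ValueError("design must be 'lattice' or 'rcbd'")
    if estimator not in ("BLUE", "BLUP"):
        raise ValueError("estimator must be 'BLUE' or 'BLUP'")

    if combined:
        random_terms = ["Loc", "Rep(Loc)"]
        if design == "lattice":
            random_terms.append("Block(Loc*Rep)")
        random_terms.append("Loc*Gen")
    else:
        random_terms = ["Rep"]
        if design == "lattice":
            random_terms.append("Block(Rep)")

    # The covariate, when present, is always a fixed effect.
    fixed_terms = ["Cov"] if covariate else []
    if estimator == "BLUE":
        fixed_terms.insert(0, "Gen")
    else:
        random_terms.append("Gen")
    return fixed_terms, random_terms

print(model_terms("lattice", True, False, "BLUE"))
print(model_terms("rcbd", False, True, "BLUP"))
```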
Broad-sense heritability of a given trait at an individual location is calculated as
$$h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2 / n_{\mathrm{reps}}}$$
where $\sigma_g^2$ and $\sigma_e^2$ are the genotype and error variance components, respectively, and $n_{\mathrm{reps}}$ is the number of replicates. For the combined analyses, the heritability is calculated as
$$h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{ge}^2 / n_{\mathrm{locs}} + \sigma_e^2 / (n_{\mathrm{locs}} \times n_{\mathrm{reps}})}$$
where the new term $\sigma_{ge}^2$ is the genotype × environment interaction variance component and $n_{\mathrm{locs}}$ is the number of locations in the analysis. In both cases, the heritability of a given trait at a location or across all locations is printed to a .csv file and to the screen. In the combined analyses, if a location is discarded because of low heritability (lower than the threshold selected), it will have the code –999 listed as additional information in the output (Table 1). The estimation of broad-sense heritability (repeatability) provides good insight into the quality of a breeding program for traits and environments that are well known.
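As a worked illustration, the two heritability calculations can be written as short functions. The sketch assumes the usual entry-mean form, in which the error variance is divided by the number of replicates and, in the combined case, the interaction variance by the number of locations; the function names and the variance components in the example are hypothetical.

```python
def heritability_single(var_g, var_e, n_reps):
    """Broad-sense heritability at one location (entry-mean basis)."""
    return var_g / (var_g + var_e / n_reps)

def heritability_combined(var_g, var_ge, var_e, n_locs, n_reps):
    """Broad-sense heritability across locations (entry-mean basis)."""
    return var_g / (var_g + var_ge / n_locs + var_e / (n_locs * n_reps))

# Hypothetical variance components.
h2_one = heritability_single(var_g=1.0, var_e=2.0, n_reps=2)
h2_all = heritability_combined(var_g=1.0, var_ge=2.0, var_e=4.0, n_locs=4, n_reps=2)
print(h2_one, h2_all)
```

More replicates or more locations shrink the non-genetic terms in the denominator, which is why heritability on an entry-mean basis rises with the size of the trial.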
Suboption 1 calculates phenotypic and genotypic correlations between locations, from which a distance matrix is calculated as the identity matrix (matrix with 1s on the diagonal and 0s in every other position) minus the genetic correlations matrix. Calculation of the genetic correlations matrix is explained below. The distance matrix is used as the input data set to create a dendrogram using PROC Cluster and PROC Tree and a biplot of principal component analysis (PCA) using PROC PrinComp. Both plots can be viewed directly on the screen or saved in a computer graphics metafile (.cgm), which can be imported into several Microsoft Office programs (PowerPoint, Word, Excel, etc.).
The phenotypic correlations of MRVs between locations are simple Pearson correlations between MRVs at the different location pairs, calculated using PROC Corr. The genetic correlations among locations are calculated using equations from Cooper et al. (1996):
$$\rho_g = \frac{\bar{\sigma}_{g(jj')}}{\overline{\sigma_{g(j)}\,\sigma_{g(j')}}}$$
where $\bar{\sigma}_{g(jj')}$ is the arithmetic mean of all pairwise genotypic covariances between environments $j$ and $j'$, and $\overline{\sigma_{g(j)}\,\sigma_{g(j')}}$ is the arithmetic average of all pairwise geometric means among the genotypic variance components of the environments (Cooper et al., 1996).
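Both kinds of correlation can be sketched in plain Python. The first function is an ordinary Pearson correlation between the genotype values at two locations. The second follows the verbal description above (mean pairwise genotypic covariance divided by the mean of the pairwise geometric means of the genotypic variances); it is one reading of that description, not a transcription of the META code, and all names and numbers are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equally long lists of values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pooled_genetic_correlation(gen_var, gen_cov):
    """Mean pairwise genotypic covariance over the mean pairwise geometric
    mean of the genotypic variance components.

    gen_var: {environment: genotypic variance}
    gen_cov: {(env_a, env_b): genotypic covariance}, keys in sorted order
    """
    pairs = list(combinations(sorted(gen_var), 2))
    mean_cov = sum(gen_cov[p] for p in pairs) / len(pairs)
    mean_geo = sum(math.sqrt(gen_var[a] * gen_var[b]) for a, b in pairs) / len(pairs)
    return mean_cov / mean_geo

# Hypothetical genotypic variances and covariances for three environments.
gen_var = {"E1": 4.0, "E2": 9.0, "E3": 1.0}
gen_cov = {("E1", "E2"): 3.0, ("E1", "E3"): 1.0, ("E2", "E3"): 1.5}
r_g = pooled_genetic_correlation(gen_var, gen_cov)
r_p = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(r_g, r_p)
```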
Suboptions 2 to 5 calculate BLUEs and BLUPs for the MRV and BLUEs for all other variables present in the data set. They will also calculate the number of replicates, number of locations, location variance, genotypic variance, genotype × location variance, residual variance, grand mean, LSD, CV, and broad-sense heritability for all traits and locations in the individual analyses and for all traits across locations in the combined analyses. The different suboptions change how location and management are analyzed. In Suboptions 2 and 3, each location is analyzed individually, but in Suboption 3, locations are organized by management condition. This is accomplished by including a “by loc” statement in PROC Mixed. For Suboption 4, locations are combined by management type by including a “by management” statement in PROC Mixed. For Suboption 5, all locations are combined. Different mixed model equations are used for the different experimental designs, with or without adjusting by the covariate, as detailed above.
When the program calculates BLUEs for the MRV, only the covariance parameter estimates and Type 3 tests of fixed effects are printed (all other information from PROC Mixed is suppressed). The simple mean of the genotypic BLUEs is also calculated and serves as the grand mean. The LSD (p = 0.05) is calculated as: LSD = t(1 – 0.05/2, dferror) × ASED, where t is the quantile of Student’s t distribution at cumulative probability 1 – 0.05/2, 0.05 is the α level selected, dferror is the degrees of freedom for error in the mixed model, and ASED is the average standard error of the differences for all pairwise comparisons between genotypes; it reflects the precision of the trial for that specific trait. The CV is calculated as: CV = (ASED/grand mean) × 100. The CV is highly dependent on the level of the grand mean of the trial; experiments under drought stress may often show high values of CV just because of a low grand mean. The program then calculates BLUPs for the MRV using the same equations as above but with genotype now considered a random effect; covariance parameter estimates and Type 3 tests of fixed effects are printed. The BLUP for each genotype is the grand mean added to the estimated random effect for each genotype. Heritability is then calculated using the equations provided above.
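The arithmetic behind the LSD, the CV, and the BLUP values is simple enough to sketch directly. In the example below the t quantile is passed in as a number because Python's standard library has no Student's t quantile function; the value 2.0 and all other inputs are hypothetical, and the function names are not those used in META.

```python
def average_sed(pairwise_sed):
    """Average standard error of the differences over all genotype pairs."""
    return sum(pairwise_sed) / len(pairwise_sed)

def lsd(t_quantile, ased):
    """LSD = t(1 - alpha/2, df_error) x ASED, with the t quantile supplied."""
    return t_quantile * ased

def cv_percent(ased, grand_mean):
    """CV = (ASED / grand mean) x 100."""
    return ased / grand_mean * 100.0

def blup_value(grand_mean, random_effect):
    """BLUP of a genotype = grand mean + its estimated random effect."""
    return grand_mean + random_effect

ased = average_sed([0.4, 0.5, 0.6])    # hypothetical pairwise SEDs
trial_lsd = lsd(2.0, ased)             # 2.0 is an illustrative t quantile
cv_optimal = cv_percent(ased, 10.0)    # high-yielding trial
cv_stressed = cv_percent(ased, 2.5)    # same precision, low grand mean
print(trial_lsd, cv_optimal, cv_stressed, blup_value(10.0, -0.8))
```

The last two CV values show the point made above: with identical precision (same ASED), the trial with the lower grand mean has a CV four times larger.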
Next, BLUEs, LSD, CV, and heritability are calculated for all other traits. The same equations as for the MRV are used, except no traits are adjusted by a covariate, and genotypes are always considered as fixed effects for BLUEs and as random effects for the calculation of heritability. Also, only BLUEs are calculated, not BLUPs. Covariance parameter estimates and Type 3 tests of fixed effects are printed. Heritability, LSD, and CV are calculated as above for the MRV. Finally, all estimates and statistics are printed to the screen and to a .csv file.
Example 1: Sample Lattice Data Set, Drought Tolerance Data
For the drought tolerance data, boxplots and frequency histograms were generated to visualize the data and identify outliers. Most of the sample data met our expectations; for example, with anthesis date, the dates under stressed conditions were generally later than under optimal conditions (Fig. 3a). For ear height, however, we were able to identify an outlier that was probably a data recording error for the Tlaltizapan, Mexico, optimal conditions location (Fig. 3b). This data point was removed from later analyses.
Patterns of phenotypic and genetic correlations were analyzed (Table 2), and a dendrogram of the location clustering (Supplemental Fig. 1a) and a PCA plot of the first two components from the distance matrix (Supplemental Fig. 1b) were created. Although the original data set contained five locations, one location had a heritability below our threshold of 0.05 (Table 1), so the program deleted that location from all analyses that combined data across locations. Before adjustment for anthesis date, all locations had a heritability greater than the cutoff (0.05), so the unadjusted data used all locations in all analyses that combined data across locations (Supplemental Table 4).
Next, the data were analyzed using Suboptions 13, 14, and 15, because this was a lattice design that was adjusted by anthesis date as a covariate. Given that 100 genotypes were analyzed, it would not be practical to present the results for each genotype; instead, the first three genotypes and the statistics were chosen as examples (Tables 3, 4, and 5). Due to space limitations, the results from the first two locations only were included in Table 3. Full results are available in Supplemental Tables 1 to 3. As shown in the tables, the output from the program is organized as follows: each column is a different trait, with two columns for the MRV (yield in this example) and one for the other traits. For each location and trait, the number of replications, estimates of variance components, grand mean, LSD, CV and heritability are provided.
| Statistic | Location | Genotype | BLUE for yield | BLUP for yield | AD | ASI | EH | PH |

| Statistic | Management | Genotype | BLUE for yield | BLUP for yield | AD | ASI | EH | PH |
| Location × genotype variance | 0.76 | 0.24 | 0.00 | 0.00 | 7.93 |
| Location × genotype variance | 0.19 | 0.11 | 2.33 | 27.62 | 76.15 |

| Statistic | Genotype | BLUE for yield | BLUP for yield | AD | ASI | EH | PH |
| Location × genotype variance | 0.59 | 0.14 | 1.31 | 4.81 | 36.66 |
Example 2: Sample Randomized Complete Block Design Data Set
The sample RCBD data set differed from the lattice data set because it looked at a fraction of the genotypes across more locations. Due to the large number of locations, the sample RCBD data set illustrates the ability of the program to cluster locations by genetic distance. There were three locations that behaved differently from the others: locations Gwoza, Tsafe, and Zuru formed a separate cluster, with Tsafe and Zuru grouping together in both the cluster and PCA analysis (Fig. 4). The reason that these locations are so distinct could not be determined from this analysis. Full results for each trait are available in Supplemental Tables 9 to 13 and Supplemental Fig. 2a and 2b.
Sample Lattice Data
Analysis using the META suite of SAS programs revealed some interesting patterns in our data. For example, although we might expect environments to cluster by management, instead they clustered most strongly by location (Supplemental Fig. 1a and 1b). The analysis also showed the value of adjusting yield by the anthesis date; when adjusted, we obtained different yield estimates. For the combined analysis across all locations with yield adjustment, the BLUP for the highest yielding genotype (B-292) was 8.13 Mg ha−1; without adjustment, it was 6.40 Mg ha−1 (Supplemental Tables 4 and 9). By looking at the analysis by management, we see that the largest change from adjustment occurred under stressed conditions. Again, looking at our highest yielding genotype, the BLUPs for yield under optimal conditions with and without a covariate were: 10.78 vs. 10.69, respectively, and under stressed conditions 5.11 vs. 3.41, respectively. This confirms previous work that showed how important it is to adjust by anthesis date when analyzing data from water-stressed environments (Banziger et al., 2004).
The program calculates both BLUEs and BLUPs for the MRV (yield in our examples); which estimator to use has been hotly debated in the literature (Piepho and Mohring, 2006; Smith et al., 2005). If the data are balanced and orthogonal, then the BLUPs and BLUEs will be equivalent; however, this is rarely the case in METs, especially if a lattice design is used. The choice of statistic can make a real difference; in the drought data set, the rankings for yield for BLUPs and BLUEs are identical until the seventh highest yielding genotype; however, the differences between rankings are within the LSD.
Sample Randomized Complete Block Design Data
The RCBD data set showed interesting clustering based on genetic correlations of grain yield at different locations; three locations were distinct from all others and formed their own cluster. They are all locations within Nigeria; however, the other Nigerian locations were indistinct and mixed with other locations in a large cluster. We were not able to identify the cause of this unique clustering. When selecting the top 20% of genotypes based on yield, we obtained the same results whether we used BLUPs or BLUEs. This is because the data were balanced within a trait and replicated across many locations. Nevertheless, selections differed between BLUEs and BLUPs when the best-yielding genotypes were chosen by country, so the list of the highest yielding genotypes for each country depends on which estimator is used.
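A comparison of this kind can be sketched by selecting the top fraction of genotypes under each estimator and checking where the two lists disagree. The function name, the rounding rule (the number selected is rounded down, with at least one genotype kept), and the yield values are assumptions made for illustration only.

```python
def top_fraction(estimates, fraction=0.2):
    """Names of the highest-ranking genotypes, best first.

    The number selected is fraction x number of genotypes, rounded down,
    with at least one genotype always selected (an assumed convention).
    """
    n_select = max(1, int(len(estimates) * fraction))
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    return ranked[:n_select]

# Hypothetical yield estimates (Mg/ha) for ten genotypes.
blue = {"G1": 9.0, "G2": 8.5, "G3": 8.4, "G4": 7.9, "G5": 7.5,
        "G6": 7.2, "G7": 7.0, "G8": 6.8, "G9": 6.1, "G10": 5.9}
blup = {"G1": 8.7, "G2": 8.1, "G3": 8.2, "G4": 7.8, "G5": 7.5,
        "G6": 7.3, "G7": 7.1, "G8": 6.9, "G9": 6.4, "G10": 6.2}

selected_blue = top_fraction(blue)
selected_blup = top_fraction(blup)
disagreement = set(selected_blue) ^ set(selected_blup)
print(selected_blue, selected_blup, sorted(disagreement))
```

In this made-up example the two estimators agree on the best genotype but differ on the second selection, which is the situation described above for the by-country results.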
Value of the Program
The META suite is a valuable tool for plant breeders because it allows them to rapidly analyze METs for phenotypic and genetic correlations between locations, BLUEs and BLUPs for an MRV, BLUEs for all other traits, heritability, LSD, and CV. It allows analysis of common designs with or without a covariate. Instructions in the user’s manual explain how to quickly expand the program to accommodate multiple covariates. The BLUEs, BLUPs, and adjustment by a covariate can also be calculated for any trait by selecting an option from a menu. Despite all the options in this program, no changes must be made to the SAS code; everything is run through a menu-driven interface. The flexibility, power, and ease of use of this program make it a valuable instrument in a breeder’s toolbox. The SAS code for META and the user’s manual are available for free download as supplemental material.