Figure 1.

Single nucleotide polymorphism (SNP) calling from genotyping-by-sequencing tags. For reference-independent SNP calling, a population-based filtering approach was used. (A) Putative SNPs were first identified by internal alignment of sequence tags, allowing 1 to 3 bp mismatches in a 64 bp tag. (B) The number of individuals (samples) in the population with each SNP allele was tallied and a Fisher exact test was conducted to test whether the two alleles were independent. Within an inbred line, alleles at a biallelic SNP locus should be mutually exclusive (i.e., the inbred line should not have both alleles). Putative SNPs that failed the Fisher test of independence (p-value < 0.001) were considered biallelic SNPs in the population and converted to SNP calls. (C) Based on presence–absence of the different tags in the individuals across the population, genotype scores were assigned. By incrementally increasing the stringency of the alignments, paralogous sequences on the alternate genomes could be filtered through genome-specific SNPs.
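
The population-based filter in step (B) can be illustrated with a minimal sketch. It assumes a 2 × 2 table of individuals cross-classified by presence or absence of each of the two tags; the table layout, the function names, and the example counts are illustrative assumptions, not the authors' pipeline. Only the threshold of 0.001 comes from the caption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x individuals in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))

def is_biallelic_snp(both, a_only, b_only, neither, alpha=0.001):
    """Assumed layout: rows = tag A present/absent, columns = tag B present/absent."""
    return fisher_exact_two_sided(both, a_only, b_only, neither) < alpha

# Hypothetical counts: the two tags almost never co-occur within an inbred line.
snp_like = is_biallelic_snp(both=0, a_only=120, b_only=130, neither=4)
# Hypothetical counts: the tags occur independently of each other.
independent = is_biallelic_snp(both=25, a_only=25, b_only=25, neither=25)
```

In this sketch, a tag pair whose alleles are mutually exclusive across lines rejects independence and is kept as a SNP, whereas a pair that co-occurs freely is discarded.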


Figure 2.

Phenotypic distribution of four agronomic traits in the Cycle 29 Semi-Arid Wheat Screening Nursery (SAWSN). Each panel shows the distribution of best linear unbiased estimates for the 254 lines from the Cycle 29 SAWSN used in this study.


Figure 3.

Marker imputation error on 254 breeding lines in the Cycle 29 Semi-Arid Wheat Screening Nursery. For each of 250 randomly chosen markers from the full set of 34,749 genotyping-by-sequencing (GBS) markers, 25 genotypes were masked and the imputed genotypes were compared with the observed genotypes. Panel A shows imputation error at different levels of missing data. The colors indicate what fraction of the 254 genotypes was missing before masking the 25 additional genotypes. The upper and lower limits of the range of missing data for each test are shown in the legend. In panel B, results are shown as a function of the minor allele frequency. The median (column height) and first and third quartile (error bars) statistics are shown for four imputation methods: (i) heterozygote (het), (ii) population mean, (iii) multivariate normal expectation maximization (EM), and (iv) random forest (RF) regression.
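
The masking procedure can be sketched for a single marker using the population mean method. The genotype coding (-1, 0, 1, with `None` for missing), the mean absolute error metric, and the function name are assumptions made for illustration; the caption does not specify them.

```python
import random

def masked_mean_imputation_error(genotypes, n_mask=25, seed=0):
    """Mask n_mask observed genotypes, impute them with the mean of the
    remaining observed genotypes, and return the mean absolute error."""
    rng = random.Random(seed)
    observed = [i for i, g in enumerate(genotypes) if g is not None]
    if len(observed) <= n_mask:
        raise ValueError("need more observed genotypes than masked ones")
    masked = set(rng.sample(observed, n_mask))
    kept = [genotypes[i] for i in observed if i not in masked]
    mean = sum(kept) / len(kept)  # population mean imputation value
    return sum(abs(genotypes[i] - mean) for i in masked) / n_mask

# Hypothetical marker scored on 254 lines, 54 of which are already missing.
marker = [1] * 150 + [-1] * 50 + [None] * 54
error = masked_mean_imputation_error(marker)
```

Repeating this over many markers and summarizing the errors by median and quartiles would give statistics of the kind plotted in the figure.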


Figure 4.

Principal component analysis of breeding lines from the Cycle 29 Semi-Arid Wheat Screening Nursery. Position of 254 wheat lines in the coordinate system defined by the first two principal components using multivariate normal expectation maximization imputed genotypes. The points are color coded according to the seven folds used in the genomic prediction cross-validation scheme. Panel A is based on genotyping-by-sequencing (GBS) markers and panel B is based on Diversity Array Technology (DArT) markers.


Figure 5.

Cross-validation accuracy of genomic selection models for predicting line performance in the Cycle 29 Semi-Arid Wheat Screening Nursery, CIMMYT, using genotyping-by-sequencing (GBS) and Diversity Array Technology (DArT) markers on 254 elite breeding lines. Each trait was evaluated using sevenfold cross validation, with sister lines from a single cross grouped in the same fold. Significant differences among marker types within traits are denoted by letters above the bars. The approximate number of markers in each set is given in parentheses. The actual numbers of markers are 1729 for DArT, 1827 for GBS (2K), and 34,749 for GBS (35K). Genotyping-by-sequencing (2K) markers have up to 20% missing data per marker and GBS (35K) markers have up to 80% missing data per marker.
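
Fold assignment that keeps sister lines together can be sketched as follows. The greedy balancing rule, the function name, and the example line and cross identifiers are assumptions for illustration; the caption states only that sister lines from a single cross share a fold and that seven folds were used.

```python
from collections import defaultdict

def assign_folds_by_cross(line_to_cross, n_folds=7):
    """Assign lines to folds so that all lines from one cross share a fold."""
    families = defaultdict(list)
    for line, cross in line_to_cross.items():
        families[cross].append(line)
    folds = [[] for _ in range(n_folds)]
    # Place the largest families first, each into the currently smallest fold.
    for cross in sorted(families, key=lambda c: (-len(families[c]), c)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(families[cross])
    return folds

# Hypothetical pedigree: 254 lines derived from 40 crosses.
example = {f"line{i}": f"cross{i % 40}" for i in range(254)}
folds = assign_folds_by_cross(example)
```

Grouping by cross prevents a line in the validation fold from having a full sib in the training set, which would otherwise inflate the apparent prediction accuracy.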