Iron deficiency chlorosis results from the inability of some genotypes to efficiently mobilize iron into the plant when it is growing in high pH calcareous soils. In these soils, ferrous iron is not readily oxidized to ferric iron, and, subsequently, iron availability is limited. Under such conditions, the concentration of iron in the soil is not higher than 100 pmol (Stephan, 2002). Based on its response to Fe availability, soybean is considered a Strategy I plant (Marschner et al., 1986). Strategy I plants first release H+ ions from the root surface into the soil by the proton pumping activity of a H+ adenosine triphosphatase (ATPase). This lowers the soil pH, which in turn initiates the dissociation of Fe(OH)3 complexes into ferrous ions. Next, Fe3+ is reduced by Fe3+ chelate reductase to the more soluble Fe2+. And third, iron transporters move the Fe2+ into the root. Strategy I plants also increase root hair formation, thereby increasing the surface area available for iron uptake (Schmidt, 1999). Once iron has entered the root, it is then moved via membrane transporters into the xylem where it most likely chelates with citrate. The chelated form of iron then moves through the xylem stream to growing leaves. Finally, iron is mobilized from the leaves, forms a complex with nicotianamine, and is transported via the phloem to younger leaves and seed.
Excess water is another factor that accentuates IDC in calcareous soils during the early stages of soybean development. This leads to an elevated concentration of bicarbonates in the root apoplast that impedes the Fe3+–chelate reductase activity necessary for the conversion of Fe3+ to Fe2+. Bicarbonates also immobilize the movement of iron to young leaves once it is absorbed at the root level (Barker and Pilbeam, 2007).
From a genetic perspective, IDC is clearly a quantitative trait where multiple genetic factors are involved in the expression of the proteins necessary for the uptake of iron from the soil and its distribution through the plant. Therefore it was not surprising that studies designed to understand the genetic nature of the IDC response in soybean identified multiple quantitative trait loci (QTL) (Diers et al., 1992; Lin et al., 1997, 2000; Charlson et al., 2003, 2005). These original studies used biparental populations, and the QTL discovered using one population were often population specific (Diers et al., 1992; Charlson et al., 2005). From an applied plant breeding perspective this minimizes the effectiveness of the biparental marker approach. At the same time, these studies reinforce the observation that IDC is complex trait.
Association mapping (AM) is an alternative to discovering genetic factors using biparental crosses. Association mapping uses the linkage disequilibrium (LD) pattern in a large population of unrelated individuals (Risch, 2000). As such, it can identify common genetic variants that control a common phenotype. Given its complex nature, utilizing AM to discover major factors controlling the IDC response in soybean seems to be an appropriate research direction. Indeed, an early AM study in soybean using a limited number of simple sequence repeat (SSR) markers discovered two markers that were reproducibly associated with IDC in two independent populations (Wang et al., 2008). Here a genome-wide discovery effort using the universal 1536 soybean Golden Gate single nucleotide polymorphism (SNP) set (Hyten et al., 2010) with two independent populations was undertaken. Multiple genomic regions distributed throughout the soybean genome were discovered to be associated with the IDC phenotype. A number of genes known to be involved in iron metabolism were found to be closely linked to these loci. The application of regional testing trials for AM studies is also considered.
Materials and Methods
Populations, Phenotyping, and Genotyping
Two independent populations, each consisting of a unique set of advanced soybean breeding lines, developed by public and private breeding programs for 0 and early I maturity groups for the north central states of the United States, were evaluated in 2005 (n = 143) and 2006 (n = 141). The 2005 population was grown at five sites near Arthur, Ayr, Chaffee, Colfax, and Galesburg, ND. The soil at these sites had a pH varying from 7.8 to 8.1, salinity (electrical conductivity) from 0.4 to 0.2 S m−1, and CaCO3 contents ranged from 2 to 11%. Thirty-five seeds were planted in 1.53 m rows on 76.2 cm centers. The experimental design was a randomized complete block design with four replications at each site. Two visual observations were made at each location at the two to three and five to six trifoliolate stages. The second independent population was grown in 2006 at Arthur, Colfax, Galesburg, and Prosper, ND. The soil at these sites had pH varying from 8.1 to 8.3, salinity (electrical conductivity) from 0.02 to 0.08 S m−1 and CaCO3 contents ranged from 2 to 8%. The experimental design and the IDC rating scales were the same as for the year 2005. The visual observations were made at two to three trifoliolate and five to six trifoliolate stages and also 2 wk later. Only two observations could be made at Prosper due to recovery of the plants from chlorosis. Ten standard varieties, listed in descending order of IDC tolerance, were included. These are ‘ISU A11’, ‘Seeds 2000 2070’, ‘Traill’, ‘Council’, ‘Asgrow 0801’, ‘Peterson PFS 0202’, ‘Glacier’, ‘Mycogen 5072’, ‘Stine 0480’, and ‘NuTech 0505’. Iron deficiency chlorosis was ratedon a 1 to 5 scale where 1 indicates no chlorosis and a normal green plant, 2 is used when there is a slight yellowing of upper leaves and the leaf veins and interveinal area do not show a differentiation in the color, 3 shows an interveinal chlorosis in the upper leaves while no obvious stunning of growth or death of tissue (necrosis) could be observed, 4 is used when interveinal chlorosis of the upper leaves is observed along with some apparent stunning of growth or necrosis of tissue, and 5 points severe chlorosis plus stunned growth and necrosis in the youngest leaves and growing points (Wang et al., 2008). All lines were grown in the greenhouse, young leaves were harvested and stored at –80°C, and DNA was extracted using the procedure of Brady et al. (1998). Each sample was genotyped using the Illumina GoldenGate SNP assay with the Universal Soy Linkage Panel (USLP) 1.0 (Hyten et al., 2010).
fastPHASE 1.3 (Scheet and Stephens, 2006) was used to impute missing data for these two sets of loci using “likelihood” based imputation. The default settings were used. Of the 1265 polymorphic markers, 858 and 868 loci, for the 2005 and 2006 populations, respectively, with minor allele frequencies (MAFs) >10%, were selected for analysis. Of these, 816 markers were common to the two populations. The polymorphic information content (PIC) was estimated separately for each population using the PowerMarker software (Liu and Muse, 2005).
Pairwise Linkage Disequilibrium and Linkage Disequilibrium Decay
The extent of LD was estimated as the squared allele frequency correlation (R2) for each populations using TASSEL v.2.1 (Bradbury et al., 2007). Linkage disequilibrium decay graphs were plotted with genetic (cM) or physical distance (Mbp) vs. R2 for each marker pair locus located on the same chromosome using nonlinear regression as described by Remington et al. (2001). The expected decay of LD was estimated according to the following equation:The above equation was described by Pyhajarvi et al. (2007) where n denotes the number of sequences, p = 4Nec between adjacent sites, d is the distance between the two sites of a pairwise comparison, and c is the recombination rate (Hill and Weir 1988). We fit this equation into a nonlinear regression model using NLIN procedures in SAS v. 9.2 (SAS Institute, 2002). The analyses were also performed for individual chromosomes in both the populations.
Population Structure and Kinship
Estimation of population structure (Q) and kinship relationships were derived using only loci that had pairwise R2 values < 0.5 for all possible combinations. In the 2005 population, 306 marker loci met this criterion, while in 2006 population, the number was 303. Population structure was first characterized using STRUCTURE.2.3 to estimate subpopulation membership of each line in these two populations individually (Pritchard et al., 2000). The admixture model with correlated allele frequencies was used with a burn-in of 100,000 and 500,000 iterations for subpopulations numbers ranging from 1 to 15. Five runs for each K value were performed, and the posterior probability was determined for each run. The optimum number of subpopulations was determined by the Wilcoxon two sample t test as described by Rosenberg et al. (2001) by comparing the posterior probability for successive adjacent subpopulations numbers (K2 vs. K3, K3 vs. K4, and so on) using the NPAR1WAY procedure in SAS (SAS Institute, 2002). The smaller K value in a pairwise comparison for the first nonsignificant Wilcoxson test was chosen as the best number of subpopulations. Principal component analysis (PCA) was also used to control for population structure in the two populations individually. The PCA was performed using the PRINCOMP procedure in SAS. The number of principal components (eigenvectors per combination of SNP markers) that collectively explained 25% of the variation was selected for the analysis.
A pairwise kinship coefficient matrix (K-matrix) that estimates the probability of recent coancestory between genotypes (Loiselle et al., 1995) was determined using SPAGeDi 1.2 (Hardy and Vekemans, 2002). The appropriate formula is:where rij is the pairwise kinship coefficient, Fij is an estimator of the coefficient, Qij is the probability of the identity by state between random loci for genotypes i and j, and Qm is the average probability of identity by state for loci from random genotypes in the population used to draw i and j. The Fij was calculated for all pairwise combinations in each of the two populations. Negative values for the kinship matrix was set to zero as described by Yu et al. (2006). PowerMarker software (Liu and Muse, 2005) used to estimate a second kinship coefficient matrix K* (Zhao et al., 2007) that represents the proportion of shared alleles for all pairwise comparisons in each population.
Marker-Trait Association Model Testing
The IDC phenotypic data were analyzed as an adjusted entry mean using the statistical model:where yijk was the mean of the two IDC ratings for the ith genotype in the kth replication at jth location, u is an intercept term, gi was the genetic effect of the ith genotype, lj is the effect of jth location, rjk was the effect of kth replicate at jth location, (gl)if was the effect of genotype × environment interaction, and εijk was the residual. For the adjusted entry means calculation, gi is considered to be a fixed effect. Over all locations and replicates, an adjusted entry means (Mi) was calculated for each genotype as Mi = u′ + g′i, where u′ and g′i denote the generalized least square estimates of u and gi, respectively. This model is essentially the same as that used by Stitch et al. (2008).
Nine different linear regression models were tested for marker-trait association using the MIXED procedure in SAS (SAS Institute, 2002) (Table 1). Six mixed-linear models (MLMs) considered both fixed and random effects while the remaining three general linear models (GLMs) considered only the fixed effects. In these models, y is a vector for phenotypic observations, α is the fixed effects related to the SNP marker, β is a vector of the fixed effects related to the population structure, ν is a vector of the random effects related to the relatedness among the individuals, and ε is a vector of the residual effects. X is genotypes of the SNP markers, P is the matrix of the principle components, K is the Loiselle kinship matrix, and K* is the shared allele kinship matrix developed in PowerMarker (Liu and Muse, 2005). The variances of the random effects were estimated as Var(u) = 2KVg and Var(e) = IVR, where K is a kinship matrix, I is an identity matrix with the off-diagonal elements recorded as 0 and diagonal elements is the reciprocal of the number of the observations for which the phenotypic data were obtained, Vg is the genetic variance, and VR is the residual variance. For each marker, the positive false discovery rate (pFDR) was estimated using the PROC MULTTEST in SAS to correct for multiple marker trait association. For each model, all marker p-values were ranked from smallest to largest, and the mean square deviation (MSD) was calculated as:where i is the rank number, pi is the probability of the ith ranked p-value, and n is the number of markers. Significant markers were selected only from the model determined to have the lowest MSD value for each year. A marker was considered to be repeatable if it was significant based on the pFDR test in each year. The multiple R2 value for all loci were calculated using stepwise regression with forward selection using the PROC REG function in SAS. To detect the epistatic interactions between markers we used a general linear model and the p-value of interaction was used to test the significance.
|Model||Statistical model||Information captured in the model|
|Naïve||y = Xα + ε||y is related to X, without correction for structure (Q or PCA) or relatedness (K or K*)|
|K||y = Xα + Kν + ε||y is related to X, with correction for K|
|K*||y = Xα + K*ν + ε||y is related to X, with correction K*|
|Q||y = Xα + Qβ + ε||y is related to X, with correction for Q|
|PCA||y = Xα + Pβ + ε||y is related to X, with correction for PCA|
|Q + K||y = Xα + Qβ + Kν + ε||y is related to X, with correction for Q and K|
|Q + K*||y = Xα + Qβ + K*ν + ε||y is related to X, with correction for Q and K*|
|PCA + K||y = Xα + Pβ + Kν + ε||y is related to X, along with correction for PCA and K|
|PCA + K*||y = Xα + Pβ + K*ν + ε||y is related to X, along with correction for PCA and K*|
All Arabidopsis thaliana (L.) Heynh. proteins genes demonstrated to be involved in iron metabolism (as summarized in Morrissey and Guerinot, 2009) were used as a query in a blastp analysis against the proteins defined in the 1.01 annotation of the soybean genome (Joint Genome Institute, 2010). Our search was limited the top 20 hits with an E-value cut off of 10−20.
Phenotypic Analysis for Iron Deficiency Chlorosis Scores for the Two Soybean Populations
Since IDC is an important yield limiting factor in soybean, the populations were evaluated in production fields where IDC was consistently noted in the past. The visual IDC scores for the 2005 population ranged from 1.5 to 3.8 with an average of 2.9, while the scores for the 2006 population ranged from 1.6 to 3.8 with an average of 2.7. Analysis of variance for the IDC scores from the 2005 and the 2006 populations showed significant genotype and location effects as well as a significant line × location interaction effect (Table 2). This further substantiates that both genetic and environmental factors influence the IDC response in soybean. Broad-sense heritability on an entry mean basis was also deduced from the ANOVA. The broad-sense heritability values were 0.99 for 2005 population and 0.97 for 2006 population. These values demonstrate the consistency of the IDC rating. The distribution of IDC scores for the two populations (Fig. 1A and 1B) was determined to be normal using the Kolmogorov-Smirnov test with p-values of 0.15 and 0.18, respectively, for the 2005 and the 2006 populations.
|Source of variation||df||MS||df||MS|
|Location × line||569||0.25||420||0.33|
|Replication × location||15||2.11||12||2.08|
Single Nucleotide Polymorphism Marker Analysis
Single nucleotide polymorphism marker information was collected in each population at 1265 informative loci with the Universal Soy Linkage SNP Panel 1.0 (Hyten et al., 2010) using the IlluminaGoldenGate Assay technology. Of the 1265 SNP marker loci, 858 markers in the 2005 population and 868 markers in the 2006 population had a MAF > 10%. The Wilcoxon two-sample test was not significant (p = 0.3439) for the comparison of the major allele frequency in the two populations. The expected heterozygosity is generally low for SNP markers because of their biallelic nature and the selfing nature of G. max. Gene diversity for the 2005 genotypes ranged from 0.19 to 0.50 with an average of 0.39 and for the 2006 genotypes it ranged from 0.19 to 0.50 with an average of 0.39. The markers in both populations were polymorphic with PIC values ranging from 0.17 to 0.38 for 2005 population and 0.17 to 0.38 for 2006 population.
Linkage Disequilibrium Decay, Population Structure, and Kinship Analysis
A nonlinear regression model that estimates the decay of LD with distance was developed. Using a pairwise analysis for all 858 and 868 SNP loci, respectively, in the 2005 and 2006 populations, R2 values were regressed on the physical distances (Fig. 2). The average decay of LD in terms of physical distance declined to R2 < 0.1 at 7.0 Mbp (19.3 cM) and 5.9 Mbp (19.7 cM) in 2005 and 2006, respectively.
From the R2 data, 306 markers from the 2005 population (367,653 SNP marker-pair comparisons) and 303 markers (376,278 SNP marker-pair comparisons) from the 2006 population had R2 < 0.5 among all pairwise comparisons. These two marker subsets were then used to decipher population structure and kinship. Population structure was estimated with the software program, STRUCTURE, using the admixture model for the multilocus genotype data (Pritchard et al., 2000) and the subpopulation number selection criteria of Rosenberg et al. (2001). This analysis determined the 2005 population consisted of seven subpopulations, while the 2006 population comprised three subpopulations. Principal component analysis was also implemented to evaluate population structure. In 2005, 28% of the variance was explained by four principal components, where 11, 6.5, 5.2, and 4.6% are the variance explained by the first to fourth components respectively. In 2006, 29% of the variance was explained by four principal components where 10, 7.6, 6.6, and 4.8% are the variance explained by the first to fourth components, respectively.
Single Nucleotide Polymorphism and Iron Deficiency Chlorosis Marker–Trait Associations
Because of the significant location, line, and location × line interaction effects, we used adjusted entry mean IDC scores in our statistical model analyses. Independent marker-trait associations were conducted for 858 and 868 markers, respectively, for the 2005 and the 2006 populations. These numbers of markers represent the ∼70% of the markers with a MAF > 0.10 in each populations. The genotypic and IDC data were evaluated using nine different models described in Table 1. These models, described in Zhao et al. (2007), are often used in plant AM experiments. Model testing was performed to determine which had the least inflation of significant values at the p < 0.05 level (Table 3). For both years, a model that included population structure and kinship factors had the least percentage of p-values less than 0.05. The ideal model would exhibit a uniform distribution when cumulative p-values are regressed on observed p-values. To observe the degree to which the statistical results for each model deviated from the expected distribution, we calculated the MSD for each model. Again, models that contained a structure component (Q in 2005 and PCA in 2006) and the shared allele kinship component had the lowest MSD values. These models were chosen to select significant SNP marker–trait associations.
|Model||Percent p-values < 0.05||MSD||Model||Percent p-values < 0.05||MSD|
|Q + K*||15.51||0.017||PCA + K*||18.72||0.015|
|PCA + K||18.07||0.028||PCA||23.85||0.036|
|Q||20.86||0.032||Q + K||23.86||0.037|
|Q + K||22.03||0.034||PCA + K||24.45||0.038|
|PCA + K*||50.59||0.203||K*||76.03||0.276|
|K||85.42||0.302||Q + K*||73.97||0.281|
Two independent soybean populations containing advanced breeding lines from public and private breeding programs were evaluated. Lines from 31 programs in 2005 and 30 programs in 2006 were included in the analysis. None of the lines contained the same SNP marker haplotype. We considered each population to be an independent reciprocal confirmation population for any significant markers discovered in the other population. A total of 42 loci met the pFDR < 0.1 criteria for the 2005 population. Of these, 28 fit into a stepwise regression with forward selection and explained 74.5% of the phenotypic variation. For the 2006 population, 88 met the criteria, and 70 of the loci explained 93.8% of the phenotypic variation. None of the markers was significant using the Bonferroni correction factor in 2005, while 13 were significant using this conservative correction factor in 2006. Those 13 accounted for 47.6% of the phenotypic variation.
Next, we used each population as a confirmation population for the other population and only selected those markers that were significant in both years and met the criteria of having a pFDR value < 0.10 in each year. Nine SNP markers, distributed over seven genomic regions on six chromosomes, met the significance criteria (Table 4). None of these nine exhibited epistatic effects in 2005, while several two and three epistatic effects were detected in 2006. Three consecutive markers located within a 408 kb (2.4 cM) window on chromosome Gm3 were significant. These three marker loci were in a high degree of LD (R2 > 0.70 in each year). The MAF for most of the nine loci was >0.3 and for over half of the loci the value was >0.4 (Table 5). The trend was that the IDC means were larger in 2005 than 2006 for both the minor and major allele. This is also reflected in the phenotypic mean differences between the 2 yr. In 2005, the lower mean IDC score at any one marker was ∼2.7. For this population, 40% of the entries had a mean less than this value. For the 2006 population, the trend was that the lower IDC was ∼2.5, and the phenotypic rating of 44% of the entries was equal to or less than this value. For nearly all of the loci, the R2 value within each year was >10%. The only exception was BARC-010457-00640 located on Gm6. The multiple R2 value was calculated for these nine makers. Only one Gm3 marker locus (BARC-044603-08734) was selected in each year to be included in the analysis. For the 2005 population, the multiple R2 value was 43.7%, and for 2006 the value was 47.6%.
|BARC SNP marker||Chromosome||SNP position (bp)||Genetic position (cM)||Minor allele||Major allele|
|BARC SNP marker||Chromosome||–log10(p)||pFDR||R2 (%)||Minor allele frequency||Minor allele mean||Major allele mean||–log10(p)||pFDR||R2 (%)||Minor allele frequency||Minor allele mean||Major allele mean|
Local or regional variety testing trials were established by public institutions to provide consistent data for a specific trait(s) of interest in support of both public and private plant breeding programs. Since these trials are often large in scale (>100 entries), they are ideal for adopting AM techniques because the collection of lines in those trials represents many rounds of recombination. Another major advantage of these trials is that they are performed by individuals highly trained for specific phenotypic evaluations, and therefore the phenotypic data generated on a year-to-year basis is often consistent. Furthermore, since the data are collected over multiple years, any two or more distinct populations can serve as a reciprocal confirmation population(s) for significant marker–trait associations in any one population. Given that the best designed trials are performed at multiple locations that often represent the environmental diversity for a specific agro-ecosystem, it is not surprising that environment × genotype interactions are often observed. Therefore, it is necessary, as we did here, to address this interaction effect by incorporating an adjusted means analysis step before searching for marker–trait associations (Stitch et al., 2008).
A major challenge for AM is to ensure any marker–trait associations are genetically significant and not the result of spurious associations due to population structure and/or relatedness. Regional trials often include multiple genotypes from multiple breeding programs, and as shown for barley (Hordeum vulgare L.), populations from multiple breeding programs can be structured based on the programs (Hamblin et al., 2010). To adjust for these potential confounding effects, it is now standard to evaluate genotypic and phenotypic data using several statistical models that account for population structure, genetic relatedness, coancestry based on pedigree, or some combination of these factors (Zhao et al., 2007; Stitch et al., 2008). Here we compared the results from nine different statistical models, because the effects of controlling complex structure (population structure, principal components, and relative kinship matrices) varies with populations, traits, or both (reviewed in Sun et al., 2010) Given that most of the lines evaluated in the trial were from private programs, we did not have access to pedigree information that would have allowed us to include a coancestry factor. To select the appropriate model, the mean square deviation is an appropriate measure. The principle here is that the distribution of cumulative vs. observed p-values should approximate a uniform distribution. This implies that 1% of the marker–trait p-values should be less than 0.01, that 5% of the marker–trait should have a p-values should be less than 0.05, and so on. If a particular model fits this distribution, then the MSD should be small. In our case, in each year the MSD of a model with a structure component (PCA or Q) and the same shared allele kinship component (K*) was found to be small and about 50% smaller than the second best model. Somewhat surprisingly, the naïve model that does not consider either structure or relatedness had a lower MSD value than several models that did consider these factors. Collectively, these results imply that model testing is necessary for whatever data set is under consideration, and a single model does not perform best in multiple years even when considering the same phenotype.
Iron metabolism in plants involves multiple genes that are associated with the acidification of the soil (using A. thaliana nomenclature, AHA2), iron reduction (AtFRO2), transport into the root (AtIRT1), sequestration of iron in the vacuole (AtIREG2), transport of iron carriers into the xylem (FRD3), distribution of chelated iron to the phloem (YSL transporter family), synthesis of the nicotianamine chelator (NAS3), carriers of iron into the seed (ITP carrier protein family), transport into the seed (OPT transporter family), transport into (VIT1) and out of (NRAMP3 and NRAMP4) the vacuole, iron binding in the vacuole (FER2), and transport into the chloroplast (FRO7 and TIC21) (Morrissey and Guerinot, 2009). In addition, transcription factors such as FIT regulate the expression of genes such as IRT1 (Colangelo and Guerinot, 2004). Given the complex nature of iron metabolism in plants, we were expecting to observe multiple associations with our data. When we applied our cutoff criteria of pFDR < 0.10, we detected 42 significant marker–trait associations for the 2005 population and 88 marker–trait associations for the 2006 population. Given that we were using each population as a reciprocal confirmation population of any association, we next checked for those markers that met the significance criteria in both years. In this case, nine markers distributed over six chromosomes were found to be significant in each year. These loci accounted for ∼45% of the variation in IDC ratings in each year. The availability of data from multiple years of a standard performance trial is an advantage for AM because it provides the necessary data and materials needed to confirm marker–trait associations. By using reciprocal confirmation populations, we are confident that some gene(s) in the vicinity of these markers are involved in the iron response of soybean grown on deficiency-inducing soils.
One of the goals of AM is to use the associations as points of departure to discover the actual genes involved in controlling the phenotype. The most extensive experiment to date was performed in A. thaliana where 107 traits were mapped using genotype data from a high density array (Atwell et al., 2010). For several of these associations, the peak SNP mapped at or very near the gene known to control a specific phenotype. Given that the density of markers in that experiment is much greater than in this experiment, we wished to determine if we could also discover any potential candidate genes.
The average distance between markers in this analysis was 976,723 bp. That value though includes markers found in the sparsely marked repetitive region of the genome. If we only consider markers that map in the euchromatic region of the genome, that distance is reduced to 525,640 bp. Since LD decay at R2 < 0.5, a value where loci are still considered to be in LD, was ∼600 kb in both years, it seemed reasonable to consider a search for candidate genes linked to significant markers within this interval distance. We performed a blastp analysis using A. thaliana proteins genes demonstrated to be involved in iron metabolism (Morrissey and Guerinot, 2009) and queried the 1.01 annotation of the soybean genome (Joint Genome Institute, 2010). The results were limited to the top 20 hits with an E-value cut off of 10−20. A total of 161 soybean gene models met these criteria with a median E-value of 0. We next considered only those genes that mapped within 500 kb of a significant marker. If a nonsignificant marker was located between the gene and the significant marker, that gene was excluded from further consideration as a candidate gene. A total of 15 genes met this criterion from the two populations.
We first evaluated only those iron metabolism genes linked to a marker that was significant in both years. Two gene models, Glyma03 g39050 (annotated as NAS3) and Glyma19 g32400 (OPT1) (Table 6) were identified. In A. thaliana, NAS3 encodes nicotianamine synthase, an enzyme that synthesizes nicotianamine, a molecule that complexes Fe and carries it via the phloem to younger leaves and flowers. OPT1 is a protein that transports Fe into seeds. NAS3 maps within the peak of three consecutive SNPs on Gm3. BARC-060109-16388, the SNP closest to the gene, showed the peak R2 value while the two neighboring markers had lower R2 values. These three markers exhibited high LD values each year (R2 > 0.7) and presumably are signals for the same factor that affects IDC phenotypic expression. Since the marker linked closest to NAS3 had the highest R2 value in each year, it would appear NAS3 is an important factor in IDC tolerance. NAS3 encodes the enzyme that synthesizes the carrier that transports iron out of older leaves and via the phloem to younger leaves and flowers. Importantly, the IDC rating is made on these younger leaves. Therefore, from a physiological perspective NAS3 would appear to be a candidate gene. This result is also consistent with previous genetic research with biparental populations that identified a major QTL on Gm3 (Lin et al., 1997, 2000). Finally, the gene maps in the QTL interval on Gm3 transferred into the soybean line Clark to develop the IDC susceptible line Iso Clark (Severin et al., 2010). Although this is compelling evidence for the role of NAS3 in IDC tolerance, it does not preclude other genes within this region from affecting the IDC response.
|BARC marker||Chromosome||SNP position (bp)||–log10 (p)||pFDR||R2 (%)||–log10 (p)||pFDR||R2 (%)||At gene||Gm gene model||Start of model (bp)||Distance from SNP (bp)||E-value||Percent identity|
|BARC-060109-16388||3||45,391,018||3.32||0.02||16.26||3.78||0.00||13.94||NAS3||Glyma03g39050||45,279,921||111,097||5.00 E × 10–109||62.9|
|BARC-021775-04203||5||41,114,078||0.08||0.53||0.07||2.47||0.03||9.30||BTI2-ITP||Glyma05g37300||40,906,083||207,995||1.00 E × 10–95||65.7|
|BARC-054331-12480||7||8,652,831||1.72||0.09||4.78||BTI2-ITP||Glyma07g10870||9,082,076||429,245||2.00 E × 10–45||36.8|
|BARC-062275-17736||11||38,020,165||0.02||0.56||1.64||2.63||0.02||0.18||FER4||Glyma11g35610||37,245,783||774,382||7.00 E × 10–103||73.2|
|BARC-017917-02456||13||30,457,599||2.17||0.08||7.83||FRD3||Glyma13g27300||30,477,485||19,886||2.00 E × 10–137||54.7|
|BARC-043041-08509||15||48,694,193||2.71||0.05||12.66||0.94||0.26||9.24||IRT1||Glyma15g41620||48,764,845||70,652||8.00 E × 10–80||43.5|
|BARC-012289-01799||18||1,957,710||1.78||0.08||2.07||FER4||Glyma18g02800||1,821,344||136,366||6.00 E × 10–102||75.1|
|BARC-059723-16418||19||40,357,687||3.42||0.02||15.41||4.12||0.00||10.68||OPT1||Glyma19g32400||40,140,993||216,694||3.00 E × 10–146||47.4|
We next evaluated those genes that mapped next to a significant marker in either year. This would allow us to potentially identify candidates that are unique to one of the two populations. Four unique candidates linked to significant markers (p < 0.05 and pFDR < 0.1) were discovered in the 2005 population, while eight were observed for the 2006 population. Most of these markers had a small effect (R2 < 5%). In 2005, the largest effect (12.7%) was noted for the marker near gene model Glyma15 g41620. This model is highly similar to IRT1, the gene that encodes the protein that transports iron into the root. A number of other iron metabolism genes were also observed. When all markers linked to genes involved in iron metabolism were included in a stepwise regression analysis, the R2 value in 2005 was 37% while those markers linked in 2006 had a value of 40%.
Previous research identified IDC QTL on Gm3, Gm5, Gm12, Gm14, Gm18, Gm19, and Gm20 (Lin et al., 1997, 2000). We located these QTL on the 1.01 build of soybean (Schmutz et al., 2010) using the location of the reference SSR and restriction fragment length polymorphism (RFLP) sequences that were associated with the QTL. Those loci were compared to the results presented here. The best match was the Gm19 QTL designated by SSR Satt481. This SSR was evaluated for its utility for marker assisted selection and was found to be effective across locations to select superior IDC lines within a single breeding population. This SSR maps immediately adjacent to SNP BARC-059721-16418, a locus that was significant in both 2005 (R2 = 15.4%) and 2006 (R2 = 10.7%). It also maps near the OPT1 candidate gene that is involved in movement of iron into seed. Two QTL were mapped on Gm3, but neither of these mapped at the association that centers around SNP BARC-060109-16388 at position 45,391,018 bp. Near-isogenic lines (Clark and Iso Clark) were developed that expressed different IDC ratings primarily because of different allelic states of this Gm3 QTL. This QTL was recently mapped by Severin et al. (2010) to a 9.8 Mbp interval (36.3–45.8 Mbp). This interval contains both the Satt481 SSR and BARC-060109-16388 SNP. It is quite possible that this QTL contains multiple genes that individually affect the IDC phenotype. Although none of the other previously identified IDC QTL mapped in both populations, several are located near a marker–trait association we discovered in 2006. Simple sequence repeat Satt211 on Gm5 (Charlson et al., 2003) maps near two significant SNPs (Supplemental Table S1). All of the markers are within 1.2 Mbp of an ITP (iron transporting protein) gene. Simple sequence repeat Satt181 maps near a Gm12 SNP (BARC-030421-06864).
Recently O'Rourke et al. (2009) evaluated the expression pattern of Clark and Iso Clark lines under iron sufficient and iron limiting conditions. Although these two lines primarily differ by the allelic state of a QTL on Gm3 (Severin et al., 2010), many genes were found to be differentially expressed in each line between the two iron nutrition conditions, but only a small number exceed the critical 2x difference in expression. Only a small portion (7–10%) of the differentially expressed genes mapped to previously defined IDC QTL. This result further supports our discovery that IDC is a complex response and poses a major challenge for molecular geneticists as they attempt to discern appropriate genetic signals that can be used to develop marker tools that will efficiently select superior IDC tolerant genotypes.
Here we describe the application of phenotypic and genotypic information collected from a regional nursery to study the genetic architecture of IDC in soybean. This trial provided two populations that served as reciprocal confirmation populations for the discovery of loci associated with this important agronomic trait using genome-wide AM techniques. After correcting for population structure and relatedness, multiple loci were discovered that collectively accounted for much of the variation in the phenotype. Using the extensive knowledge base of the biology and genetics of iron metabolism from the model organism A. thaliana, we were able to discover that a number of genes involved in this physiological network were closely linked to markers shown to be significantly associated with IDC. With these results, this ongoing regional IDC trial will continue to provide additional genetic material to further confirm and refine the results we obtained here.
Supplemental Information Available
Supplemental material is available free of charge at http://www.crops.org/publications/tpg.
Supplemental Table S1: Marker-by-marker association mapping results.