About Us | Help Videos | Contact Us | Subscriptions

The Plant Genome - Original Research

A Long-Read Transcriptome Assembly of Cotton (Gossypium hirsutum L.) and Intraspecific Single Nucleotide Polymorphism Discovery


  1. Hamid Ashrafi *a,
  2. Amanda M. Hulse-Kempb,
  3. Fei Wangbc,
  4. S. Samuel Yangb,
  5. Xueying Guande,
  6. Don C. Jonesf,
  7. Marta Matvienkog,
  8. Keithanne Mockaitish,
  9. Z. Jeffrey Chend,
  10. David M. Stellyb and
  11. Allen Van Deynzea
  1. a Univ. of California–Davis, Dep. of Plant Sciences and Seed Biotechnology Center, One Shields Ave., Davis CA 95616
    b Texas A&M Univ., Dep. of Soil and Crop Sciences, College Station, TX 77843
    c current address, Monsanto Company, Molecular Breeding Technology, 700 Chesterfield Pkwy. West, BB5608-A, Chesterfield, MO
    d Institute for Cellular and Molecular Biology and Center for Computational Biology and Bioinformatics, The Univ. of Texas at Austin, Austin, TX 78712
    e current address, State Key Lab. of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural Univ., Nanjing, Jiangsu, 210095, China
    f Cotton Incorporated, Cary, NC 27513
    g Univ. of California–Davis, Genome Center, One Shields Ave., Davis CA 95616; current address, CLC bio, Cambridge, MA
    h Dep. of Biology, Indiana Univ., 915 E. Third St., Bloomington, IN 47405


Upland cotton (Gossypium hirsutum L.) has a narrow germplasm base, which constrains marker development and hampers intraspecific breeding. A pressing need exists for high-throughput single nucleotide polymorphism (SNP) markers that can be readily applied to germplasm in breeding and breeding-related research programs. Despite progress made in developing new sequencing technologies during the past decade, the cost of sequencing remains substantial when one is dealing with numerous samples and large genomes. Several strategies have been proposed to lower the cost of sequencing for multiple genotypes of large-genome species like cotton, such as transcriptome sequencing and reduced-representation DNA sequencing. This paper reports the development of a transcriptome assembly of the inbred line Texas Marker-1 (TM-1), a genetic standard for cotton, its usefulness as a reference for RNA sequencing (RNA-seq)-based SNP identification, and the availability of transcriptome sequences of four other cotton cultivars. An assembly of TM-1 was made using Roche 454 transcriptome reads combined with an assembly of all available public expressed sequence tag (EST) sequences of TM-1. The TM-1 assembly consists of 72,450 contigs with a total of 70 million bp. Functional predictions of the transcripts were estimated by alignment to selected protein databases. Transcriptome sequences of the five lines, including TM-1, were obtained using an Illumina Genome Analyzer-II, and the short reads were mapped to the TM-1 assembly to discover SNPs among the five lines. We identified >14,000 unfiltered allelic SNPs, of which ∼3,700 SNPs were retained for assay development after applying several rigorous filters. This paper reports availability of the reference transcriptome assembly and shows its utility in developing intraspecific SNP markers in upland cotton.


    cDNA, complementary DNA; dbSNP, Database of Single Nucleotide Polymorphisms; DPA, days postanthesis; E-value, expectation value; EST, expressed sequence tag; HSP, high-scoring segment pairs; InDels, insertion–deletion markers; KASP, Kompetitive Allele Specific Polymerase Chain Reaction; KEGG, Kyoto Encyclopedia of Genes and Genomes; NCBI, National Center for Biotechnology Information; nt, nucleotide; RNA-seq, RNA sequencing; SNP, single nucleotide polymorphism; SRA, Sequence Read Archive; SSR, simple-sequence repeat; TM-1, Texas Marker-1; TSA, transcriptome shotgun assembly

The many applications of molecular markers for in vitro, in silico, and in vivo research and genetic improvement justifies a penchant being placed on their development and characterization for organisms of all sorts, especially organisms of importance (e.g., as research models) for practical utility or as a source of economic benefits. Cotton, the world’s most important textile fiber, is grown in some 80 countries, including the biggest producers, India and China, followed by the United States, Pakistan, and Brazil (Westcott and Hansen, 2014). Although cotton is produced from four Gossypium species, upland cotton (G. hirsutum L.) is the primary source (90–95%). In the United States, the crop is grown from coast to coast, on 10 to 11 million acres, and produces raw products worth more than $5 billion of revenue annually, including fiber and seed (National Agricultural Statistics Service, 2015). As for all crops, there is a well-recognized need to concomitantly increase agricultural sustainability, productivity and economic yield (Bruinsma, 2011).

Given the species’ dominant role in domestic and global cotton production, the genetics, genomics and improvement of G. hirsutum cultivars are especially important. Worldwide efforts have culminated in the identification, mapping and use of large numbers of simple-sequence repeats (SSRs) and other types of markers, but until very recently, very few SNPs have been reported (Van Deynze et al., 2009). Though few SNPs were available at the time of synthesis, many of the available markers, especially SSRs, were recently assembled into a widely used consensus map by (Blenda et al., 2012).

The development of sequence-based markers in cotton is subject to effects of its genomic architecture, like other polyploids (Hulse-Kemp et al., 2014; Kaur et al., 2012). The species of origin for upland cotton, G. hirsutum, is a partially diploidized allotetraploid species, sensu (Comai, 2005), with genomic features that reflect additional levels of paleopolyploidy (Reinisch et al., 1994). Following polyploidization the genome has been shown to have undergone extensive and biased intergenomic nonreciprocal DNA exchanges that changed the genetic architecture of the nascent polyploid genome (Guo et al., 2014). Cytogenetic, genetic, and nucleic acid sequence data indicate that G. hirsutum (n = 26, [AD]1 genome) and the other AD genome (n = 26) Gossypium species evolved from a sexual polyploidization event 1 to 2 million years ago involving two now-extinct ancestral diploids that bore genomic similarities to the extant G. herbaceum (n = 13, A1 genome) and G. raimondii diploid species (Paterson et al., 2012; Wendel et al., 2010; Wendel and Cronn, 2003). The diploid ancestors thus had A- and D-like genomes, respectively. Descendants of the original AD polyploid species spread and diversified; at least five recognized AD-genome species remain today, with genomes designated (AD)1 to (AD)5, respectively (Wendel and Cronn, 2003). Recently, a new AD-genome species was reported (Grover et al., 2015).

Despite various efforts to sequence the (AD)1 genome of upland cotton, a publicly available genome sequence is yet to become available for any AD-genome cotton. However, two genome sequence assemblies of the diploid relative cotton G. raimondii (D5) have been released independently by two research groups (Paterson et al., 2012; Wang et al., 2012), and an A2 genome sequence assembly for G. arboreum was recently released publicly (Li et al., 2014). These genomic resources have fostered rapid progress in Gossypium genomics, including several recently reported large-scale SNP development efforts (Byers et al., 2012; Gore et al., 2012; Hulse-Kemp et al., 2014; Islam et al., 2015; Lacape et al., 2012; Rai et al., 2013; Zhu et al., 2014). A number of these SNP resources were contributed to an international consortium for possible inclusion in a SNP chip, which was released in early 2014 and consists of high-quality high-density SNP arrays, each containing ∼63,000 Infinium assays (http://www.illumina.com/applications/agriculture/consortia.html; Hulse-Kemp, 2014). Limited success has also been achieved in cotton genotyping-by-sequencing methods anchored to restriction sites (Gore et al., 2014; Islam et al., 2015). The usefulness of these new SNP resources has whetted the appetite for more Gossypium SNPs, especially ones directly applicable to research and breeding of tetraploid cotton.

In the current study, we report assembly of transcriptome sequences of TM-1 as a reference and its use for discovery of SNP markers among five G. hirsutum diverse US cultivar types: Texas Marker-1 (TM-1), FiberMax 832, Sealand 542, PD-1, and Acala Maxxa.

Materials and Methods

Plant Materials and Tissues

The seed of TM-1, FiberMax 832, Sealand 542, PD 1, and Acala-Maxxa were planted in the greenhouses of the Department of Soil and Crop Sciences at Texas A&M under standard conditions for G. hirsutum until adult stage. Texas Marker-1 has been used as a standard for cytogenetic and genetics and is an inbred derived from an obsolete cultivar, DPL 14 (Kohel et al., 2002). The other cultivars represent a fair amount of US breeding germplasm diversity but by no means all (Campbell et al., 2011; Lubbers et al., 2003; Van Becelaere et al., 2005). Texas Marker-1 (inbred from DPL 14) represents some mid-South types; FiberMax 832 represents some relatively modern Australian-bred cultivars; Sealand 542 represents a more basal (early-cycle) product from the Pee Dee Station in South Carolina, which captured some Sea Island (G. barbadense) fiber properties in eastern upland types (G. hirsutum); PD-1 represents a somewhat more advanced cycle (4th) of recurrent selection for wide-cross introgression of fiber quality factors into upland types (USDA–ARS & SC AES); and Acala-Maxxa represents Acala types, that is, western US cottons (Shafter Cotton Research Station, CA).

Construction of Illumina Sequencing Library for Five Upland Cotton Lines

Total RNA was extracted from sample of leaves, developing flowers, 5- to 10- and 25-d bolls (containing fibers and seeds) using the Qiagen RNeasy Mini Kit (Qiagen) per the manufacturer protocol. Aliquots were quantified using a NanoDrop spectrophotometer (NanoDrop) and checked for quality by electrophoresis using the Lonza FlashGel System FlashGel RNA Cassettes (Lonza Inc.). Illumina (Illumina, Inc.) RNA-Seq libraries were prepared using the manufacturer protocol. To reduce the highly abundant transcripts, the libraries were normalized using the normalization protocol described in (Matvienko et al., 2013). Briefly, the standard Illumina RNA-Seq libraries were denatured and rehybridized in NaCl and TMAC (3M tetramethylammonium chloride [Sigma–Aldrich]) buffers, and treated with duplex-specific nuclease (Evrogen) (Zhulidov et al., 2004) to digest the high-copy complementary DNA (cDNA) species. The resulting samples were reamplified using Illumina library primers. The libraries were sequenced using Illumina Genome Analyzer II or Hi-Seq 2000 (Illumina, Inc.) for 85 cycles in reverse and forward directions at the University of California–Davis Genome Center.

Construction of Roche 454 Sequencing Library for Texas Marker-1

An equimolar representation of total RNA extracted from 10 samples 0, 3, 5, and 10 d postanthesis (DPA) ovule (maternal, zygotic, and endospermic tissues), as well as 3, 5, 10, and 15 DPA fiber and root, was prepared. One microgram total RNA from TM-1 was used to prepare a partially normalized cDNA library optimized for Roche 454 Titanium sequencing (K. Mockaitis, unpublished data, 2009; Nguyen et al., 2013). Sequencing was performed on an XLR70 instrument (Roche, 454 Life Sciences) according to the manufacturer followed by EST Clean (https://sourceforge.net/projects/estclean/). Library preparation, sequencing, and data cleaning were performed at the Center for Genomics and Bioinformatics, Indiana University.

De Novo Assembly of Roche 454 and Sanger Reads

The Roche 454 data went through our standard preprocessing pipeline using Python scripts (sff_extract_0_2_8.py, available from www.github.com). Reads with low-quality scores were removed and FASTA files of 1,572,820 clean reads were generated. A total of 53,582 TM-1 EST sequences that were available in the GenBank in March 2011 were downloaded. The EST sequences were processed using Alex Kozik’s pipeline (http://cgpdb.ucdavis.edu/SNP_Discovery_CDS) to remove vector sequences and mask the repeats. After processing GenBank EST sequences, 32,576 EST sequences of TM-1 were retained. The 32,576 EST sequences were assembled using CAP3 software (Huang and Madan, 1999) as described in Stoffel et al. (2012) with 90% identity of 80 bp overlapping sequences. MIRA software package version 3.2.1 (http://sourceforge.net/projects/mira-assembler) was used to assemble Roche 454 reads with minimum two reads per contig and minimum 100 bp contig length. The percentage overlap between any two sequences was also set at 90% identity and minimum 20 bp. The assembled EST sequences (EST contigs), along with singleton EST sequences (unassembled EST sequences) and contigs from the MIRA assembly, were combined and a nonredundant set of sequences was obtained by VMATCH clustering software (ver 2.0) (http://www.vmatch.de/). The correction of Roche 454 contigs was carried out by mapping TM-1 Illumina reads to the contigs with high stringency (99% identity) and contigs bases were updated using CLC Workbench version 4.7 (CLC bio, 2011).

Data Availability

The TM-1 Roche 454 transcriptome assembly (only Roche 454 assembled contigs) was submitted to the National Center for Biotechnology Information (NCBI) transcriptome shotgun assembly (TSA) database under BioProject No. PRJNA203021 and TSA accession number GALV00000000.1. The Roche 454 reads that were used for the assembly were submitted under the same BioProject in the Sequence Read Archive (SRA) database, accession number SRR866887. The Illumina reads corresponding to TM-1, FiberMax 832, Sealand 542, PD 1, and Acala Maxxa were also deposited in the SRA database under accession numbers SRR1266110, SRR1266112, SRR1266138, SRR1266113, and SRR1266140, respectively. The SNPs were submitted to the Database of Single Nucleotide Polymorphisms (dbSNP) (NCBI, National Library of Medicine, Bethesda, MD). The dbSNP accessions ss1026504530 through ss1026506433 are available from the NCBI SNP database website. The in-house Perl scripts have been provided as one zip file in the Supplemental Materials of the manuscript.

Annotation of Assembly

The Blast2GO (Conesa et al., 2005) software was used to annotate the TM-1 assembly. Blast2GO involves three main steps: (i) Blastx of the nucleotide sequence against the nonredundant protein database (nr) of NCBI, (ii) mapping and retrieving GO terms associated with the Blastx results, and (iii) annotating GO terms associated with each query in order to relate the sequences to known protein function. Briefly, a Blastx search of contig nucleotide sequences against the nonredundant protein database (nr) of NCBI was performed under the default settings of Blast2GO and the Blastx expectation value (E-value) of 1.0 × 10−3 and maximum 20 hits, high-scoring segment pairs (HSP) length cutoff (default = 33) with low complexity filter was used. The GO terms associated with each Blastx hit were retrieved (mapping step) and GO annotation assignment (annotation step) to the query sequences was carried out using the following annotation score parameters: E-Value Hit Filter (default = 1.0 × 10−6), Annotation Cut-Off (default = 55), GO-Weight (default = 5), Hsp-HitCoverage CutOff (default = 0). In addition, contig sequences were queried for conserved domains or motifs using InterProScan, an online sequence search plug-in within Blast2GO program with all 13 applications selected. The resulting GO terms were then merged with the GO term results from the annotation step of Blast2GO. Kyoto Encyclopedia of Genes and Genomes (KEGG) maps for more than 130 metabolic pathways were generated with the KEGG extension of Blast2GO.

Alignment of the TM-1 Assembly to both G. raimondii and G. arboreum Genome Sequences

The TM-1 transcriptome assembly was aligned to both G. raimondii (Paterson et al., 2012) and G. arboreum (Li et al., 2014) genome sequences assemblies (v.2.2.1 and v.2, respectively) using the GMAP program (Wu and Watanabe, 2005). The alignments were filtered based on length of alignment (minimum 100, 300, 500 and 1000 nucleotides [nt]) and percentage identity (≥96%) between the query (TM-1 assembly) and the subject (reference genome). Both GFF3 (http://www.sequenceontology.org/) and BLAT (https://genome.ucsc.edu) file formats were created. The GFF3 files were added to Jbrowse at CottonGen database (Yu et al., 2013) for visualization of the alignments (http://www.cottongen.org/tools/jbrowse).

Structural Variation Analysis

CLC Workbench version 7.0 (CLC bio, 2013) was used to identify intrachromosomal structural variants in EST sequences and Roche 454-reads of TM-1 compared with G. raimondii (D5) (JGI 221 V2.2.1) (Paterson et al., 2012) as well as G. arboreum (A2) (v.2) genome sequences (Li et al., 2014). We mapped the reads of Roche 454 and EST sequences that were used in the TM-1 transcriptome assembly with the following parameters: Maximum number of locations that a read could map to, similarity, and length fraction of a read were set as 3, 0.96, and 0.8, respectively, with no global alignment. We also set mismatch, insertion, and deletion costs to be 2, 3, and 3, respectively. The resulting mapping file was then used for coverage and structural variation analyses, with p-value threshold <10−4, minimum number of reads = 2, and maximum number of mismatches = 3.

Identification of Putative Single Nucleotide Polymorphisms among Five Upland Cotton Lines

CLC Workbench version 4.7 was used to map (align) the reads of five G. hirsutum lines to TM-1 with the following parameters: read length fraction = 0.6 and similarity fraction = 0.96. Also, mismatch, insertion, and deletion costs were set as 2, 3, and 3, respectively (Table 1). The mapping results were exported as bam files (binary files) into a Linux server. SAMtools (Li et al., 2009) was used to find the variants between the reference sequence (TM-1 assembly) and any given line. The bam files were merged to create a merged bam file of all-cultivar bam files. The all-cultivar merged file had information of all variant positions across all reads of the five cultivars. The variant positions for each cultivar were then extracted from the merged bam file using pileup-cv command in SAMtools and saved as text files. The resulting pileup files including the all-cultivar pileup file were processed by Galaxy (https://main.g2.bx.psu.edu/) stand-alone filter pileup Perl script to eliminate insertion–deletion markers (InDels) or positions with coverage lower than three reads. The all-cultivars pileup file was used as the base file to be merged with five other pileup files using an in-house Perl script (Supplemental Material [In-house Perl scripts file]). Where a variant call information for a position existed in the all-cultivar pileup file and it was missing in one or more of the cultivars, the position in those lines was designated by an N. The resulting file contained information about the base call and coverage for all five cultivars for each variant position in the all-cultivar file. A custom script was used to filter putative SNPs based on the following criteria; at least two genotypes must have been in a homozygous state with different bases with minimum coverage of three reads each.

View Full Table | Close Full ViewTable 1.

Number of Roche 454 reads and expressed sequence tag (EST) sequences that contributed to the Texas Marker-1 (TM-1) assembly and number of Illumina reads of each G. hirsutum lines that were used for single nucleotide polymorphism discovery.

Accession Sequencing platform Number of raw sequences Number of sequences after filtering Average filtered read length Total bp Average filtered reads GC % Number (%) of Illumina reads mapped to TM-1 assembly
TM-1 Sanger (ESTs) 53,582 32,576 762.8 24,849,112 46.6
TM-1 Roche 454 Life Science 1,506,878 1,431,126 389.0 556,321,621 42.8
TM-1 Solexa Illumina GA 60,237,339 56,282,322 80.7 4,539,962,250 42.3 51,379,816 (91.3)
Acala Solexa Illumina GA 61,358,645 53,468,192 41.5 2,217,246,570 42.7 43,422,615 (81.2)
FiberMax 832 Solexa Illumina GA 56,429,774 52,257,485 40.8 2,132,930,095 42.9 42,391,547 (81.1)
Sealand 542 Solexa Illumina GA 50,752,848 50,458,314 41.9 2,112,540,196 43.2 41,479,666 (82.2)
PD-1 Solexa Illumina GA 46,266,096 45,278,905 42.2 1,910,254,843 42.8 36,578,793 (80.8)
These reads and sequences were used to make the TM-1 assembly.

Filtering and Categorizing the Putative Single Nucleotide Polymorphisms

Developing SNP assays that can accommodate different platforms such as TaqMan (www.appliedbiosystems.com) and Kompetitive Allele Specific Polymerase Chain Reaction (KASP) requires having 50 to 100 bp flanking sequence of a putative SNP. Eukaryotic transcriptome sequences suffer from lacking the information of intronic regions as a result the flanking sequence of a putative SNP from transcriptome assembly may not be representative of a true biological sequence. Single nucleotide polymorphism assays that have designed in the boundaries of intron–exon junctions perform poorly in the actual polymerase chain reaction of genomic DNA of a eukaryotic organism (Trick et al., 2009). In order to avoid this problem and to design SNP assays more amenable to DNA amplification, we filtered the SNPs that were in the close proximity of intron–exon junctions. Although intron content is not conserved, there is compelling evidence that intron–exon junctions’ positions are conserved across species (Fourmann et al., 2002; Van Deynze et al., 2007, 2009). The intron–exon junctions of the TM-1 transcriptome assembly were identified by intron finder program from SGN (http://solgenomics.net/). The plausible intron–exon junction positions were thereby determined for each contig. The putative SNP lists were queried against the intron–exon positions list. The SNPs in the neighborhood of 100 bases of intron–exon junction were eliminated.

To increase accuracy of the SNP assays, we further filtered SNPs that were in the close proximity of each other. The impetus for applying this filter was to design only primers that have nearly perfect match with the template DNA. The list of putative SNPs that were not in the vicinity of intronic regions was further filtered to remove any two or more SNPs in the neighborhood of 100 bases of each other. In other words, no two SNPs could exist closer than 100 bases of each other.

The final list of putative SNPs not being in the vicinity of introns and each other was used as an input for another in-house Perl script (Supplemental Material [In-house Perl scripts file]) to categorize them into three classes. The first class included only SNPs derived from contigs with no known homeo-SNP positions in the contig within our data set. The second class included only SNPs with one homeo-SNP in a position on the contig outside of 100 bases window on each side of the putative SNP position. The third class included putative SNPs in the vicinity of more than one homeo-SNP in the contig but with one homeo-SNP within the 100 bases on each side of the SNP position. The purpose in defining the third class of putative SNPs was to allow for design of two homeo-SNP specific primers instead of one common primer to span the homeo-SNP position, in order to identify genome specific SNPs. Therefore, in a KASP assay instead of three primers we used four primers for this class of SNP (see below). The purpose of classification of SNPs was to determine the effect of flanking subgenome specific marker in the vicinity of a putative SNP. As found by Zhu et al. (2014) and in this study, markers free of varietal SNPs within their 20-bp flanking sequence are the most useful markers for assay development.

Identification of Putative Simple-Sequence Repeats

The assembled sequences were used to identify signatures of SSRs. FASTA file containing all the assembled sequences were used as an input file in MISA Perl script (Thiel et al., 2003) to specify the minimum number of the following repeats for microsatellites (unit size, minimum number of repeats): (2,6) (3,5) (4,5) (5,5) (6,5). MISA has the capability of predicting perfect (SSRs with no interruption) and compound SSRs (SSRs with a spacer sequence). The variable to specify the maximum length of the spacer sequence was set as 100 bp in the MISA setup file. The identified SSRs were aligned against (Blastn) all SSRs that to date have entered into CottonGen database (Supplemental File S12).

Primer Design for Simple-Sequence Repeats

The PRIMER3 software (Rozen and Skaletsky, 2000) was used to design forward and reverse primers flanking the SSR containing sequence. An accompanying Perl script of MISA software (p3_in.pl) was used to make the input file for PRIMER3. A second accompanying Perl script of MISA software (p3_out.pl) was used to parse the output file of PRIMER3 into a user-friendly output. The target amplicon size was set as 100 to 300 bp, with optimal annealing primer temperature of 60°C and optimal primer length as 20 nt.

Primer Design for Single Nucleotide Polymorphisms

The BATCHPRIMER3 software (Rozen and Skaletsky, 2000) was used to design allele-specific and common primers for LGC KASP assays (LGC Genomics) according to parameters used in Hulse-Kemp et al. (2014). A set of primers was designed to screen a random set of markers from each SNP categorization. Primers were screened on a panel containing 33 G. hirsutum cultivar lines, G. barbadense line 3-79, and two nontemplate water controls. Markers were scored using the KlusterCaller software to determine if definable and scorable clusters were obtained (good); if not, the marker was deemed not clustered (Supplemental File S1).


De Novo Assembly of Transcriptome Sequences of Texas Marker-1

The TM-1 transcriptome assembly consists of 72,450 nonredundant contigs (Supplemental File S2) including 63,356 (Supplemental File S3) Roche 454 contigs (Tm1_Mira454_Contig_XXXX), 3733 CAP3-assembled Sanger-EST (Tm1_EST_Contig_XXXX), and 5361 unassembled (singletons with GenBank IDs) Sanger-EST sequences from Genbank, for a total of approximately 70.74 million nt and an N50 value of 1100 bp (Supplemental Fig. S1). We created two FASTA files, one listing all nonredundant contigs as described above, hereafter referred to as the full set; and one listing all contigs remaining after removal of the Sanger-EST sequences, hereafter referred to as the reduced set. Having these two files enabled comparisons between the full set and the reduced set in order to calculate the number of public Sanger-ESTs that complement the Roche 454 sequencing data. We submitted only the reduced set assembly into the TSA database of GenBank, which restricts submissions to nonredundant sequences. The reduced set (Roche 454 only) contains 63,356 contigs covering 62.72 million nt and has an N50 value of 1170 bp (Table 2). The full set assembly was used as a reference sequence to map Illumina reads of five G. hirsutum lines including TM-1 as well as other Gossypium species (Hulse-Kemp et al., 2014). Between 80 and 91% of Illumina reads were mapped to the TM-1 full set assembly (Table 1). The quality of assembly was verified by two means. First we selected 11 contigs >5 kb, and tBlastx against all protein database of Gossypium at CottonGen database. All contigs had a hit with similarity between 98 and 100% (data not shown). Second we ran the program TagDust (http://www.ncbi.nlm.nih.gov/pubmed/19737799) to check our reads for presence of artifactual reads. After running the program for 454 reads, we did not detect any reads from adapters or other contaminants. We also ran Blastn program for our contigs against the nonredundant sequences of NCBI and did not detect any traces of microorganismal DNA.

View Full Table | Close Full ViewTable 2.

Comparison of Texas Marker-1 (TM-1) assembly with (full) and without (reduced) public expressed sequence tag (EST) sequences from GenBank.

Statistics Full assembly Reduced assembly
Number of unigenes 72,450 63,356
Total assembled nucleotides 70,743,280 62,726,731
Average GC content (%) 42.2 41.9
Longest contig length 13,697 13,697
Smallest contig length 101 101
Average contig length 976 990
Median contig length 859 846
Numbers of contigs, by size
 <1 Kb 45,659 38,204
 1-2 Kb 23,785 22,197
 2-3 Kb 2,719 2,670
 3-4 Kb 246 244
 4-5 Kb 31 31
 5-10 Kb 9 9
 10-20 Kb 1 1

Annotation of Texas Marker-1 Transcriptome Assembly

The three steps of Blast2Go annotation of the TM-1 transcriptome assembly are presented in Supplemental Fig. S2. A total of 65,080 contigs (∼90%) with an average length of 1024 nt and at least one significant alignment with a protein in the nonredundant database of GenBank (Accessed on 4 Mar. 2014). These contigs covered 66.47 Mb (94%) of the total assembly. However, the 7370 (∼10%) contigs that did not have a hit to any sequence in GenBank were on average 579 nt long and covered 4.27 Mb (6.0%) of the total assembly. The mapping step of Blast2GO identified 65,080 (90%) contigs with GO terms. Significant amounts of mapping data (81.2%) were derived from the UniProKB database (http://www.uniprot.org/help/uniprotkb), followed by TAIR (17.9%) (http://www.arabidopsis.org) and GR_Protein (0.8%) (http://archive.gramene.org/protein/). In addition, six other databases were searched (listed in Supplemental Fig. S3), but in total they contributed annotation to only 0.18% of the mapping data. Between 0 and 81 GO terms were assigned per sequence with a weighted average of five GO terms per contig (Supplemental Fig. S4). The frequency of GO terms for shorter sequences (<2.5KB) was more than that of longer sequences. The percentage of annotated sequences increased proportionally with their length and decreased at or about 1Kb, however sequences longer than 4Kb were 100% annotated (Supplemental Fig. S5). As expected, the majority of annotations were inferred electronically compared with direct assays (14%) (Supplemental Fig S6; Supplemental File S6). By counting all significant hits in the Blastx result table, Vitis vinifera L., Populus trichocarpa Torr. & A. Gray, and Cucumis sativus L. were the top three species in terms of hit number (Supplemental Fig. S7,S8). InterProScan, Annex, and GO annotation query through more than 16 databases significantly increased the number of GO terms by 18% (13,275). In total, 62,106 (86%) contigs had at least one GO term. In addition, 11,884 (16%) of the contigs were predicted to be enzymes (with enzyme codes). Direct GO count graphs were created to categorize the sequences based on their biological processes and molecular functions as well as their cellular component. Based on their biological processes, sequences involved in cellular process, metabolic process, and response to stimulus had the maximum frequencies. In terms of molecular function, nucleic acid binding elements comprised the highest numbers of sequences in the TM-1 assembly, followed by catalytic activity and transported activity elements. Cellular component constituents of membrane-bounded intracellular organelle, cytoplasm and cytoplasmic part, and plasma membrane were among sequences that had the maximum numbers in the assembly (Supplemental Fig. S9–S11).

Alignment of the Texas Marker-1 Assembly to Diploid Cotton Reference Genomes

Gossypium raimondii Genome Sequence

A total of 70,697 unique contigs out of 72,450 contigs of the TM-1 transcriptome assembly had at least one hit to the G. raimondii genome sequence to generate a redundant hit list of 82,421 hits. As previously shown (Flagel et al., 2012), we found that a high stringency is necessary to discern the contigs that are only originating from the A or D genome in the TM-1 assembly; the list of aligned contigs was further filtered based on ≥96% of identity of a given contig to the reference sequence. The copy numbers of aligned contigs varied from one to five on the G. raimondii genome (Fig. 1). As Fig. 1 depicts, the majority of shorter sequences (100–500 nt) had one single copy in the D-genome assembly. Increasing the length of alignment from 500 to 1000 significantly reduced number of hits by >50%. Interestingly, contigs with five copies are more frequent than contigs with four copies. The distribution of 49,678 (70.4% of all alignments) alignments with length > 500 bp and known chromosomal position across G. raimondii genome were visualized to have a better understanding of the distribution of TM-1 transcriptome contigs across the D genome (Fig. 2). Only 146 (0.3%) contigs in this dataset were mapped to scaffolds with unknown chromosomal location (data not shown). Considering the fact that cotton chromosomes are mainly meta- and sub-metacentric (Sheidai et al., 2008; Wang et al., 2008), contigs were more concentrated in both ends of each chromosomal arm with some degree of skewness on chromosomes 6, 7, and 11 (see Supplemental Files S5 and S6 for BLAT and GFF file formats of the alignments).

Figure 1.
Figure 1.

Mapping TM-1 contigs on G. raimondii genome (D5) sequence. The graph depicts the relationship between copy number of TM-1 transcriptome assembly contigs aligned to the G. raimondii genome at different filtering levels. The smaller the length of the sequence, the more hits on the genome to a maximum of five hits.

Figure 2.
Figure 2.

Density of TM-1 long-read transcriptome assembly across the 13 chromosomes of G. raimondii. Each panel depicts the distribution of TM-1 fragments and contigs >300 bp aligned to a different chromosome of the G. raimondii genome.


Gossypium arboreum Genome Sequence

A total of 70,804 contigs out of 72,450 contigs of the TM-1 transcriptome assembly had at least one hit to the G. arboreum genome sequence to generate a redundant hit list of 83,270 hits from the smallest 10 bp alignment to maximum 12,986 bp with 16 exonic regions. This list was further filtered based on ≥96% identity between a given contig to the reference sequence and then filtered based on the length of aligned fragment from minimum 100 to 1000 bp (Fig. 3). As Fig. 3 depicts, the majority of alignments have one copy with short length. As the length of alignment increased from 100 to 1000 bp, the number of segments of contigs aligned to the genome decreased. Similar to G. raimondii alignment, the number of five-copy segments was slightly higher than that of four-copy alignments. Distribution of the 50,479 segments that aligned on G. arboreum genome is depicted in Fig. 4. Although the percentage of all TM-1 contigs that aligned to the A2 genome (69.7%) was similar to the D5 genome, the distributions of aligned fragments along chromosomes of the A2 genome assembly differed markedly from that of the D5 genome assembly (Fig. 2), that is, the distribution is more random along 13 A2 genome pseudomolecules, with alignments being relatively numerous in central regions of pseudomolecules than observed for the D5 pseudomolecules (see Supplemental Files S7 and S8 for BLAT and GFF file formats of the alignments).

Figure 3.
Figure 3.

Mapping TM-1 contigs on G. arboreum genome (A2) sequence. The graph depicts the relationship between copy number of TM-1 transcriptome assembly contigs aligned to the G. arboreum genome at different filtering levels. The smaller the length of the sequence, the more hits on the genome to a maximum of five hits.

Figure 4.
Figure 4.

Density of TM-1 long-read transcriptome assembly across the 13 chromosomes of G. arboreum. Each panel depicts the distribution of TM-1 fragments and contigs >300 bp aligned to a different chromosome of the G. arboreum genome.


Comparative Analysis to the Two Alignments

Among 72,450 contigs in the assembly, 71,688 unique contigs mapped to the A2 or D5 genomes. This includes all hits before filtering for length or percentage identity. From among 71,688 contigs that mapped in both reference sequences, 69,812 were common between the two mapping efforts. A total of 1876 contigs were singlets from which 884 mapped only to D5 and 992 mapped to A2 reference sequences. We performed the same analysis on filtered data with 96% identity and minimum 500-bp long mapped fragments (Fig. 2,4). From among 56,639 contigs that mapped to both A2 and D5 genomes, 41,872 were mapped in both genomes and the remaining 14,767 contig were mapped to A2 (7188) or D5 (7579) genomes uniquely. The use of unfiltered data overestimates the number of contigs that map to each genome due to mapping small fragments of every contigs to the genome. However, filtered data discriminated the number of contigs that map to each genome with a higher accuracy and more efficiently separated the data according to similarity with the A2 and D5 genome contigs. The common contigs as well as those that were found to be unique to the A2 or D5 genome for this (filtered) analysis have been listed in Supplemental File S9.

Alignment of the Texas Marker-1, Roche 454 Reads, and EST Sequences to G. raimondii Genome Sequence

The reference D5 genome consists of 754.0 Mb with 37,505 genes and 77,267 transcripts with an average 1850 nt per protein coding sequences. In the D5 genome, over 22,000 genes have only one transcript while the remaining genes (∼15,000) have more than one transcriptome form. A total of 1,388,296 (94%) of input reads aligned to 13 pseudomolecules and 120 unplaced scaffolds of G. raimondii genome sequence. A total of 95,054 (6%) nonspecific reads did not map. Almost one third (35%) of reads mapped without any gap, while the rest (65%) had gaps (Table 3). Our transcriptome data covered 6% of the D5 genome. Excluding noncovered regions, the average coverage was 12.17 ± 22.97 reads. Since the coverage cannot be negative, the large standard deviation indicates that there was a large variation among locations in the numbers of reads that mapped, which is expected from RNAseq data and from biology of the genes.

View Full Table | Close Full ViewTable 3.

Mapping statistics of Roche 454 Life Science reads and expressed sequence tag (EST) sequences to the G. raimondii and G. arboreum genome assemblies.

G. raimondii (D5)
G. arboreum (A2)
Number of reads Percentage Number of reads Percentage
Mapped reads 1,368,723 93.5 1,342,674 91.7
Ungapped reads 476,562 32.6 465,423 31.8
Gapped reads 892,161 60.9 877,251 59.9
Unmapped reads 95,054 6.5 121,103 8.3
Invalid match 48,794 3.3 47,368 3.2
No match 46,260 3.2 73,735 5.0
Total reads 1,463,777 100.0 1,463,777 100.0
Reference length: G. raimondii = 753,951,485 (753 Mb) and G. arboreum = 1,532,037,843 bp (1.53 Gb).
The unmapped read list contains the reads that the large gap mapper was not able to map. The reads for which the large gap mapper was able to find a mapping, but for which the mappings of the segments where incompatible, are put in the invalid mapped reads list. The mappings of the segments of a read are incompatible if their positions are not consecutive along the reference or if they do not have the same direction.

Alignment of the Texas Marker-1, Roche 454 Reads, and EST Sequences to G. arboreum Genome Sequence

A total of 1,342,674 (92%) of reads including Roche 454 and public EST sequences were aligned to the 13 pseudomolecules of G. arboreum genome sequence. Unmapped reads accounted for 8% of Roche 454 sequences, similar to the above results for G. raimondii (Table 3). The A2 draft genome consists of 1.53 Gb with 39,522 coding sequences. Total consensus length of the mapped reads was 43,945,455 bp covering 3% of the cotton A2 genome. This result is very similar to the D5 genome. Excluding noncovered regions, the average coverage was 12 reads with a large standard deviation.

Structural Variation Analysis

We chose a high stringency level (80% of the read length with 96% sequence identity) to be able to map the reads to the D5 genome from among the blend of A2 and D5 genomes (G. hirsutum). Within the genic regions, we identified 15 deletions of various sizes that ranged from 65 to 440 bp on eight out of 13 chromosomes of the D5 genome (Supplemental Fig. S12a–m). We did not identify any major insertions or inversions. With regard to alignment of the Roche 454 reads to the A2 genome with identical stringency to D5 genome mapping, we identified evidence for deletion of a 405 bp region in two locations on chromosome 3 in proximity of 93,000 bp from each other and evidence for an insertion of a 177 bp region on chromosome 6. In addition, we identified 30 InDels on nine chromosomes (all but chromosomes 4, 5, 8, and 11) varying in size from 4 to 150 bp.

Putative Single Nucleotide Polymorphism Identification

A total of 14,611 putative unfiltered SNPs in 9152 contigs were identified, that is 1.5 ± 0.5 SNPs per contig (Supplemental File S10). The SNPs included a total of 8619 transition events (AG/GA, CT/TC) and 5992 transversion events (GC/CG, AT/TA, AC/CA, GT/TG), that is, a transition/transversion ratio of 1.44. The 9154 contigs deemed to have SNPs collectively comprise 9650 Kb of the G. hirsutum TM-1 transcriptome. Filtering putative SNPs within 100 bases from each side of intronic regions removed 3518 SNPs, and the remaining 11,096 SNPs were further filtered for the presence of other adjacent SNPs in 100 bases in each side of the SNP. After filtering, a total of 5618 SNPs were remaining that were used for filtering against genome-specific SNPs (homeo-SNPs) and other types of nonallelic SNPs. Homeo-SNPs were defined at positions where complementary homozygous alleles were not represented and three to five cultivar genotypes were heterozygous. For instance, a combination of A-A-R-R-R genotypes was considered a homeo-SNP while an A-G-R-R-R combination was considered as a putative SNP (R is IUPAC code for A/G). Based on this definition, putative SNPs were sorted in three classes (See Materials and Methods). Class I consists of 1669 SNPs, Class II had 1902 SNPs, and Class III 1008 SNPs. Where it was possible, we extracted 100 bases on each side of putative SNPs (201 nt). All SNPs were screened to be 100 bases apart from the first position on the contigs. For the SNP positions located <100 bases from the end of contigs, extracting 100 bases flanking sequence was not possible, which reduced the number of SNPs to 1372 Class I, 1557 Class II, and 784 Class III. We aligned the flanking sequence of pooled SNPs (3713 SNPs) to the A2 and the D5 genome references using the Blastn program of BLAST (Altschul et al., 1990). The Blastn results were filtered to select hits with ≥99% identity to reference with ≥199 perfect matches. A total of 773 SNPs hit the A2 genome and 772 SNPs hit the D5 genome sequence with some level or redundancy within each list. Out of 1505 nonredundant hits to both A2 and D5 genomes, 124 SNPs were common in both genomes, while the remaining 1257 were specific to A2 (649) or D5 (608) genome sequences (Supplemental File S11).

Validation of Putative Single Nucleotide Polymorphisms

For Class I and Class II, a random set of 29 and 25 SNPs, respectively, were selected (Supplemental File S1). Assay primers were designed and used in a KASP assay (http://www.lgcgenomics.com) on a panel of genotypes including the five G. hirsutum lines that were used for SNP discovery. Class I had the highest validation rate of 21 out of 29 (72.4%) followed by Class II with four out of 25 (19%). In silico-defined Class III SNPs have performed poorly in previous validation experiments run in our laboratory (D.M. Stelly, unpublished data, 2013), therefore we eliminated the Class-III SNPs defined here from subsequent validation experiments.

Identification of Putative Simple-Sequence Repeat Markers in Texas Marker-1 Transcriptome Assembly

From among 63,356 Roche 454 nonredundant contigs, 3920 (6.2%) unigenes contained SSRs, from which 371 unigenes bore more than one SSR marker signature. A total of 4165 SSR markers with simple repeats and 179 SSRs with compound formation were identified. Using PRIMER3 software, we were able to design primers for 1698 putative markers in the assembly. In the TM-1 assembly, dinucleotide AG/CT was the most frequent SSR motif, followed by AT/TA and AC/GT. The trinucleotide motif AAG/CTT was the most frequent in the assembly. Longer motifs such as tetra- and pentanucleotide motifs were less frequent than di- and trinucleotide motifs (Supplemental Table S1; Supplemental File S12). Where possible, the primers were designed for SSRs in the assemblies (Supplemental Table S2; Supplement File S12). The identified putative SSRs were compared with all 65,412 SSR markers in CottonGen (http://www.cottongen.org/). A total of 115 hits with >90% identity were identified. Due to redundancy, the 115 hits correspond to 81 markers in CottonGen database from which 42 were previously mapped (Supplemental File S12; Supplemental Table S3–S5).


We report here a transcriptome assembly of G. hirsutum cultivar TM-1 that is based on Roche 454 sequencing and public EST sequences of the same cultivar that are available in GenBank (NCBI). A major impetus for creating the assembly was to have a more or less comprehensive transcriptome assembly that would suitable as a reference for mapping and analysis of short-read RNAseq data for SNP discovery. The relatively long-read technology of the Roche 454 was considered superior for this purpose to short-read technologies available at that time (e.g., Solexa). The publically available Sanger transcriptome sequences were used to augment the assembly effort. The G. hirsutum transcriptome assembly presented here consists of 72,450 nonredundant contigs. The average length of contigs (976 nt) was very close to that of a recently reported transcriptome assembly of cotton (Rai et al., 2013). Despite availability of genome sequences of G. raimondii (D5) (Paterson et al., 2012) and G. arboreum (A2) (Li et al., 2014), the transcriptome assembly of G. hirsutum is important in many aspects. As we showed in this study, it can be used as a reference sequence to identify intraspecific SNPs and SSRs. It can also be used to identify candidate genes related to important traits. The GO annotation of the transcriptome assembly showed a high percentage of the contigs have at least a similar hit in the GenBank; this indicates that the assembly has high quality with minimum redundancy. The sequences that did not have any match in the database are likely novel transcripts that are unique to cotton, the cotton family has not been annotated, or data was not available at NCBI at the time of analysis. These sequences were short and constitute a small portion of our assembly.

We chose to assemble transcriptome sequences de novo using long Roche 454 and Sanger reads from multiple tissues important to cotton breeding from a single genotype, TM-1. To that end, we used Illumina reads of TM-1 to exclude the reference bases that were inconsistent between reference and consensus Illumina base (circumventing 454 wrong base call). Although genomic sequences are available for D5 and just recently for A2 genomes, there are no currently available assembled genomic sequences for allopolyploid G. hirsutum (AD)1. In previous work, we have shown that 95% of sequences derived for five D genome species map to the D5 genome (Paterson et al., 2012) and only 68% of sequences from the A2 genome diploids map to the D5 genomic sequences (Page et al., 2013). By mapping G. hirsutum to a single diploid genome, one would significantly bias results towards the homologous genome. Furthermore it would be presumptuous to assume that after 1 to 2 million years after polyploidization the G. hirsutum genomes would be accurately represented by their putative diploid progenitors, much less extant relatives, for example, D5- and A2-genome species. See Table 3 for summary of reads mapped to A, D, both, and neither genomes under stringent conditions. Furthermore, an assembly (rather than individual reads) of a deep transcriptome is the basis of de novo gene model development and annotation of genomes. The resources provided in this paper will serve to annotate the G. hirsutum genome.

Since our SNP discovery pipeline was independent of reference base call, the additional step to verify the accuracy of reference base had two advantages. First, it reduced the error of miscalling a position a putative SNP by making sure that the base call of both reference and consensus of Illumina reads are identical. Second, it provided us with additional confidence for the accuracy of the SNP call to extract the flanking sequence from the assembly in order to generate a more reliable assay. Our SNP validation experiment showed that the SNPs that were not in the vicinity of intronic regions or homoeologous SNPs were more reliable than those in the neighborhood of 100 bases of those positions.

We aligned both assembled reads (TM-1 assembly contigs) and unassembled reads (TM-1) to the G. raimondii and G. arboreum genomes. Alignment of assembled reads to an assembly using programs like GMAP will result in an output that contains a wide range of alignment sizes with various degrees of similarities between the query reference sequences. We filtered the results very stringently according to percentage of identity between the aligned transcriptome and the reference genome as well as considering only long alignments. We speculate the contigs that did not align well (29.6%) with the D5 genome are specific to the A subgenome or their sequence did not exist in the D5 genome assembly. Likewise, reads that did not match exactly to the A2 genome sequence (30.3%) could belong to the D subgenome. Mapping the unassembled transcriptome reads to both genomes is another way to discern the A subgenome reads from the D subgenome reads. By increasing the alignment stringency one can distinguish the reads of each subgenome in an organism with higher ploidy level such as cotton. Mapping Roche 454 reads to the D5 genome assembly enabled us to identify structural variants in the D5 genome compared with the TM-1 transcriptome. We identified two and 15 relatively large deletions in genic regions of the A2 and D5 genomes respectively, indicating that these regions most likely were deleted from G. hirsutum after polyploidization.

Cultivated G. hirsutum is recalcitrant to SNP identification efforts due to its narrow germplasm base and its allotetraploid nature. Despite this fact, and a sampling of just five cultivar transcriptomes, we were able to identify >14,000 unfiltered and >3500 filtered SNPs. We also validated Class I and II SNPs. Within a subset of Class I SNPs the validation rate was 72%, a high validation rates for intraspecific G. hirsutum SNPs. We used one pooled Roche 454 cDNA library for transcriptome assembly with 1.5 million reads compared with another study (Rai et al., 2013) in which they used two genomic libraries of each six diverse G. hirsutum genotypes that resulted in generating >14 million Roche 454 reads. Having a larger dataset, Rai et al. (2013) were able to identify over 66,000 nonredundant SNPs among the six lines that they used within the genic and nongenic sequences, though the validation rate was not specified (Zhu et al., 2014). The number of putative exonic SNPs that they identified was close to the total number of putative SNPs that we identified, however we used a very stringent filtering criteria by which three-fourths of SNPs were eliminated. They also observed a higher ratio of transition to transversion (1.9) in their entire dataset likely due to the inclusion of genomic sequences. Intraspecific SNPs were also identified between two G. hirsutum genotypes (Acala-Maxxa and TX2094) using an EST assembly (Byers et al., 2012). They identified 3319 putative SNPs in contigs where homoeologs did not coassemble and 1009 putative SNPs in coassembled contigs. The validation rate of these SNPs was about 31%, based on segregation a 1:2:1 or 3:1 ratio in an F2 population of a cross between Acala-Maxxa and TX2094.

We investigated the presence of SSRs in the TM-1 assembly because SSRs are still important, highly portable markers that continue to be used extensively in a number of laboratories (Blenda et al., 2012; Jena et al., 2012; Li et al., 2012). Similar to our study, both Rai et al. (2013) and Byers et al. (2012) identified SSR motifs in their assemblies. Byers et al. (2012) and Rai et al. (2013) identified ∼1300 and 136,522 SSRs markers with two to five repeat motifs, respectively. Considering the size and nature of our assembly, the number of SSRs (4165) that we identified was more than three times the number identified by (Byers et al., 2012) and far less than the number identified by (Rai et al., 2013). In all studies, dinucleotide repeats were most abundant followed by trinucleotide repeats. We have not experimentally assessed these SSRs as markers, but with continuous improvements in genome assemblies and maps, it is likely any researcher can make use of them in a strategized manner by selecting according to sequence or map locations prior to testing their effectiveness.


Transcriptome-derived SNPs can meet at least some of the need for large numbers of intraspecific markers in upland cotton breeding and research programs. Our approach demonstrated that assembly and mining of the cotton transcriptome for intraspecific SNPs were reasonably feasible and affordable. Without the benefit of an assembly for G. hirsutum genome, our strategy was to develop a transcriptome assembly from long-read (454) sequencing that could serve as a common reference for mapping short-reads from high-throughput sequencing. With additional analysis of mapped reads, this strategy enabled thousands of high-quality markers to be identified. Many of the markers generated in this study have recently been incorporated into a new Illumina SNP chip and will therefore be widely used by the research and breeding communities to, as an example, enable marker-assisted selection and breeding programs around the world.

Supplemental Materials

The manuscript is supplemented with 10 MS Excel (xlsx format) and two FASTA format files to help readers have full access to our data. There is also a power point file (pptx format) for supplementary figures and a zip file for all in-house Perl scripts.


Portions of this study were financially supported by Cotton Incorporated (13-626 and 13-694), Texas AgriLife (Project 7036), and the National Science Foundation Plant Genome Research Program (DBI1025947). We thank Zach Smith and James Ford (IU) for generating the Roche 454 library and sequencing instrument runs, respectively. We thank the sequencing facility of UC Davis Genome Center and Bioinformatics core facility for providing servers and computational power. We also thank the AgriGenomics Laboratory at TAMU for access to the KASP SNP genotyping pipeline.




Be the first to comment.

Please log in to post a comment.
*Society members, certified professionals, and authors are permitted to comment.