
The Plant Genome - Article



This article in TPG

  1. Vol. 3 No. 2, p. 69-80
    Received: Apr 23, 2010

    * Corresponding author(s): jose.gonzalez@sdstate.edu


Investigation of the Transcriptome of Prairie Cord Grass, a New Cellulosic Biomass Crop

  1. Kristene Gedye,
  2. Jose Gonzalez-Hernandez,
  3. Yuguang Ban,
  4. Xijin Ge,
  5. Jyothi Thimmapuram,
  6. Fengjie Sun,
  7. Chris Wright,
  8. Shahjahan Ali,
  9. Arvid Boe and
  10. Vance Owens
  1. K. Gedye, J. Gonzalez-Hernandez, A. Boe, and V. Owens, Department of Plant Sciences, South Dakota State Univ., Brookings, SD 57007; Y. Ban and X. Ge, Dep. of Mathematics and Statistics, South Dakota State Univ., Brookings, SD 57007; J. Thimmapuram, F. Sun, and C. Wright, W.M. Keck Center for Comparative and Functional Genomics, Roy J. Carver Biotechnology Center, Univ. of Illinois at Urbana-Champaign, Urbana, IL 61801; S. Ali, Biosciences Core Lab.-Genomics, King Abdullah Univ. of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia.


Prairie cordgrass (Spartina pectinata Bosc ex Link) is being developed as a cellulosic biomass crop. Development of this species will require numerous steps, including breeding, agronomy, and characterization of the species genome. This paper describes the first investigation of the transcriptome of prairie cordgrass by next-generation sequencing technology (454 GS FLX). A total of 556,198 expressed sequence tags (ESTs) were produced from four prairie cordgrass tissues: roots, rhizomes, immature inflorescence, and hooks. These ESTs were assembled into 26,302 contigs and 71,103 singletons. From these data, EST–SSR (simple sequence repeat) regions and cell wall biosynthetic pathway genes were identified that are suitable for the development of molecular markers, which can aid the breeding of prairie cordgrass by means of marker-assisted selection.


    AFLP, amplified fragment length polymorphisms; bp and Mb, base pair and megabase, respectively; EST, expressed sequence tags; GO, gene ontology; PAGE, polyacrylamide gel electrophoresis; PCG, prairie cordgrass; PCR and ePCR, polymerase chain reaction and electronic PCR, respectively; nt, nucleotide; SSR, simple sequence repeat

Prairie cordgrass (PCG) is a native grass species of the North American Prairie and has recently gained attention as a species suitable for the production of cellulosic biomass for producing biofuel (Gonzalez-Hernandez et al., 2009). PCG is a warm season, C4 grass with a geographical range extending from Texas to the Arctic Circle (USDA-NRCS, 2008) and having the highest latitude of distribution of any C4 species (Potter et al., 1995). PCG has been utilized ecologically for revegetation, stream bank stabilization, and habitat development. In addition, this species has been used as a forage species (Boe et al., 2009). PCG is able to survive in a wide variety of environmental conditions, from open arid prairies and high railroad embankments to wet areas with high salinity and insufficient aeration of the soils where neither switchgrass (Panicum virgatum L.) nor maize (Zea mays L.) can be grown to their full potential (Boe et al., 2009). The inherent tolerance of PCG to these environmental conditions will allow the production of cellulosic biomass from land that is unsuitable for conventional crop production.

On the basis of these favorable growth characteristics, research attention is being directed toward the development of PCG as a cellulosic biomass crop species. Indubitably, genetic improvements will be made to PCG to enhance this species for utilization as a source for biomass. These genetic changes could occur via conventional breeding and marker assisted selection breeding or by incorporation of favorable genes (or characteristics) through plant transformation techniques. To facilitate any genetic improvement in PCG, comprehensive knowledge of the genome and its gene expression profiling is essential. Limited genetic studies have been performed in PCG to date. Genetic diversity of populations of PCG with amplified fragment length polymorphisms (AFLP) has been studied (Moncada et al., 2007; J. Gonzalez, unpublished data). The development of molecular markers for breeding can be achieved rapidly by utilizing techniques such as enriched genomic libraries and parallel pyrosequencing techniques. Both of these methods can provide considerable amounts of information about the genome structure of this species. In PCG, genomic libraries enriched for SSRs have been developed and utilized to compile a linkage map (J. Gonzalez, unpublished data).

Many variables will need to be quantified and potentially altered to ensure optimal production of biofuel from PCG. The identification of beneficial alleles associated with these variables, via methods used in more thoroughly studied crop species (e.g., wheat, Triticum aestivum L., and rice, Oryza sativa L.), may not be feasible or viable in PCG. Attention may instead be turned toward comparative and functional genomics approaches to identify putative genes in PCG. To achieve this goal, the relatedness between PCG and other model grass species needs to be established. Such studies have been conducted previously by means of expressed sequence data in switchgrass, another species of interest for biofuel production, and in the model system of sorghum [Sorghum bicolor (L.) Moench] (Tobias et al., 2008).

The objective of the work presented in this paper was to investigate the transcriptome of PCG. This analysis provides the first snapshot of the PCG transcriptome. It was achieved by sequencing cDNA isolated from four PCG tissues (root, rhizome, hook, and immature inflorescence) in a 454 GS FLX sequencer. The sequencing of the PCG cDNA provided a library of ESTs, which was assembled de novo to produce consensus sequences (contigs) with minimal redundancy. This paper describes an initial analysis of the transcriptome library, detailing the identification of molecular markers (SSRs and EST–SSRs), the mining of gene sequences (specifically the lignin biosynthetic pathway), and a comparative genomic analysis against other species [S. bicolor, Z. mays, P. virgatum, sugarcane (Saccharum sp.)], the Poaceae family, and a nonredundant database.


Plant Material

From a collection of PCG germplasm, three lines were selected for RNA extraction. These lines were grown in standard greenhouse conditions at South Dakota State University (Brookings, SD). From one line (designated RR21), sections of root, rhizome, and “hooks” were collected, while from two other lines (designated Sp3.1A and Sp13.4A), immature inflorescences were collected. Line RR21 was selected because it has been used to develop a segregating population. The other two lines (Sp3.1A and Sp13.4A) were chosen because they were the only members of our germplasm collection that were flowering at the time of collection. The four tissue types were flash-frozen in liquid nitrogen immediately on collection and stored at −80°C until use. The four tissues were transported on dry ice to the W.M. Keck Center for Comparative and Functional Genomics at the University of Illinois (Urbana, IL), where total RNA was extracted from these tissue samples by Pure Link Plant RNA extraction kit (Invitrogen, Carlsbad, CA). mRNAs were isolated from individual samples by the Oligotex Kit (Qiagen, Valencia, CA). Equal amounts of mRNA from different samples were pooled, and cDNA was synthesized from pooled mRNA. cDNA was then normalized by means of a Trimmer-Direct kit (Evrogen, Russia) to minimize representation of common transcripts before being processed for pyrosequencing in a 454 GS FLX sequencer.

cDNA Data Analysis for GS FLX Reads

The adaptor sequences used for cDNA library construction were identified by cross_match (http://www.phrap.org; verified 19 July 2010), with parameters of minmatch 12 and minscore 25, and trim positions were changed in the sff files using sff tools from Roche (https://www.roche-applied-science.com; verified 19 July 2010) and in-house Java scripts. Short sequences (less than 50 nt) and homopolymer reads (reads in which 60% of the entire length is represented by one nucleotide) were filtered out by custom Java scripts.
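The two read screens described above can be illustrated with a minimal Python sketch. This is not the authors' Java code; the function names are hypothetical, and treating a read as a homopolymer read when one nucleotide makes up 60% or more of its length is an assumption about how the threshold was applied.

```python
from collections import Counter

MIN_LENGTH = 50                   # reads shorter than this (nt) are dropped
MAX_SINGLE_BASE_FRACTION = 0.60   # assumed threshold: 60% or more of one base


def passes_filter(seq: str) -> bool:
    """Return True if a read survives the length and homopolymer screens."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    # Count of the single most frequent nucleotide in the read.
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) < MAX_SINGLE_BASE_FRACTION


def filter_reads(reads: dict[str, str]) -> dict[str, str]:
    """Keep only reads passing both screens; `reads` maps read id to sequence."""
    return {read_id: s for read_id, s in reads.items() if passes_filter(s)}
```

For example, a 20-nt read fails the length screen, and a 60-nt read containing 45 adenines fails the homopolymer screen, while a 60-nt read with balanced base composition is kept.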

Assembly and Annotation of 454 Reads

Assembly was done with gsAssembler (Newbler) v. using the modified sff files and default parameters of minimum overlap length of 40 nt and minimum overlap identity of 90%. BLAST (Altschul et al., 1990) analyses of contigs and singlets derived from the assembly were conducted against six protein databases: non-redundant proteins, switchgrass, sugarcane (including Saccharum officinarum and Saccharum hybrid cultivar), and maize from NCBI (http://www.ncbi.nlm.nih.gov; verified 19 July 2010); sorghum from JGI (http://www.jgi.doe.gov/; verified 19 July 2010); and Poaceae from Gramene (http://www.gramene.org; verified 19 July 2010). Singlets shorter than 100 nt were excluded from BLAST analyses. All BLAST searches were done with an E-value cutoff of 0.00001. The top hit from each BLAST result was parsed for annotation and further analysis. The gene ontology (GO) database for Poaceae, which includes the GO IDs and terms assigned to each Poaceae protein, was downloaded from Gramene (http://www.gramene.org). GO annotation was done on the basis of BLAST results against the Poaceae database. The results from the BLAST of cordgrass contigs and singletons against Poaceae proteins were loaded into a database along with the Poaceae GO data. Each contig or singleton was assigned the same GO terms as the Poaceae protein to which it had homology. This assignment was done by means of database queries and in-house scripts.
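The GO transfer step can be sketched in Python as follows. The original work used database queries and in-house scripts that are not shown in the paper, so this is only an illustration: the tuple layout of the hits, the function names, and the example identifiers are all assumptions.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text (0.00001)


def top_hits(hits):
    """Return the best (lowest E-value) hit per query at or below the cutoff.

    `hits` is an iterable of (query_id, subject_id, evalue) tuples, an assumed
    layout resembling parsed BLAST output.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > E_VALUE_CUTOFF:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def assign_go(hits, protein_go):
    """Give each query the GO terms of its top-hit protein.

    `protein_go` maps a protein id to a set of GO ids (hypothetical input,
    standing in for the Gramene Poaceae GO table).
    """
    return {
        query: sorted(protein_go.get(subject, ()))
        for query, (subject, _evalue) in top_hits(hits).items()
    }
```

A query whose only hits are above the cutoff receives no entry, and a query whose top-hit protein has no GO record receives an empty list.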

Comparison to Sorghum bicolor Genome

Sorghum genome assembly Sbi1.4 (http://www.phytozome.net; verified 19 July 2010) was used for alignment with PCG contig sequence data. The sorghum genome data was used to create a Blast database, which was then queried with contig and EST sequence data. The output was parsed to identify significant hits with an E-value of <e−20. Hits with applicable values were sorted and binned into 1-Mb regions, allowing the mapping of the PCG contigs and ESTs to the sorghum genome with the program Circos (Krzywinski et al., 2009). Contig sequences were counted multiple times when they had multiple hits to the same 1-Mb region.
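The binning of significant hits into 1-Mb regions can be sketched in Python as below. The tuple layout and function name are hypothetical, and the use of the 1-based hit start coordinate to choose the bin is an assumption; the paper does not state which coordinate was used.

```python
from collections import Counter

BIN_SIZE = 1_000_000     # 1-Mb regions
E_VALUE_CUTOFF = 1e-20   # significance threshold stated in the text


def bin_hits(hits):
    """Count significant hits per (chromosome, 1-Mb bin index).

    `hits` is an iterable of (query_id, chromosome, start, evalue) tuples with
    a 1-based start coordinate (assumed layout). Every hit is counted, so a
    contig with several hits in the same region contributes several times.
    """
    counts = Counter()
    for _query, chromosome, start, evalue in hits:
        if evalue < E_VALUE_CUTOFF:
            counts[(chromosome, (start - 1) // BIN_SIZE)] += 1
    return counts
```

The resulting per-bin counts are the kind of values that can be written out as a histogram track for a plotting tool such as Circos.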

EST–SSR Identification and Analysis

All large contigs (defined as being >500 bp) were screened with BatchPrimer3 (You et al., 2008), which identified SSR regions and subsequently designed primers to flank the repeat sequences. Thirty contigs were chosen at random, and primers were synthesized for further investigation (Table 1). Eight different genotypes were chosen for evaluation of the EST–SSR primers. This subsample of the germplasm collection contained the four plants that are parents to a mapping population and four others chosen because of their diverse geographical origin. All PCG plants were grown under standard greenhouse conditions. DNA was extracted from all examined plants by the method described by Karakousis and Langridge (2003), with minor modifications. Evaluation of the primers was performed in a PCR reaction consisting of 2 U Taq DNA polymerase (GoTaq; Promega, Madison, WI), 1× PCR Buffer (GoTaq), 2 mM MgCl2, 0.6 mM dNTPs, and 0.6 mM of each primer, with a final volume of 20 μL. PCR reactions were performed in a BioRad MyCycler (BioRad, Hercules, CA). Thermocycling conditions were as follows: an initial denaturation at 94°C for 5 min, followed by 35 cycles of 94°C for 1 min, 53 or 55°C for 1 min, and 72°C for 1 min, followed by an extension step of 72°C for 10 min and a 10°C hold. PCR product was visualized by 8% (w/v) nondenaturing polyacrylamide gel electrophoresis (PAGE). Bands were scored and sized by comparison to a 100-bp ladder with AlphaEaseFC Software (Alpha Innotech, San Leandro, CA). Contigs containing EST–SSR regions that aligned to rice were subjected to analysis with electronic PCR (ePCR) to determine whether any cross-species amplification could occur (Rotmistrovsky et al., 2004).
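The idea of scanning a contig for SSR regions can be illustrated with a short regular-expression sketch in Python. This is not BatchPrimer3's algorithm, and the minimum repeat counts used here are hypothetical placeholders rather than the settings used in the study.

```python
import re

# Hypothetical minimum repeat counts per motif length (not the study's settings).
DEFAULT_MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_compound_of_shorter_unit(motif: str) -> bool:
    """True if the motif is itself a repeat of a shorter unit (e.g. TCTC, AAA)."""
    n = len(motif)
    return any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq: str, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    if min_repeats is None:
        min_repeats = DEFAULT_MIN_REPEATS
    seq = seq.upper()
    found = []
    for unit_len, n in min_repeats.items():
        # A unit of `unit_len` bases followed by at least n - 1 further copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            if is_compound_of_shorter_unit(motif):
                continue  # reported under the shorter motif instead
            found.append((match.start(), motif, len(match.group(1)) // unit_len))
    return sorted(found)
```

With these placeholder thresholds, a sequence containing seven consecutive TC units is reported once as a dinucleotide repeat, and a homopolymer run is not reported at all.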

Table 1.

PCG EST–SSR primer list. Results in normal text indicate alignment to maize and in bold, italic text indicate alignment to sorghum.

Contig SSR motif Estimated size (bp) Polymorphic Annealing temperature (°C) Primer BLAST results (E-value <e−20)
00687 (TC)7 184 yes 55 F GCCTTCTCATCCTTCTTGG endothelial differentiation-related factor 1 (EU961228)
01896 (AG)9 172 no 55 F TGCAGTGTCATGTGACTTTT metal ion binding protein (EU958179)
R ACAGGCTGCTCTACCTAACA similar to Heavy metal-associated domain containing protein
02282 (GCC)6 231 no 55 F TGGAACTCGTACATCAAGAA P-type R2R3 Myb protein (AF470079)
R CATGTCGTTGTAGCTTTCAG similar to Myb factor
04363 (CAT)5 169 yes 55 F TCAACACCTTCTCTGTCTCC dnaJ domain containing protein (NM_001154772)
R CAGCTCGTACAGGTCGTAG similar to H0103C06.7 protein
04382 (CTG)5 163 yes 55 F CTGCCGCAACTTACAAAG
R CGGAGTTCATCGACTTCTT similar to No apical meristem protein, expressed
06305 (TA)6 171 yes 55 F GCAGCAACAATACATGAAGA Zea mays nicotinate phosphoribosyltransferase-like protein (NM_001159021)
R ACTGAAGTCTCCGCATGA similar to OSJNBa0042L16.16 protein
06349 (CT)8 191 yes 55 F GGTGATCTGATCTTGCTGTT cytokinin dehydrogenase 10 (NM_001153366)
07313 (GT)7 155 yes 55 F CATTTCCTGGTGCATTATTA 60S ribosomal protein L37 (NM_001155742)
08011 (GCG)3 101 yes 55 F GCTCAACGCCTACTTCAA
R CTCCATAAATTCCGGGTAG similar to Oxidoreductase, 2OG-Fe oxygenase family protein
09063 (CGG)5 184 yes 55 F TCTTGGTGCTCTTGCAGTA dihydroflavonol-4-reductase (NM_001158455)
R GAGATCATCGAGCCCATC similar to Putative cinnamoyl-CoA reductase
R GCTCTTGTACTCCCTCTCCT similar to Putative RNA-binding protein
R ATGTCGAGGAAGAGGAAGA similar to Putative uncharacterized protein P0461D06.29
R TCCTGCTCCTGTAAGTTCAC similar to Protein kinase domain containing protein
16764 (CAG)5 171 yes 55 F CGGTACCGGAAGAAGAGA
R AGGTCAAGCACTCGTTCAAG similar to Histone H1-like protein
R TGTCGTACAACTCGCTGTC similar to Polygalacturonase inhibiting protein-like
R CGAGTAGCTGTCCACCAC similar to Os05g0120300 protein
R CTGATCCGAGCTGAACTG similar to Associated with HOX family protein, expressed
20797 (CCG)4…(GAT)5 550 yes 55 F TCCCTCCTGAGTCTACTCCT transcription factor BTF3 (EU956752)
R AAGTCCTCAACACCATCATC similar to Acidic leucine-rich nuclear phosphoprotein 32-related protein 1
22367 (GA)7 231 yes 55 F GGAAGGAAGGAGACGAAC polyphosphoinositide binding protein Ssh2p (EU966771)
R TGACGTACTTCACCAGCAT similar to Putative phosphatidylinositol/phosphatidylcholine transfer protein
24500 (CA)9 168 no 55 F ACCCTGGAGTCACAAATAAA autophagy-related 4 (Atg4a) (NM_001144016)
R GGTTAAGAACGATGACCTTG similar to Cysteine protease ATG4B
25164 (TC)9 173 yes 55 F GCGAACAGGTACAGAAACAC membrane protein (NM_001155277)
R AGCAAGTTCAACCACGTCT similar to Auxin-induced protein-like
R TTGTAGCAGGGCTCTATACC similar to Digalactosyldiacylglycerol synthase 1, putative, expressed
25412 (ATG)5 165 yes 55 F AGCAGTACCAGGGAAGCTAT pro-resilin precursor (EU959970)

Sequences Related to Lignin and Cellulose Biosynthesis

From the annotation of contigs against the genome databases, contigs that aligned to genes from the lignin and cellulose biosynthetic pathways were identified. The contigs identified as caffeic acid 3-O-methyltransferase (COMT) and caffeoyl-CoA O-methyltransferase (CCoAOMT) were used to perform a BLASTx search against the National Center for Biotechnology Information (NCBI) nonredundant protein sequence database. The top hits from O. sativa, Z. mays, and S. bicolor with an E-value of <e−20 were used as the basis of alignment for phylogenetic analysis; two accessions identified from wheat and one from Brachypodium sp. with E-values of <e−20 were also included in the analysis. All contig sequences were translated and, together with the NCBI sequences, trimmed for optimal alignment. Phylogenetic and molecular evolutionary analyses were conducted with MEGA version 4 (Tamura et al., 2007). Pairwise alignment of the amino acid sequences was performed with ClustalW (Larkin et al., 2007). The phylogenetic tree was constructed by the neighbor-joining method with pairwise deletion.
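Converting a contig into amino acid sequence before alignment can be sketched in Python with the standard genetic code. The paper does not say which tool or reading-frame rule was used, so the helper names and the choice of the longest stop-free stretch across six frames are assumptions made only for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str, frame: int = 0) -> str:
    """Translate one reading frame; unknown codons become X, stops become *."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def longest_open_stretch(seq: str) -> str:
    """Return the longest stop-free peptide over all six reading frames."""
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for piece in translate(strand, frame).split("*"):
                if len(piece) > len(best):
                    best = piece
    return best
```

A partial contig often lacks a start codon, which is why this sketch looks for stop-free stretches rather than complete open reading frames.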


Summary Statistics—Sequence Assembly and Composition

From the 454 GS FLX sequencing runs performed by the W.M. Keck Center, a total of 123,983,092 bp from a cDNA library was obtained, forming 556,198 sequence reads with an average read length of 223 bp. Removal of reads that did not fulfill the minimum quality standards (those with >60% homopolymer content or <50 bp long) resulted in 532,626 sequence reads. Assembly of the screened sequences resulted in the formation of 26,302 contigs from 343,631 (65%) of the reads. The contigs have an average length of 394 bp and an average depth of coverage of 20.3 reads per base pair. The size of each contig was plotted against the number of reads of each contig on a logarithmic scale (Fig. 1). Contig length tended to increase with the number of reads, although the correlation was weak (R2 = 0.1624). The top 10% of contigs by length totaled 2630, with a size range of 761 to 2117 bp, an average contig size of 990 bp, and an average depth of coverage of 58.8 reads per base pair. The remaining quality-screened ESTs formed 71,103 singletons (coverage depth = 1) with an average length of 200 bp. In total, 97,405 unigenes were assembled. The assembled contig and singleton sequences are available (see Supplementary Fig. S1 and S2, respectively). The top 10 contigs, based on number of reads, were aligned by BLASTx to the NCBI database (Table 2); six had significant (E-value of <e−20) hits to different genes, while four aligned to “hypothetical proteins”. The PCG genome is four to five times larger than the rice genome (J. Gonzalez, unpublished data), with an estimated size of 1556 to 1945 Mb. In total, all unigenes contained 24,588,878 bp of sequence, providing a theoretical coverage of 1.3 to 1.6% of the PCG genome.
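The summary figures in this paragraph follow from simple arithmetic on the reported counts, which can be checked with a few lines of Python. Only numbers stated in the text are used; the variable and function names are illustrative.

```python
TOTAL_BP = 123_983_092        # total sequence obtained (bp)
TOTAL_READS = 556_198         # number of sequence reads
CONTIGS = 26_302
SINGLETONS = 71_103
UNIGENE_BP = 24_588_878       # total sequence in all unigenes (bp)
GENOME_MB_RANGE = (1556, 1945)  # estimated PCG genome size (Mb)


def coverage_percent(sequence_bp: int, genome_mb: float) -> float:
    """Percentage of a genome of `genome_mb` megabases covered by `sequence_bp`."""
    return 100 * sequence_bp / (genome_mb * 1_000_000)


mean_read_length = TOTAL_BP / TOTAL_READS        # about 223 bp
unigenes = CONTIGS + SINGLETONS                  # 97,405
coverage_low_estimate = coverage_percent(UNIGENE_BP, GENOME_MB_RANGE[1])
coverage_high_estimate = coverage_percent(UNIGENE_BP, GENOME_MB_RANGE[0])

print(round(mean_read_length), unigenes)
print(round(coverage_low_estimate, 2), round(coverage_high_estimate, 2))
```

The larger genome estimate gives the lower coverage figure (about 1.3%), and the smaller estimate gives the higher one (about 1.6%). Because the unigenes are transcribed sequence only, this is a theoretical upper bound on genome representation rather than a measured one.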

Table 2.

Top ten contigs determined by depth of coverage (No. reads) and their respective alignments.

Contig Length (bp) No. reads BLASTx Result
22512 805 1704 Extracellular ribonuclease (Z. mays)
18270 1001 730 MADS-box transcription factor (T. aestivum)
25744 718 680 Auxin-induced protein (Z. mays)
16722 960 674 Nucleotide pyrophosphatase (O. sativa)
23305 376 659 Granule-bound starch synthase pseudogene (Z. mays)
18651 367 637 Hypothetical protein
26205 446 636 Hypothetical protein
23588 1488 606 Alcohol dehydrogenase (O. sativa)
24174 868 596 Hypothetical protein
16576 182 580 Hypothetical protein
Figure 1.

Relationship between length of each contig (bp) and depth of coverage of each contig (number of reads).


Annotation of the 454 Assembly

Comparison of PCG contigs with NCBI nonredundant database entries showed that a large number of contigs align with gene sequences from S. bicolor, Z. mays, S. officinarum, P. virgatum, and other members of the Poaceae family (Table 3). The species to which the most PCG contigs aligned significantly (E-value of <e−20) was sorghum, while the species with the fewest was switchgrass (Table 3). Whether this result is due to actual divergence of the species examined or to the current state of genomic knowledge of these species is unclear. All unigenes were examined for their association with GO terms from levels 1 to 4 of the molecular, cellular, and biological categories, with 45,810 (47.0%), 48,905 (50.2%), and 48,852 (50.2%) associated, respectively. The GO biological terms were further grouped into 15 defined categories, and their relative abundance was graphed (Fig. 2). Upon examination of library overlaps of contigs with an E-value of <e−20, no unique sequences were identified from the NCBI nonredundant database. When library overlaps were compared among the contigs from the other prevalent associated databases (Poaceae, S. bicolor, and Z. mays) and the contigs that had been assigned a GO term, unique sequences were found to occur in all four libraries (Fig. 3), with the most occurring in the GO library and the fewest in the Z. mays library. A total of 54.3% of all contigs could be annotated on the basis of one of the four categories used for the development of Fig. 3. Only 16,661 or 22.5% of singletons could be aligned to the same four categories (data not shown).

Table 3.

Alignment of PCG contigs to other databases from closely related and non-related species. Significance level at E-value <e−20.

Database Contigs which aligned significantly Total number of contigs
NCBI non-redundant 12,498 17,221
Poaceae 12,487 17,541
Sorghum bicolor 12,690 17,506
Zea mays 10,782 15,486
Saccharum officinarum 522 1119
Panicum virgatum 15 31
Figure 2.

Arbitrary enumeration of the 15 aggregates of the biological process gene ontology functional classification of PCG contigs.

Figure 3.

Overlap of the alignment of PCG contigs to the three grass libraries and gene ontology classification. The total number of contigs are indicated for each species or classification and their respective intersection.


Comparison to Sorghum Genome

To further ascertain the genome coverage of this PCG transcriptome analysis, all contigs and singletons were aligned to the S. bicolor genome (Fig. 4). Similar to the research performed on switchgrass (Tobias et al., 2008), the S. bicolor genome was chosen for comparison because it is believed to be the species most closely related to PCG with a comprehensive sequenced genome. A total of 13,892 (52.8%) of the contig sequences matched genome sequences with an E-value of <e−20, forming the 10 putative chromosomes (Fig. 4). There are 139 contig sequences mapped to supercluster regions with an E-value of <e−20, and these contigs are displayed as the 17 unique regions that were not associated with the 10 putative chromosomes (Fig. 4).

Figure 4.

PCG contigs and ESTs mapped to the sorghum genome. The 10 individual sorghum chromosomes (multicolored) are shown in association with the respective PCG sequence. An additional 17 clusters that did not assemble with the sorghum genome are represented as mint green. The positive strand is represented by the outward clusters of contig frequency (red) and EST frequency (blue). The negative strand is represented by the inward clusters of contig frequency (yellow) and EST frequency (green).


EST–SSR Marker Development

Screening all the large contigs (>500 bp long, n = 6489) for SSR regions with BatchPrimer3 identified a total of 841 SSR regions. These were found in 704 contigs with a frequency of 3.2%. Among the 841 SSR regions were di-, tri-, tetra-, penta-, and hexanucleotide repeat motifs. GC-rich repeats were identified, with the most abundant regions being CCG/GGC, CGC/GCG, and CGG/GCC, which accounted for 18.5% of all the identified SSR regions. A random set of 20 primer pairs flanking SSR regions was designed and used to amplify PCG genomic regions. All 20 primer sets produced amplicons with varying degrees of polymorphism, ranging from monomorphic single bands to a maximum of 22 amplicons per reaction (Fig. 5). The 20 EST–SSR regions that were amplified were aligned to determine their best match in both S. bicolor and Z. mays (Table 1). Seven of the 20 EST–SSR regions could not be associated with currently known protein-coding regions of S. bicolor or Z. mays but were associated with transcripts of unknown function.

Figure 5.

Example of the observed polymorphism of PCG EST–SSRs in eight different genotypes of PCG collected from South Dakota, North Dakota, Iowa, and Nebraska. PCR products were visualized on an 8% (w/v) nondenaturing PAGE. Contigs from which primers were developed were as follows: a. contig17168, b. contig16969, c. contig16767, d. contig16063, e. contig15096. The figure contains a 100-bp ladder for reference (far right).


Among the EST–SSRs, a total of 48 contigs contained di- and trinucleotide repeat motifs of longer than 15 nucleotides. A BLASTn alignment to the NCBI database of all 48 contigs identified six contigs with no significant alignment to any grass species, 14 contigs that aligned significantly (E-value of <e−20) to sequences from other grass species but not to the SSR region, and 28 with significant alignment (E-value of <e−20) to sequences from other grass species including the SSR regions (Table 4). Of the 28 contigs that showed alignment to other grass SSR regions, only six contained SSR regions identified as being highly conserved in grass species (Kantety et al., 2002). Of the 28 EST–SSR regions that aligned to similar SSR regions in other grass species, 18 aligned to sequences in rice, and their corresponding primer sequences were then used to perform ePCR (Rotmistrovsky et al., 2004). Only three PCG EST–SSR primers were found to produce potential PCR products. These were from contigs designated as 00531 and 25016, which each produced one potential PCR product (from NM_001049469 and NM_001068592, respectively) and 20426, which produced two potential products that were aligned to two rice sequences (NM_001056341 and NM_001061810).
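The principle behind the ePCR check can be sketched in Python: a primer pair yields a potential product when the forward primer and the reverse complement of the reverse primer both occur in a target sequence, in that order and within a plausible distance. This simplified version requires exact matches on one strand only and uses a hypothetical size limit; the actual ePCR tool (Rotmistrovsky et al., 2004) is more permissive and is not reproduced here.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def epcr_products(template: str, forward: str, reverse: str, max_size: int = 1000):
    """Return (start, product_size) for each exact-match primer pair placement.

    `max_size` is a hypothetical upper limit on the product length in bp.
    """
    template = template.upper()
    fwd = forward.upper()
    rev_rc = reverse_complement(reverse)
    products = []
    start = template.find(fwd)
    while start != -1:
        # Look for the reverse primer site downstream of the forward primer.
        end = template.find(rev_rc, start + len(fwd))
        while end != -1:
            size = end + len(rev_rc) - start
            if size <= max_size:
                products.append((start, size))
            end = template.find(rev_rc, end + 1)
        start = template.find(fwd, start + 1)
    return products
```

An empty result means no potential product under these strict assumptions; it does not rule out amplification with mismatched primers.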

Table 4.

Contigs containing di- and trinucleotide EST–SSR regions longer than 15 bp and their respective alignment to other grass species.

Contig Rice Sorghum Maize Wheat Barley Bamboo Other Contains SSR SSR motif
contig00099 No significant alignments
contig00531 yes GAC/CTG, TGA/ACT
contig01896 Panicum virgatum
contig02142 yes GCT/CGA
contig02282 yes GCC/CGG
contig02898 yes ACC/TGG
contig03228 yes GAC/CTG
contig03778 Puccinellia tenuiflora
contig03847 yes AGA/TCT
contig03870 yes AGC/TCG
contig03953 yes TCC/AGG
contig07196 yes CCG/GGC
contig08102 yes AAC/TTG
contig08322 yes CAG/GTC
contig08938 yes CGG/GCC
contig09510 Eragrostis tef
contig09517 No significant alignments
contig09843 Lolium perenne yes GCA/CGT
contig11685 yes CTT/GAA
contig12311 No significant alignments
contig16180 No significant alignments
contig17168 yes CAC/GTG
contig18725 yes TCA/AGT
contig18766 yes GGC/CCG
contig18981 yes GGC/CCG, CAG/GTC
contig19448 Saccharum officinarum yes GCG/CGC
contig19955 yes GCT/CGA
contig20041 yes TCA/AGT
contig20426 yes GCG/CGC
contig20737 yes GTC/CAG
contig21486 yes CTC/GAG
contig22231 No significant alignments
contig23465 No significant alignments
contig23487 yes TC/AG
contig23868 yes CTT/GAA
contig25016 yes GTT/CAA
contig25633 yes TC/AG
Phyllostachys edulis (Carrière) J.Houz. and Dendrocalamus latiflorus Munro.
ePCR amplification to rice sequence NM_001049469.
§ePCR amplification to rice sequence NM_001056341 & NM_001061810.
ePCR amplification to rice sequence NM_001068592.

Sequences Related to Lignin and Cellulose Biosynthesis

On the basis of the lignin or phenylpropanoid biosynthetic pathway as determined by Humphrey and Chapple (2002), gene families for seven phenylpropanoid pathway enzymes were identified from the annotations, specifically annotations from S. bicolor, Z. mays, Poaceae, and S. officinarum. These seven enzymes were caffeic acid 3-O-methyltransferase (COMT), cinnamate-4-hydroxylase (C4H), phenylalanine ammonia lyase (PAL), cinnamyl-alcohol dehydrogenase (CAD), caffeoyl-CoA 3-O-methyltransferase (CCoAOMT), 4-(hydroxyl)cinnamoyl CoA ligase (4CL), and cinnamoyl-CoA reductase (CCR). The phenylpropanoid enzyme in PCG with the most annotations was 4CL, followed by CAD, with the remaining five in relatively equal amounts (Table 5). Sequences that align to the remaining enzymes in the phenylpropanoid pathway, as described by Humphrey and Chapple (2002), were not identified in this analysis.

Table 5.

Lignin biosynthetic pathway enzymes identified from the annotation of the PCG transcriptome.

Enzyme S. bicolor Z. mays Poaceae S. officinarum
COMT 0 1 1 2
CCoAOMT 5 4 1 0
C4H 0 0 1 0
PAL 4 2 3 6
CAD 3 1 3 10
4CL 29 4 5 0
CCR 4 0 4 0

Gene families associated with the production of cellulose for cell walls were also identified from the annotations of the PCG transcriptome. The cellulose gene families identified within the PCG transcriptome can be grouped according to their function into cellulose synthases (CesA), cellulose synthase-like genes (Csl), glycosyl transferases, and callose synthase genes. Members of the CesA gene family were the most abundant of the cellulose biosynthetic pathway genes (data not shown).

From the seven phenylpropanoid enzymes (lignin biosynthesis), contigs relating to the enzymes COMT and CCoAOMT were further examined; these two enzymes are members of the S-adenosylmethionine dependent methyl transferase (AdoMet-MTase), class I superfamily. From the contig annotations, three COMT sequences were identified: contig13867 (from both Z. mays and Poaceae), contig03448, and contig16262 (from S. officinarum). From the contig annotations five CCoAOMT sequences were identified: contig04815, contig06137, contig06523, contig18666 (from both S. bicolor and Z. mays), and contig21502 (from both S. bicolor and Z. mays). The relative length in base pairs of each contig and its corresponding translated protein length in amino acids are displayed in Table 6. The five contigs were used to perform a BLASTx search of the NCBI database and identified similar regions in accessions from Z. mays, S. bicolor, O. sativa, T. aestivum, and Brachypodium sp. Contig 18666 was not long enough to be able to perform a BLASTx search and was dropped from further phylogenetic analysis at that point. A total of 67 accessions (Table 7) were further utilized to perform phylogenetic analysis (Fig. 6). The phylogenetic analysis grouped the accessions into seven groups, corresponding to the contigs, with two partial outliers.

Table 6.

Length of contigs found to align to the enzymes caffeic acid 3-O-methyltransferase (COMT) and caffeoyl-CoA 3-O-methyltransferase (CCoAOMT).

Contig Annotation Enzyme Length (bp) Translated length (aa)
03448 S. officinarum COMT 613 122
04815 S. bicolor, Z. mays CCoAOMT 942 229
06137 S. bicolor, Z. mays CCoAOMT 558 170
06523 S. bicolor, Z. mays CCoAOMT 334 76
13867 Z. mays, Poaceae COMT 259 69
16262 S. officinarum COMT 1192 224
18666 S. bicolor, Z. mays CCoAOMT 192 N/A
21502 S. bicolor, Poaceae CCoAOMT 998 260

Table 7.

Accessions used to perform phylogenetic analysis with the contigs identified as being the enzymes COMT and CCoAOMT.

Designation NCBI Accession Species Phylogenetic group
1 CAJ26379 Brachypodium sp. 5
2 AAP37878 Z. mays 4
3 AAP37879 4
4 AAP37881 4
5 AAP37882 4
6 AAP37884 4
7 AAP37885 4
8 AAQ24337 3
9 AAQ24338 3
10 AAQ24339 3
11 AAQ24342 3
12 AAQ24343 3
13 AAQ24344 3
14 AAQ24361 3
15 AAQ89900 4
16 ACG33694 1
17 ACG36166 5
18 ACG37598 3
19 ACG47259 2
20 ACN35776 1
21 NP_001106047 3
22 NP_001132142 2
23 NP_001140567 1
24 NP_001140761 1
25 NP_001148593 1
26 NP_001150570 7
27 NP_001150654 2
28 NP_001151485 4
29 NP_001152451 6
30 NP_001152511 6
31 Q06509 3
32 Q9XGD5 4
33 BAD14923 O. sativa 3
34 EAY80724 2
35 EAY99818 4
36 EAZ05653 3
37 EAZ09516 5
38 EAZ20379 1
39 EAZ41571 3
40 EAZ43214 6
41 EEE64356 1
42 NP_001056039 1
43 NP_001056040 1
44 NP_001056910 4
45 NP_001061012 7
46 NP_001061031 3
47 NP_001062142 6
48 NP_001062144 outlier
49 NP_001063495 5
50 NP_001067748 2
51 NP_001067749 2
52 AAL57301 S. bicolor 3
53 AAO43609 3
54 XP_002436550 4
55 XP_002441374 1
56 XP_002441379 1
57 XP_002444017 1
58 XP_002444595.1 1
59 XP_002444815 outlier
60 XP_002444818 6
61 XP_002445083 3
62 XP_002450036 2
63 XP_002450037 2
64 XP_002450658 2
65 XP_002462549 5
66 CAJ19350 T. aestivum 5
67 CAP72304 2
Figure 6.

Neighbor joining unrooted tree constructed from the pairwise distance between the protein sequences of 82 phenylpropanoid biosynthetic pathway enzymes. Clustering occurred in relation to the three contigs annotated as the enzyme COMT and five contigs annotated as CCoAOMT and their respective NCBI accessions identified by BLASTx (clades are designated by number). Clades 1, 2, and 3 represent the sequences associated with the enzyme COMT. Clades 4, 5, 6, and 7 represent the sequences associated with the enzyme CCoAOMT. Accession names are detailed in Table 7.



Before this research, only limited investigation of the PCG genome and transcriptome had been performed. The results of this investigation into the transcriptome of PCG have substantially increased our general knowledge of this uncharacterized genome. Any PCG transcriptome analysis has inherent potential limitations because of the polyploid nature of the species and the potential paralogy of gene families. Normalization of the cDNA before sequencing resolved many of these problems. The success of the normalization of the cDNA used in this analysis is demonstrated by the alignments of the 10 contigs with the most reads.

The most frequently expressed enzyme in plants is RuBisCo (ribulose-1,5-bisphosphate carboxylase oxygenase). The absence of sequences coding for RuBisCo among the 10 contigs with the most reads indicates the effectiveness of the normalization technique. The depth of coverage for contigs depends on the extent of normalization. Coverage was not exceptional in the PCG transcriptome analysis, with 54.3 and 22.5% of the PCG contigs and singletons, respectively, associated with a function via alignment. In the transcriptome analysis of another nonmodel organism, Melitaea cinxia (L.), Vera et al. (2008) noted a similar variation in depth among contigs, and rarer transcripts were often hard to find.

Additional potential challenges arise in the investigation of the PCG genome because of the lack of information about this nonmodel species. Among the species whose genome sequences are currently known or nearly so, sorghum appears to be the most closely related to PCG. This assumption is supported by the largest number of PCG contigs being annotated to the sorghum genome. By aligning the PCG contigs and singletons to the sequences of sorghum, a map indicating the distribution of the PCG sequences was obtained. This map indicates a balanced distribution of the PCG sequences across the sorghum genome, in addition to the presence of unique PCG sequences. This comparison and association between PCG and sorghum validates the use of sorghum as a model system for further PCG genome comparisons.

PCG transcriptome analysis has identified numerous markers ready for use in various applications, from gene identification to mapping. One class of markers identified was GC-rich EST–SSRs, which are very similar to those of other grass species (Kantety et al., 2002; Tobias et al., 2008). While PCG showed an overall frequency of EST–SSRs similar to that of other grass species, the motifs occurred at different frequencies (Kantety et al., 2002). The EST–SSR markers that were tested in this study showed varying degrees of polymorphism, from monomorphic to highly polymorphic. It is of interest to note that the monomorphic EST–SSR marker was designed from a primer that aligned to a lignin biosynthetic pathway gene, indicating that while this sequence may fit the criteria for an SSR, an increase or decrease of the SSR region may have deleterious effects on the functional expression of this sequence. Furthermore, by utilizing known genes from closely related species such as sorghum, putative genes associated with the cell wall biosynthetic pathways have been identified. Examination of these expressed sequences across diverse genotypes of PCG could provide information suitable for breeding and other studies. ePCR identified some EST–SSRs that could produce cross-species amplification, specifically with rice. Other cross-species amplification could be possible but would require actual PCR, since the remaining grass genome sequences are not at the same level of completion as that of rice and therefore cannot currently be used for ePCR.
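As an illustration of how perfect SSRs of the kind discussed above can be located in an EST or contig, the following Python sketch scans a sequence for tandem di-, tri-, and tetranucleotide repeats and reports the GC fraction of each motif. It is a minimal sketch and not the pipeline used in this study: the function names and the minimum repeat counts in MIN_REPEATS are assumptions chosen for illustration, since the SSR criteria applied by the authors are not stated in this section.

```python
import re

# Hypothetical thresholds (motif length -> minimum number of tandem copies).
# These are illustrative assumptions, not the criteria used in the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, copies) for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one motif; \1{n,} requires n further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ACAC"), so each SSR is reported only once.
            if any(
                motif == motif[:d] * (motif_len // d)
                for d in range(1, motif_len)
                if motif_len % d == 0
            ):
                continue
            copies = len(match.group(0)) // motif_len
            hits.append((match.start(), motif, copies))
    hits.sort()
    return hits


def gc_fraction(motif):
    """Fraction of G and C bases in a motif, e.g. to flag GC-rich SSRs."""
    return sum(base in "GC" for base in motif) / len(motif)


if __name__ == "__main__":
    example = "TTGA" + "CCG" * 6 + "ATGCA" + "AG" * 7 + "TC"
    for start, motif, copies in find_ssrs(example):
        print(start, motif, copies, gc_fraction(motif))
```

Start positions are 0-based. Only perfect (uninterrupted) repeats are detected; compound or imperfect SSRs would need a different approach.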

Phylogenetic analysis of the contigs that were annotated as being either of the phenylpropanoid (lignin) biosynthetic pathway enzymes, COMT or CCoAOMT, provided additional validation of the assembly of the 454 GS FLX sequence reads. The seven phenylpropanoid contigs each clustered separately, with the first branch occurring between the two enzymes COMT and CCoAOMT. This first major branch between the two highly related enzymes indicates that the assembly provides detailed information that will allow the identification of additional genes of interest. The specificity of the assembly is particularly demonstrated by contig06523 and contig13867, the smallest of all the translated amino acid sequences at fewer than 100 amino acids each, which both clustered individually. While the separate clustering of all seven contigs may indicate gene families present within the phenylpropanoid biosynthetic pathway, none of the seven contigs appeared to cover an entire gene; for example, the two small contigs appeared to be from the C-terminus region of the protein. Further investigations need to be performed to determine whether these seven contigs are from separate enzyme coding genes.
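A neighbor-joining tree such as the one in Figure 6 is built from a matrix of pairwise distances between aligned protein sequences. The distance measure used in this study is not stated in this section, so the Python sketch below assumes the simplest option, the p-distance (the proportion of differing residues), with columns containing a gap in either sequence ignored. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(aligned):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (n1, n2): p_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


if __name__ == "__main__":
    # Toy alignment for illustration only; not data from the study.
    toy = {"a": "MKTAY", "b": "MKSAY", "c": "MR-AY"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, dist)
```

Under this gap handling, a short fragment is compared with each full-length sequence only over the columns it covers, which is one way that partial contigs can still be assigned a distance to every other sequence.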


This research reports the first investigation of the PCG transcriptome by sequencing with the 454 GS FLX, which has provided a base and framework for expression and genome analysis in this species. The comprehensive data set from this research will assist in elucidating further genomic information and in the genetic improvement of this species and ultimately its genus and the Chloridoideae subfamily as a whole. A number of molecular markers suitable for the development of molecular maps, gene identification, and comparative genomics studies have been identified for this nonmodel species. The results from this research will be utilized in collaboration with other work being performed to turn PCG into a viable cellulosic biomass crop.


The authors acknowledge support from the Joint USDA-DOE Feedstock Genomics program (grant# 2007-35504-18236), the Center of Excellence in Drought Tolerance through the South Dakota 2010 Initiative, and the South Dakota Agri. Exp. Station.





