On the basis of these favorable growth characteristics, research attention is being directed toward the development of PCG as a cellulosic biomass crop species. Genetic improvements will be made to PCG to enhance its use as a source of biomass. These genetic changes could occur via conventional breeding, marker-assisted selection, or the incorporation of favorable genes (or characteristics) through plant transformation techniques. To facilitate any genetic improvement in PCG, comprehensive knowledge of the genome and its gene expression profile is essential. Few genetic studies have been performed in PCG to date. The genetic diversity of PCG populations has been studied with amplified fragment length polymorphisms (AFLP) (Moncada et al., 2007; J. Gonzalez, unpublished data). Molecular markers for breeding can be developed rapidly with techniques such as enriched genomic libraries and parallel pyrosequencing. Both methods can provide considerable information about the genome structure of this species. In PCG, genomic libraries enriched for simple sequence repeats (SSRs) have been developed and used to compile a linkage map (J. Gonzalez, unpublished data).
Many variables will need to be quantified and potentially altered to ensure optimal production of biofuel from PCG. Identifying beneficial alleles associated with these variables by the methods used in more thoroughly studied crop species (e.g., wheat, Triticum aestivum L., and rice, Oryza sativa L.) may not be feasible in PCG. Attention may instead be turned toward comparative and functional genomics approaches to identify putative genes in PCG. To achieve this goal, the relatedness between PCG and model grass species needs to be understood. Such studies have been conducted previously by means of expressed sequence data in switchgrass (Panicum virgatum L.), another species of interest for biofuel production, and in the model system of sorghum [Sorghum bicolor (L.) Moench] (Tobias et al., 2008).
The objective of the work presented in this paper was to investigate the transcriptome of PCG. This analysis provides the first snapshot of the PCG transcriptome. It was achieved by sequencing cDNA isolated from four PCG tissues (root, rhizome, hook, and immature inflorescence) on a 454 GS FLX sequencer. The sequencing of the PCG cDNA provided a library of expressed sequence tags (ESTs), which was assembled de novo to produce consensus sequences (contigs) with minimal redundancy. This paper describes an initial analysis of the transcriptome library, detailing the identification of molecular markers (SSRs and EST–SSRs), the mining of gene sequences (specifically those of the lignin biosynthetic pathway), and comparative genomic analysis against other species [S. bicolor, Z. mays, P. virgatum, and sugarcane (Saccharum sp.)], the Poaceae family, and a nonredundant database.
MATERIALS AND METHODS
From a collection of PCG germplasm, three lines were selected for RNA extraction. These lines were grown in standard greenhouse conditions at South Dakota State University (Brookings, SD). From one line (designated RR21), a section of roots, rhizome, and “hooks” were collected, while from two other lines (designated Sp3.1A and Sp13.4A), immature inflorescences were collected. Line RR21 was selected because it has been used to develop a segregating population. The other two lines (Sp3.1A and Sp13.4A) were chosen because they were the only members of our germplasm collection that were flowering at the time of collection. The four tissue types were immediately flash-frozen in liquid nitrogen on collection and stored at −80°C until use. The four tissues were transported to the W.M. Keck Center for Comparative and Functional Genomics at the University of Illinois (Urbana, IL) on dry ice where total RNA was extracted from these tissue samples by Pure Link Plant RNA extraction kit (Invitrogen, Carlsbad, CA). mRNAs were isolated from individual samples by the Oligotex Kit (Qiagen, Valencia, CA). Equal amounts of mRNA from different samples were pooled, and cDNA was synthesized from pooled mRNA. cDNA was then normalized by means of a Trimer-Direct kit (Evrogen, Russia) to minimize representation of common transcripts before being processed for pyrosequencing in a 454 GS FLX sequencer.
cDNA Data Analysis for GS FLX Reads
The adaptor sequences used for cDNA library construction were identified by cross_match (http://www.phrap.org; verified 19 July 2010), with parameters of minmatch 12 and minscore 25, and trim positions were changed in the sff files using the sff tools from Roche (https://www.roche-applied-science.com; verified 19 July 2010) and in-house Java scripts. Short sequences (less than 50 nt) and homopolymer reads (reads in which a single nucleotide represents 60% of the entire read length) were filtered out by custom Java scripts.
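The read-filtering rule described above can be illustrated with a minimal Python sketch. This is not the in-house Java code used in the study; the function names are hypothetical, and treating a read as a homopolymer read when one nucleotide makes up 60% or more of its length is an assumption about how the threshold was applied.

```python
from collections import Counter

MIN_LENGTH = 50            # reads shorter than this (nt) are discarded
MAX_BASE_FRACTION = 0.60   # assumed threshold: one base >= 60% of the read


def is_homopolymer_read(seq, max_fraction=MAX_BASE_FRACTION):
    """Return True if a single nucleotide dominates the read."""
    if not seq:
        return False
    # Count of the most frequent nucleotide in the read
    top_count = Counter(seq.upper()).most_common(1)[0][1]
    return top_count / len(seq) >= max_fraction


def passes_filter(seq):
    """Keep reads that are long enough and not dominated by one base."""
    return len(seq) >= MIN_LENGTH and not is_homopolymer_read(seq)


def filter_reads(reads):
    """Return only the reads that pass both quality criteria."""
    return [r for r in reads if passes_filter(r)]
```

Applied to a list of trimmed read sequences, `filter_reads` returns the subset that would proceed to assembly under these assumed thresholds.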
Assembly and Annotation of 454 Reads
Assembly was done with gsAssembler (Newbler) v.2.0.00.22 using the modified sff files and default parameters of minimum overlap length of 40 nt and minimum overlap identity of 90%. BLAST (Altschul et al., 1990) analyses of contigs and singletons derived from the assembly were conducted against six protein databases: non-redundant proteins, switchgrass, sugarcane (including Saccharum officinarum and Saccharum hybrid cultivar), and maize from NCBI (http://www.ncbi.nlm.nih.gov; verified 19 July 2010), sorghum from JGI (http://www.jgi.doe.gov/; verified 19 July 2010), and Poaceae from Gramene (http://www.gramene.org; verified 19 July 2010). Singletons shorter than 100 nt were excluded from BLAST analyses. All BLAST searches were done with an E-value cutoff of 0.00001. The top hit from each BLAST result was parsed for annotation and further analysis. The gene ontology (GO) database for Poaceae, which includes the GO IDs and terms assigned to each Poaceae protein, was downloaded from Gramene (http://www.gramene.org). GO annotation was done on the basis of BLAST results against the Poaceae database. The results from the BLAST of cordgrass contigs and singletons against Poaceae proteins were loaded into a database along with the Poaceae GO data. Each contig or singleton was assigned the same GO terms as the Poaceae protein to which it had homology. This assignment was done by means of database queries and in-house scripts.
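The top-hit parsing and GO transfer described above can be sketched as follows. This is an illustration only, not the database queries and in-house scripts used in the study. It assumes BLAST results in the 12-column tabular layout (query, subject, ..., E-value in column 11, bit score in column 12), and the function names and the `go_by_protein` lookup table are hypothetical.

```python
import csv

E_VALUE_CUTOFF = 1e-5  # the 0.00001 cutoff stated in the text


def top_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} keeping the best hit per query.

    Assumes 12-column tab-separated BLAST output; this layout is an assumption.
    """
    best = {}
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit is not significant at the chosen cutoff
        current = best.get(query)
        # Prefer the lower E-value; break ties with the higher bit score
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


def assign_go(hits, go_by_protein):
    """Give each query the GO terms of its top-hit protein (empty list if none)."""
    return {query: go_by_protein.get(hit[0], []) for query, hit in hits.items()}
```

Transferring GO terms from a single top hit is simple but inherits any annotation error in that one protein, which is one reason the E-value cutoff matters.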
Comparison to Sorghum bicolor Genome
Sorghum genome assembly Sbi1.4 (http://www.phytozome.net; verified 19 July 2010) was used for alignment with PCG contig sequence data. The sorghum genome data was used to create a B
EST–SSR Identification and Analysis
All large contigs (defined as being >500 bp) were screened with BatchPrimer3 (You et al., 2008), which identified SSR regions and subsequently designed primers to flank the repeat sequences. Thirty contigs were chosen at random, and primers were synthesized for further investigation (Table 1). Eight different genotypes were chosen for evaluation of the EST–SSR primers. This subsample of the germplasm collection contained the four plants which are parents to a mapping population and four other types chosen because of their diverse geographical origin. All PCG plants were grown under standard greenhouse conditions. DNA was extracted from all examined plants by the method as described by Karakousis and Langridge (2003), with minor modifications. Evaluation of the primers was performed in a PCR reaction consisting of 2U Taq DNA polymerase (GoTaq; Promega, Madison, WI), 1× PCR Buffer (GoTaq), 2 mM MgCl2, 0.6 mM dNTPs, and 0.6 mM of each primer, with a final volume of 20 μL. PCR reactions were performed in a BioRad MyCycler (BioRad, Hercules, CA). Thermocycling conditions were as follows: an initial denaturation at 94°C for 5 min, followed by 35 cycles of 94°C for 1 min, 53 or 55°C for 1 min, 72°C for 1 min, followed by an extension step of 72°C for 10 min, and a 10°C hold. PCR product was visualized on 8% (w/v) nondenaturing polyacrylamide gel electrophoresis (PAGE). Bands were scored and sized by comparison to a 100-bp ladder with AlphaEaseFC Software (Alpha Innotech, San Leandro, CA). Contigs which contained EST–SSR regions which aligned to rice were subjected to analysis with electronic PCR (ePCR) to determine if any cross species amplification could occur (Rotmistrovsky et al., 2004).
|Contig||SSR motif||Estimated size (bp)||Polymorphic||Annealing temperature (°C)||Primer direction||Primer sequence||BLAST results <e−20|
|00687||(TC)7||184||yes||55||F||GCCTTCTCATCCTTCTTGG||endothelial differentiation-related factor 1 (EU961228)|
|01896||(AG)9||172||no||55||F||TGCAGTGTCATGTGACTTTT||metal ion binding protein (EU958179)|
|R||ACAGGCTGCTCTACCTAACA||similar to Heavy metal-associated domain containing protein|
|02282||(GCC)6||231||no||55||F||TGGAACTCGTACATCAAGAA||P-type R2R3 Myb protein (AF470079)|
|R||CATGTCGTTGTAGCTTTCAG||similar to Myb factor|
|04363||(CAT)5||169||yes||55||F||TCAACACCTTCTCTGTCTCC||dnaJ domain containing protein (NM_001154772)|
|R||CAGCTCGTACAGGTCGTAG||similar to H0103C06.7 protein|
|R||CGGAGTTCATCGACTTCTT||similar to No apical meristem protein, expressed|
|06305||(TA)6||171||yes||55||F||GCAGCAACAATACATGAAGA||Zea mays nicotinate phosphoribosyltransferase-like protein (NM_001159021)|
|R||ACTGAAGTCTCCGCATGA||similar to OSJNBa0042L16.16 protein|
|06349||(CT)8||191||yes||55||F||GGTGATCTGATCTTGCTGTT||cytokinin dehydrogenase 10 (NM_001153366)|
|07313||(GT)7||155||yes||55||F||CATTTCCTGGTGCATTATTA||60S ribosomal protein L37 (NM_001155742)|
|R||CTCCATAAATTCCGGGTAG||similar to Oxidoreductase, 2OG-Fe oxygenase family protein|
|R||GAGATCATCGAGCCCATC||similar to Putative cinnamoyl-CoA reductase|
|R||GCTCTTGTACTCCCTCTCCT||similar to Putative RNA-binding protein|
|R||ATGTCGAGGAAGAGGAAGA||similar to Putative uncharacterized protein P0461D06.29|
|R||TCCTGCTCCTGTAAGTTCAC||similar to Protein kinase domain containing protein|
|R||AGGTCAAGCACTCGTTCAAG||similar to Histone H1-like protein|
|R||TGTCGTACAACTCGCTGTC||similar to Polygalacturonase inhibiting protein-like|
|R||CGAGTAGCTGTCCACCAC||similar to Os05g0120300 protein|
|R||CTGATCCGAGCTGAACTG||similar to Associated with HOX family protein, expressed|
|20797||(CCG)4…(GAT)5||550||yes||55||F||TCCCTCCTGAGTCTACTCCT||transcription factor BTF3 (EU956752)|
|R||AAGTCCTCAACACCATCATC||similar to Acidic leucine-rich nuclear phosphoprotein 32-related protein 1|
|22367||(GA)7||231||yes||55||F||GGAAGGAAGGAGACGAAC||polyphosphoinositide binding protein Ssh2p (EU966771)|
|R||TGACGTACTTCACCAGCAT||similar to Putative phosphatidylinositol/phosphatidylcholine transfer protein|
|24500||(CA)9||168||no||55||F||ACCCTGGAGTCACAAATAAA||autophagy-related 4 (Atg4a) (NM_001144016)|
|R||GGTTAAGAACGATGACCTTG||similar to Cysteine protease ATG4B|
|25164||(TC)9||173||yes||55||F||GCGAACAGGTACAGAAACAC||membrane protein (NM_001155277)|
|R||AGCAAGTTCAACCACGTCT||similar to Auxin-induced protein-like|
|R||TTGTAGCAGGGCTCTATACC||similar to Digalactosyldiacylglycerol synthase 1, putative, expressed|
|25412||(ATG)5||165||yes||55||F||AGCAGTACCAGGGAAGCTAT||pro-resilin precursor (EU959970)|
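To make the SSR screening step concrete, the following Python sketch locates perfect di- to hexanucleotide repeats with regular expressions. It is a simplified stand-in for BatchPrimer3, not a reimplementation of it: it does no primer design, and the minimum repeat counts in `MIN_REPEATS` are illustrative assumptions rather than the settings used in this study.

```python
import re

# Assumed minimum repeat counts per motif length (hypothetical thresholds)
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def is_compound_of_shorter_unit(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'TCTC', 'AAA')."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif == motif[:d] * (n // d):
            return True
    return False


def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_compound_of_shorter_unit(motif):
                continue  # already reported under the shorter motif, or a homopolymer
            repeats = len(match.group(0)) // unit_len
            found.append((match.start(), motif, repeats))
    return sorted(found)
```

Because the regular expression reports the leftmost match, the same repeat tract may be named by a rotated motif (for example CT rather than TC) depending on the flanking base.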
Sequences Related to Lignin and Cellulose Biosynthesis
From the annotation of contigs against the genome databases, contigs that aligned to genes from the lignin and cellulose biosynthetic pathways were identified. The contigs identified as caffeic acid 3-O-methyltransferase (COMT) and caffeoyl-CoA O-methyltransferase (CCoAOMT) were used to perform a BLASTx search against the National Center for Biotechnology Information (NCBI) nonredundant protein sequence database. The top hits from O. sativa, Z. mays, and S. bicolor with an E-value of <e−20 were used as a basis of alignment for phylogenetic analysis; two accessions identified from wheat and one from Brachypodium sp. with E-values of <e−20 were also included in the analysis. All contig sequences were translated and, together with the NCBI sequences, were trimmed for optimal alignment. Phylogenetic and molecular evolutionary analyses were conducted with MEGA version 4 (Tamura et al., 2007). Pairwise alignment of the amino acid sequences was performed with ClustalW (Larkin et al., 2007). The phylogenetic tree was constructed by the neighbor-joining method with pairwise deletion.
Summary Statistics—Sequence Assembly and Composition
From the 454 GS FLX sequencing runs performed by the W.M. Keck Center, a total of 123,983,092 bp was obtained from the cDNA library in 556,198 sequence reads, with an average read length of 223 bp. Removal of reads that did not meet the minimum quality standards (>60% homopolymers or <50 bp long) left 532,626 sequence reads. Assembly of the screened sequences resulted in 26,302 contigs formed from 343,631 (65%) of the reads. The contigs have an average length of 394 bp and an average depth of coverage of 20.3 reads per base pair. The size of each contig was plotted against the number of reads in each contig on a logarithmic scale (Fig. 1). Contig length tended to increase with the number of reads, but the relationship was weak (R2 = 0.1624). The top 10% of contigs by length totaled 2630, with a size range of 761 to 2117 bp, an average contig size of 990 bp, and an average depth of coverage of 58.8 reads per base pair. The remaining quality-screened ESTs formed 71,103 singletons (coverage depth = 1) with an average length of 200 bp. In total, 97,405 unigenes were assembled. The assembled contig and singleton sequences are available (see Supplementary Fig. S1 and S2, respectively). The top 10 contigs, based on number of reads, were aligned by BLASTx to the NCBI database (Table 2); six had significant (E-value of <e−20) hits to different genes, while four aligned to "hypothetical proteins". The PCG genome is four to five times larger than the rice genome (J. Gonzalez, unpublished data), with an estimated size of 1556 to 1945 Mb. In total, the unigenes contained 24,588,878 bp of sequence, providing a theoretical coverage of 1.3 to 1.6% of the PCG genome.
|Contig||Length (bp)||No. reads||BLASTx Result|
|22512||805||1704||Extracellular ribonuclease (Z. mays)|
|18270||1001||730||MADS-box transcription factor (T. aestivum)|
|25744||718||680||Auxin-induced protein (Z. mays)|
|16722||960||674||Nucleotide pyrophosphatase (O. sativa)|
|23305||376||659||Granule-bound starch synthase pseudogene (Z. mays)|
|23588||1488||606||Alcohol dehydrogenase (O. sativa)|
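The summary figures reported in this section follow from simple arithmetic on the stated counts, which the short Python check below reproduces. All input numbers are taken from the text; the variable and function names are only illustrative.

```python
# Values reported in the text
TOTAL_BP = 123_983_092
RAW_READS = 556_198
SCREENED_READS = 532_626
ASSEMBLED_READS = 343_631
CONTIGS = 26_302
SINGLETONS = 71_103
UNIGENE_BP = 24_588_878
GENOME_MB_LOW, GENOME_MB_HIGH = 1556, 1945  # estimated PCG genome size range


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


mean_read_length = TOTAL_BP / RAW_READS                       # about 223 bp
assembled_pct = percent(ASSEMBLED_READS, SCREENED_READS)      # about 65%
unigenes = CONTIGS + SINGLETONS                               # 97,405
# Larger genome estimate gives the lower coverage bound, and vice versa
coverage_low = percent(UNIGENE_BP, GENOME_MB_HIGH * 1_000_000)
coverage_high = percent(UNIGENE_BP, GENOME_MB_LOW * 1_000_000)

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"reads assembled:  {assembled_pct:.0f}%")
print(f"unigenes:         {unigenes}")
print(f"genome coverage:  {coverage_low:.1f} to {coverage_high:.1f}%")
```

The coverage range is theoretical: it treats every unigene base as unique genomic sequence, so redundancy among paralogous or allelic transcripts would lower the true figure.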
Annotation of the 454 Assembly
Comparison of PCG contigs with NCBI nonredundant database entries showed that a large number of contigs align with gene sequences from S. bicolor, Z. mays, S. officinarum, P. virgatum, and other members of the Poaceae family (Table 3). The species to which the most PCG contigs aligned significantly (E-value of <e−20) was sorghum, while the species with the fewest was switchgrass (Table 3). Whether this result is due to actual divergence of the species examined or to the current state of genomic knowledge of these species is unclear. All unigenes were examined for their association with GO terms from levels 1 to 4 of the molecular, cellular, and biological categories, with 45,810 (47.0%), 48,905 (50.2%), and 48,852 (50.2%) associated, respectively. The GO biological terms were further grouped into 15 defined categories, and their relative abundance was graphed (Fig. 2). Upon examination of library overlaps of contigs with an E-value of <e−20, no unique sequences were identified from the NCBI nonredundant database. When library overlaps were compared among the contigs from the other prevalent associated databases (Poaceae, S. bicolor, and Z. mays) and the contigs that had been assigned a GO term, unique sequences were found in all four libraries (Fig. 3), with the most occurring in the GO library and the fewest in the Z. mays library. A total of 54.3% of all contigs could be annotated on the basis of one of the four categories used for the development of Fig. 3. Only 16,661 or 22.5% of singletons could be aligned to the same four categories (data not shown).
|Database||Contigs which aligned significantly||Total number of contigs|
Comparison to Sorghum Genome
To further ascertain the genome coverage of this PCG transcriptome analysis, all contigs and singletons were aligned to the S. bicolor genome (Fig. 4). Similar to the research performed on switchgrass (Tobias et al., 2008), the S. bicolor genome was chosen for comparison because it is believed to be the species most closely related to PCG with a comprehensive sequenced genome. A total of 13,892 (52.8%) of the contig sequences matched genome sequences with an E-value of <e−20, forming the 10 putative chromosomes (Fig. 4). There are 139 contig sequences mapped to supercluster regions with an E-value of <e−20, and these contigs are displayed as the 17 unique regions that were not associated with the 10 putative chromosomes (Fig. 4).
EST–SSR Marker Development
Screening all the large contigs (>500 bp long, n = 6489) for SSR regions with BatchPrimer3 identified a total of 841 SSR regions. These were found in 704 contigs with a frequency of 3.2%. The 841 SSR regions included di-, tri-, tetra-, penta-, and hexanucleotide repeat motifs. GC-rich repeats were identified, with the most abundant being CCG/GGC, CGC/GCG, and CGG/GCC, which accounted for 18.5% of all the identified SSR regions. A random set of 20 primer sets, designed to encompass SSR regions, was used to amplify PCG genomic regions. All 20 primer sets produced amplicons with varying degrees of polymorphism, ranging from monomorphic single bands to a maximum of 22 amplicons per reaction (Fig. 5). The 20 amplified EST–SSR regions were aligned to determine their best match in both S. bicolor and Z. mays (Table 1). Seven of the 20 EST–SSR regions could not be associated with currently known protein-coding regions of S. bicolor or Z. mays but were associated with transcripts of unknown function.
Among the EST–SSRs, a total of 48 contigs contained di- and trinucleotide repeat motifs of longer than 15 nucleotides. A BLASTn alignment to the NCBI database of all 48 contigs identified six contigs with no significant alignment to any grass species, 14 contigs that aligned significantly (E-value of <e−20) to sequences from other grass species but not to the SSR region, and 28 with significant alignment (E-value of <e−20) to sequences from other grass species including the SSR regions (Table 4). Of the 28 contigs that showed alignment to other grass SSR regions, only six contained SSR regions identified as being highly conserved in grass species (Kantety et al., 2002). Of the 28 EST–SSR regions that aligned to similar SSR regions in other grass species, 18 aligned to sequences in rice, and their corresponding primer sequences were then used to perform ePCR (Rotmistrovsky et al., 2004). Only three PCG EST–SSR primers were found to produce potential PCR products. These were from contigs designated as 00531 and 25016, which each produced one potential PCR product (from NM_001049469 and NM_001068592, respectively) and 20426, which produced two potential products that were aligned to two rice sequences (NM_001056341 and NM_001061810).
|Contig||Rice||Sorghum||Maize||Wheat||Barley||Bamboo||Other||Contains SSR||SSR motif|
|contig00099||No significant alignments|
|contig09517||No significant alignments|
|contig12311||No significant alignments|
|contig16180||No significant alignments|
|contig22231||No significant alignments|
|contig23465||No significant alignments|
Sequences Related to Lignin and Cellulose Biosynthesis
On the basis of the lignin or phenylpropanoid biosynthetic pathway as determined by Humphrey and Chapple (2002), gene families for seven phenylpropanoid pathway enzymes were identified from the annotations, specifically annotations from S. bicolor, Z. mays, Poaceae, and S. officinarum. These seven enzymes were caffeic acid 3-O-methyltransferase (COMT), cinnamate-4-hydroxylase (C4H), phenylalanine ammonia lyase (PAL), cinnamyl-alcohol dehydrogenase (CAD), caffeoyl-CoA 3-O-methyltransferase (CCoAOMT), 4-(hydroxyl)cinnamoyl CoA ligase (4CL), and cinnamoyl-CoA reductase (CCR). The phenylpropanoid enzyme with the most annotations in PCG was 4CL, followed by CAD, with the remaining five present in relatively equal amounts (Table 5). Sequences that align to the remaining enzymes in the phenylpropanoid pathway, as described by Humphrey and Chapple (2002), were not identified in this analysis.
|S. bicolor||Z. mays||Poaceae||S. officinarum|
Gene families associated with the production of cellulose for cell walls were also identified from the annotations of the PCG transcriptome. The cellulose gene families identified within the PCG transcriptome can be grouped according to their function into cellulose synthases (CesA), cellulose synthase-like genes (Csl), glycosyl transferases, and callose synthase genes. Members of the CesA gene family were the most abundant of the cellulose biosynthetic pathway genes (data not shown).
From the seven phenylpropanoid enzymes (lignin biosynthesis), contigs relating to the enzymes COMT and CCoAOMT were further examined; these two enzymes are members of the S-adenosylmethionine dependent methyl transferase (AdoMet-MTase), class I superfamily. From the contig annotations, three COMT sequences were identified: contig13867 (from both Z. mays and Poaceae), contig03448, and contig16262 (from S. officinarum). From the contig annotations five CCoAOMT sequences were identified: contig04815, contig06137, contig06523, contig18666 (from both S. bicolor and Z. mays), and contig21502 (from both S. bicolor and Z. mays). The relative length in base pairs of each contig and its corresponding translated protein length in amino acids are displayed in Table 6. The five contigs were used to perform a BLASTx search of the NCBI database and identified similar regions in accessions from Z. mays, S. bicolor, O. sativa, T. aestivum, and Brachypodium sp. Contig 18666 was not long enough to be able to perform a BLASTx search and was dropped from further phylogenetic analysis at that point. A total of 67 accessions (Table 7) were further utilized to perform phylogenetic analysis (Fig. 6). The phylogenetic analysis grouped the accessions into seven groups, corresponding to the contigs, with two partial outliers.
|Contig||Annotation||Enzyme||Length (bp)||Translated length (aa)|
|04815||S. bicolor, Z. mays||CCoAOMT||942||229|
|06137||S. bicolor, Z. mays||CCoAOMT||558||170|
|06523||S. bicolor, Z. mays||CCoAOMT||334||76|
|13867||Z. mays, Poaceae||COMT||259||69|
|18666||S. bicolor, Z. mays||CCoAOMT||192||N/A|
|21502||S. bicolor, Poaceae||CCoAOMT||998||260|
|Designation||NCBI Accession||Species||Phylogenetic group|
Before this research, few investigations of the PCG genome and transcriptome had been performed. The results of this investigation into the transcriptome of PCG have vastly increased our general knowledge of this uncharacterized genome. Potential limitations will be inherent in any PCG transcriptome analysis because of the polyploid nature of the species and the potential paralogy of gene families. Application of normalization techniques before cDNA sequencing has mitigated many of these inherent problems. The success of the normalization of the cDNA used in this analysis is demonstrated by the alignments of the 10 contigs with the most reads.
The most frequently expressed enzyme in plants is RuBisCo (ribulose-1,5-bisphosphate carboxylase oxygenase). The absence of sequences coding for RuBisCo in the top 10 contigs indicates the effectiveness of the normalization technique. The depth of coverage for contigs depends on the extent of normalization. The annotation rate of the PCG transcriptome was not exceptional, with 54.3% of contigs and 22.5% of singletons associated with a function via alignment. In the transcriptome analysis of another nonmodel organism, Melitaea cinxia (L.), Vera et al. (2008) noted a similar depth variation among contigs, while rarer transcripts were often hard to find.
Additional potential challenges arise in the investigation of the PCG genome because of the lack of information about this nonmodel species. Among the species whose genome sequences are currently known or nearly so, sorghum appears to be the most closely related to PCG. This assumption is supported by the finding that more PCG contigs were annotated to sorghum than to any other species. By aligning the PCG contigs and singletons to the sorghum sequences, a map indicating the distribution of the PCG sequences was obtained. This map indicates a balanced distribution of the PCG sequences across the sorghum genome in addition to the presence of unique PCG sequences. This comparison and association between PCG and sorghum supports the use of sorghum as a model system for further PCG genome comparisons.
PCG transcriptome analysis has identified numerous markers ready for use in various applications, from gene identification to mapping. One class of markers identified was GC-rich EST–SSRs that are very similar to those of other grass species (Kantety et al., 2002; Tobias et al., 2008). While PCG showed an overall frequency of EST–SSRs similar to that of other grass species, the motifs occurred at different frequencies (Kantety et al., 2002). The EST–SSR markers that were tested in this study showed varying degrees of polymorphism, from monomorphic to highly polymorphic. It is of interest to note that the monomorphic EST–SSR marker was designed from a contig that aligned to a lignin biosynthetic pathway gene, indicating that while this sequence may fit the criteria for an SSR, an increase or decrease in the length of the SSR region may have deleterious effects on the functional expression of this sequence. Furthermore, by utilizing known genes from closely related species such as sorghum, putative genes associated with the cell wall biosynthetic pathways have been identified. Examination of these expressed sequences across diverse genotypes of PCG could provide information suitable for breeding and other studies. The use of ePCR identified some EST–SSRs that could produce cross-species amplification, specifically with rice. Other cross-species amplification could be possible but would require actual PCR, because the genome sequences of the remaining grass species are not yet as complete as that of rice and therefore cannot currently be used for ePCR.
Phylogenetic analysis of the contigs that were annotated as either of the phenylpropanoid (lignin) biosynthetic pathway enzymes COMT or CCoAOMT provided additional validation of the assembly of the 454 GS FLX sequence reads. Each of the seven phenylpropanoid contigs clustered separately from the others, with the first branch occurring between the two enzymes COMT and CCoAOMT. This first major branch between the two highly related enzymes indicates that the assembly provides detailed information that will allow the identification of additional genes of interest. The specificity of the assembly is particularly demonstrated by contig06523 and contig13867, the two smallest translated amino acid sequences at less than 100 amino acids each, which nevertheless clustered individually. While the separate clustering of all seven contigs may indicate gene families present within the phenylpropanoid biosynthetic pathway, none of the seven contigs appeared to cover an entire gene; for example, the two small contigs appeared to be from the C-terminus region of the protein. Further investigations are needed to determine whether these seven contigs are from separate enzyme-coding genes.
This research reports the first investigation of PCG transcriptome sequencing in a 454 GS FLX, which has provided a base and framework for expression and genome analysis in this species. The comprehensive data set from this research will assist in elucidating further genomic information and genetic improvement for this species and ultimately its genus and the Chloridoideae subfamily as a whole. A number of molecular markers suitable for the development of molecular maps, gene identification, and comparative genomics studies have been identified for this nonmodel species. The results from this research will be utilized in collaboration with other work being performed to turn PCG into a viable cellulosic biomass crop.