Recent attention has been brought to the diploid strawberry species Fragaria vesca Coville because of its utility as a highly tractable experimental system within the economically important Rosaceae family (Folta and Davis, 2006). Strawberry is emerging as a powerful functional genomics system. Fragaria vesca’s haploid genome size of ∼200 Mb is among the smallest of the rosaceous crop plants (Shulaev et al., 2008), and its complete genome sequence will be available in 2010, providing an invaluable resource with reference to its octoploid, cultivated relatives. Most importantly, several genotypes of both diploid and octoploid strawberry are readily transformable, making it an outstanding system for functional studies. Protocols exist for diploid and octoploid lines that allow efficient regeneration of transgenic materials in a period of weeks to months (Folta and Dhingra, 2006; Mezzetti and Costantini, 2006; Oosumi et al., 2006). In vivo, strawberry is a small, perennial, herbaceous plant that cycles quickly from seed to seed. These qualities make strawberry a particularly powerful system to answer questions relative to the Rosaceae family as a whole.
Despite strawberry's advantages for reverse genetic study, there is comparatively little representation of strawberry genic sequences in public databases, slowing functional and translational genomics efforts. While Sanger sequencing projects involving Fragaria × ananassa (Folta et al., 2005) and F. vesca (Brese et al., unpublished—GenBank accession numbers DV438013–DV440728; Slovin and Rabinowicz, 2007) have been conducted, the available sequence data is too sparse to allow deep genomic analysis and gene identification. Annotation of new genome sequence would benefit from a more comprehensive accounting of expressed genes from many tissue types, treatments, and developmental stages. Documentation of expression would provide evidence for genome-wide gene assignments that currently lack corresponding functional identities (i.e., unknown and/or hypothetical proteins). A rich description of allelic variation in the octoploid genome would provide a basis to generate genomic landmarks for development of tools useful in marker-assisted selection. Much can be learned from a broad and systematic survey of the cultivated strawberry transcriptome.
In this study, octoploid strawberry plants received one of at least 30 specific treatments, including various growth regulators, light conditions, pharmacological agents, or stresses. Whole plants and detached tissues were treated and samples for RNA harvested at 1- and 24-h time points. Individual plants were dissected into discrete tissue pools (roots, leaves, petioles, stolons, runner tips, flowers, fruits, crowns, and a mixture of other minor tissues) that were individually sequenced in parallel (Margulies et al., 2005) by means of the Roche-454 GS-FLX system. Each tissue type originated from ‘Strawberry Festival’, a popular cultivar that has been well evaluated for production characteristics and has some existing molecular coverage. The many treatments over a range of tissues coupled with deep sequencing provided a complex pool of transcripts that likely represents a substantial portion of the strawberry transcriptome.
The allelic richness associated with octoploidy provides unique challenges for the analysis of massive transcript sets. Given its purported allopolyploid subgenome composition of AAA′A′BBB′B′ (Bringhurst, 1990), the octoploid strawberry harbors chromosomes (and the alleles that they carry) that relate to each other as homologs or as homeologs. Thus, expressed sequence tag (EST) contig construction must cope with the possible representation of closely similar sequences related as alleles within subgenomes, “allelic paralogs” between subgenomes, or paralogs within subgenomes. Rather than flounder in the insurmountable task of partitioning a vast sea of ESTs into gene-specific contigs, we have adopted a strategy of assembling composite contigs on the basis of homology groupings. This strategy allows convenient and illuminating consolidation of data while enabling the expression pattern of each contig member to be traced to its respective tissue and treatment source(s). This report describes the results of these computational and genomic approaches and their outcome: the substantial expansion and enrichment of available strawberry transcriptome information.
MATERIALS AND METHODS
All plants were maintained in a greenhouse under a natural ambient photoperiod (approximately 11–11.5 h daylength) and conditioned temperatures (∼19°C nights and ∼25°C days. Mature strawberry plants of University of Florida cultivar Strawberry Festival were used in these trials. The use of this cultivar is justified on the basis of its east-west lineage (Chandler et al., 2000), which may enhance allelic or genic diversity in the transcript set. The same line has been used previously in transcriptome sampling (Folta et al., 2005), and it is the dominant cultivar used for winter strawberry production in the Southeast USA (Chandler et al., 2000). It exhibits many favorable traits for cultivation, including large berry size, outstanding firmness, early flowering, and at least partial resistance to multiple diseases.
Tissues were grouped by type after treatment. Eight tissue types were selected (Fig. 1), including leaves, crowns, petioles, stolons, runner tips (of many developmental stages), roots, fruits (all stages from fertilized ovary to over-ripe), and flowers (small buds through senescent flowers, all tissue beyond pedicel) and are summarized in Supplementary Table S1. A ninth (“medley”) pool was comprised of specialized tissues, including root tips, dormant axial buds, apical meristems, leaf vascular tissue, achenes (all stages from early and unexpanded to dry), etiolated seedlings, and callusing tissue (details in Supplementary Table S2). The rationale of this mixed sample set was to obtain representation from specialized tissues that are likely rich in rare transcripts that would be otherwise absent or diluted out in the major tissue pools.
The central goal of this study was to obtain a comprehensive sampling of the Fragaria transcriptome. To accomplish this task, individual plants were subjected to an extensive set of treatments, as detailed in Table 1. The vast majority of treatments were performed on mature plants. The details of all treatments are shown in Supplementary Table S3. Briefly, pharmacological treatments were performed by simultaneous atomized sprays and drenches. In addition, such treatments were also administered to detached tissues incubated in water containing the reagent of interest. Tissue from fruits and crowns were cut lengthwise during incubation to permit treatment of internal tissues. Light treatments were provided to mature plants that were previously dark-adapted for at least 4 d. In the vast majority of cases, tissues were harvested at 1- and 24-h time points (post-treatment) to obtain transcripts associated with acute treatment and sustained acclimation under experimental conditions. Treatments and harvests were not synchronized to diurnal cycles, but a complete set of untreated tissue was prepared from plants at 3-h intervals to ensure representation of transcripts accumulating throughout light/dark cycles.
Root tips were obtained as the distal 1 mm of primary and lateral roots by cutting the tissue with a scalpel in distilled water. Axial and apical meristems were obtained by carefully dissecting crowns, sequentially removing petioles to reveal axial buds, and eventually plant apices.
Etiolated seedlings were obtained from surface-sterilized seeds stratified for 1 wk at 4°C, then moved to darkness for 14 d on 0.8% (w/v) Difco Granulated Agar (BD Inc., Franklin Lakes, NJ) supplemented with 1 mM each CaCl2 and KCl. All tissues were flash-frozen in liquid nitrogen at the appropriate time points.
Each frozen tissue sample type arising from an individual treatment and time point was pulverized with a hammer and then ground to a fine powder in liquid nitrogen with a mortar and pestle. Crowns, fruits, and mature roots were first pulverized with a hammer and then ground in a coffee grinder cooled with liquid nitrogen. Approximately 200 mg of each powdered sample was combined in one tube for each of the eight principle tissue types listed in Table 1. The medley pool was comprised of RNA from minor tissue sources and prepared separately from individual samples. Final RNA amounts were quantified and combined so that the sequenced medley pool contained equal amounts of RNA from each of these subtypes.
In all cases, RNA was isolated by a modification of the pine cone RNA extraction procedure (Chang et al., 1993). This CTAB-based protocol uses LiCl precipitation. The protocol is detailed on http://strawberrygenomics.com/protocols/; verified 27 July 2010.
cDNA Library Construction
Each of the nine RNA pools was transcribed into first strand cDNA by SuperScript II Reverse Transcriptase (Invitrogen catalog # 18064022) with Clontech's Creator SMART cDNA Library Construction Kit (catalog # 634903; Clontech, Mountain View, CA) and a modified 3′ CDS III/3′ cDNA synthesis primer (sequence: 5′-TAGAGGCCGAGGCGGCCGACATGTTTTGTTTTTTTTTCTTTTTTTTTTVN-3′). Mismatches within the polyT portion of the modified 3′ CDS III/3′ cDNA synthesis primer were included to improve pyrosequencing efficiency.
First strand cDNA products were amplified by LD PCR using Advantage 2 Polymerase Mix (Clontech catalog # 639201). The optimal number of PCR cycles for amplification of each first strand cDNA pool was determined by titration of PCR cycle number. Several optimized PCRs were then performed in parallel with aliquots of each first strand cDNA reaction to produce quantities of amplified cDNA sufficient for pyrosequencing.
Proteinase K treatment and SfiI digestion of amplified cDNAs were performed essentially as described in the protocols accompanying Clontech's Creator SMART cDNA Library Construction Kit. Enzyme amounts were increased in proportion to the quantity of cDNA treated. Primers, dNTPs, and small polynucleotide fragments were removed by passage through Montage PCR Centrifugal Filter Devices (Millipore, Billerica, MA). The quality and size distribution of cDNAs were evaluated by agarose gel electrophoresis and concentrations determined by spectrophotometry.
454 MID-tagging, Library Construction, and Sequencing
The 454 sequencing library was prepared as per the FLX protocol with the following modifications. Separate Multiplexing Identifiers (MIDs) were added to each of the cDNA samples. The cDNA was sheared with 310 mPa (45 PSI) for 2 min 45 s on the basis of experimental calibration. Each MID library consisting of ssDNA was analyzed on Experion RNA High Sense Chip (BioRad, Hercules, CA) to determine average molecule size and concentration. The libraries were pooled at equal molecule densities and titrated by enrichment. The enriched titration was then sequenced as per standard protocol (F. Hoffmann-La Roche AG, Basel). Analysis of the titration sequencing provided ratios of productivity for each sample (i.e., number of reads from each). The amounts of starting DNA were normalized and the cDNAs repooled. The pool was sequenced as per the 454 sequencing protocol. After sequencing, full analysis and image processing was performed by the gsRunBrowser tool. Reads were sorted on the basis of the MID via a custom script and placed into separate files. Each MID read file was assembled by gsAssembler, which allowed the sequencing primer sequences to be removed from each read to prevent inaccurate assemblies.
Assembly of 454 GS-FLX Reads
The 454 strawberry transcriptome sequences were assembled with the Roche 454 Newbler assembler (Margulies et al., 2005). Assembly was performed on the raw SFF format files containing 1,179,876 454 FLX transcriptome reads corresponding to nine different tissue pools of strawberry (Supplementary Table S4). 454 MID tags and SMART kit cDNA synthesis primer sequences were trimmed by Newbler during the assembly process. This assembly resulted in 26,403 contigs (average size = 389.5 bp, maximum contig length = 6785, minimum contig length = 91, N50 Size = 484). Supplementary Table S5 presents the frequency distribution of various contig sizes based solely on the Newbler-based assembly strategy.
In a parallel effort, the 454 SFF files were processed with PyroBayes (Quinlan et al., 2008), a base-caller designed to promote increased accuracy in base quality estimates for 454 reads. The resultant sequence and quality files were processed with a custom perl script designed to trim MID tags and SMART kit cDNA synthesis primer sequences were removed with LUCY (Chou and Holmes, 2001) using the following parameters: -range 30 20 20; -alignment 8 12 16; -cDNA; -minimum 2525; –debug. This processing resulted in 475,606 reads with an average read length of 201 bp. The resulting reads were assembled by the TGI Clustering tools (TGICL) (Pertea et al., 2003). TGICL automates clustering using NCBI's megablast (Zhang et al., 2000) and assembles resultant clusters using the CAP3 (Huang and Madan, 1999) assembler. This assembly resulted in 43,732 contigs (average size = ∼372 bp, maximum contig length = 2755, minimum contig length = 42, N50 Size = 390). The results of this analysis are presented in Supplementary Table S1.
Identification and Elimination of Nonstrawberry and Non-EST Contigs
Preliminary assessment of contigs obtained from both Newbler and TGICL assemblies indicated the presence of bacteria and fungal sequences, as anticipated, as subject plants were greenhouse grown. A significant amount of chloroplast genome sequence was also detected. To identify and remove this information from downstream consideration, all Newbler and TGICL contig assemblies were aligned by BLAST (WUBLASTN Version 2.0, B = 5; V = 5; E = 0.000001; links; topcombo N = 1) to bacterial (75,342 sequences), insect (75,439 sequences), chloroplast (14,293 sequences), and mitochondrial sequences (65,137 sequences) collected from GenBank (Supplementary Table S6). Fungal sequences were not removed to limit unintended elimination of actual plant sequences. In addition, any 454 transcript assembly that aligned to insect, fungi, or bacterial sequences annotated as “ribosome”, “ribosomal”, or “histone” was retained to avoid removing strawberry representatives of highly conserved ribosomal and histone proteins. Contigs that aligned to plastid, nonribosomal, or insect and/or bacterial sequences were eliminated from further analysis. The results of these analyses are shown Supplementary Table S7.
In Silico Detection of Paralogous and Allelic Sequences
The Newbler assembler produced a lower number of contigs and longer average lengths than did TGICL, suggesting that Newbler was able to better collapse variations of expressed alleles arising from homeologous loci in the octoploid genome. In addition, BLASTN alignment of the TGICL contigs to the Newbler contigs revealed several instances where multiple TGICL assembled contigs align across the length of a single Newbler contig.
While Newbler assembler produced a higher quality assembly than CAP3, it remains a possibility that significant sequence divergence between alleles may have resulted in allelic forms of the same transcript assembling into separate contigs. To address this, the Newbler contig collection was aligned to itself with WU-BLASTN (WUBLASTN Version 2.0, E = 0.000001; links; topcombo N = 1) to identify alignments that occurred over at least 90% of the length of the subject or query sequence with a minimum of 90% nucleotide identity. This resulted in the identification of 3058 clusters containing two or more similar contigs (Supplementary Table S8).
Clusters of assemblies that represent allelic variants of the same locus were distinguished from paralogous loci by mapping all Newbler contigs against the draft sequence (Shulaev et al., unpublished) of the diploid strawberry F. vesca genome using GMAP (Wu and Watanabe, 2005). Each mapped contig on the F. vesca genome was given a positional identification (Cluster Island ID). Two contigs were considered to be overlapping if at least one base overlapped between them: overlapping contigs were assigned the same genome Cluster Island ID. Cases of assemblies that cluster based on BLAST, but clearly map to different locations on the F. vesca draft sequence, likely represent parologous loci. Cases of assemblies that cluster by BLAST and occupy the same or overlapping regions of the F. vesca genome represent instances where Newbler failed to assemble alleles into a single contig. In these cases, overlapping assemblies were assigned a “Virtual Contig” ID that enabled all overlapping assemblies, and by extension their component reads, to be associated during expression analysis. For the purpose of annotation, the longest assembly sequence associated with a given “Virtual Contig” was used to represent the transcript sequence at that locus.
TGICL contigs were also mapped to the F. vesca genome by GMAP in a fashion similar to the treatment of Newbler assemblies. Mapped TGICL contigs (or clusters) that were distinct in location from the collection of Newbler assemblies (7073 contigs) were added to the final Newbler contig set. In total, 32,228 contigs were defined during the contig-building process (Supplementary Tables S9-S11).
Deconvoluting Assemblies into Reads Representing Tissues of Origin
Because each read was MID-tagged before assembly, all reads incorporated into a given contig (or virtual contig) could be mapped to their corresponding MID pools on the basis of the read names as extracted from the corresponding assembly's .ace file(s). On this basis, an approximate expression profile was associated with each contig by specifying the set of MID pools (and therefore tissue and treatment sources) that contributed to the contig. A complete set of all reads broken down into the tissue of origin is presented in Supplementary Spreadsheet SS1.
Functional assignments were determined by means of a locally installed instance of Blast2GO (Conesa et al., 2005). This installation (October 2009) retrieves gene ontology (GO) data both locally and from the Blast2GO (B2G) database from a variety of sources including the NCBI nucleotide database, GO mapping tables, and PIR protein database. The set of 32,228 contigs was compared against NCBI's nr (non-redundant) database by Blastx. Complete statistics for successful associations are shown in Supplementary Table S12. Significant matches (E-value < e−20) were identified and assigned function. Fisher's Exact Test and diacyclic graphs (DAGs) were also derived from B2G.
Tissue Distribution of High-Read-Number Contigs
RNA was isolated after harvesting eight major tissue types from mature plants (Supplementary Table S1). Eight minor tissues (Supplementary Table S2) were dissected from treated materials and placed into a ninth pool called “Medley”. A cDNA library was constructed from each pool and MID-tagged before sequencing. After assembly, contigs could be deconvoluted to indicate tissue(s) of origin. This analysis allows detection of contigs that were assembled from a specific MID pool, indicating origin from a specific tissue type. Contigs composed of the highest numbers of reads were examined for patterns of expression. The number of reads was normalized against the total read numbers for any tissue pool ensuring proper representation (Fig. 2).
Identification of Tissue-Specific Transcripts
With the reads sorted by tissue of origin, it is possible to identify contigs where the vast majority of reads derive from a single tissue source. This task has some inherent error because plant organs are connected and may be physically inseparable. For instance, root and crown read pools may harbor specific genes, yet they occur in both because crowns invariantly contain root primordial tissue, thereby potentially skewing the findings. Strawberry fruits were processed with the calyx, so reads with identity to those in the leaf pool will likely be found.
To accommodate this caveat, tissue-specific contigs were defined as contigs composed of at least 20 independent reads with at least 75% of the reads originating from the one tissue pool. These candidates were individually analyzed by Blastn and Blastx against NCBI's nr database for characterization. The general numbers of tissue-specific contigs are presented in Table 2. Annotation in Fig. 3 and 4 was devised by matching the best species match but adding the putative functional assignment indicated by the best annotated hit. All functional assignments should be viewed as putative.
|Library||Number of contigs|
Identification of Constitutive Candidate Transcripts
Constitutively expressed contig candidates were identified as those where the frequency of reads was approximately equal across all tissue types. All contigs composed of fewer than 30 reads were eliminated from consideration. The number of reads in each tissue type was normalized against the lowest number of reads, in this case leaf, to achieve equal proportions of any given contig relative to the number of reads in each pool. The normalized counts were converted to a frequency (percentage) by dividing the number of reads in each MID pool by the total number of reads. The standard deviation across tissues was then calculated. The 45 contigs exhibiting the lowest standard deviation across all tissue types (e.g., best approximating a mean of 11.11% across the nine variables or equal representation) are reported in Fig. 5. The associated contig sequences were annotated through comparison with Arabidopsis and Ricinus protein sequences by Blastx. The best match is presented along with its E-value (Fig. 6).
Metagenomic Analysis of a Cosequenced Microbial Community
Soil-borne plants harbor the coincident ecology of a laboratory greenhouse, as well as resident pests and pathogens from their field of origin. To avoid incorporating contributions from contaminating organisms into contig assemblies, 454 sequencing reads were analyzed for evidence of sources as described in Table S6. A set of contigs was clearly representative of microbial sources. Typically, these were identified by Blastn (and because they are rRNA-based do not have Blastx matches) as small regions of ribosomal RNA and like the major contigs in this experiment could be deconvoluted into tissues of origin. The frequency of reads and their tissue distribution are shown in Fig. 7. The same sequences were compared against the emerging strawberry genome sequence and the lack of similarity further evidenced sequence of a nonstrawberry origin.
Sequence and Assembly of the Cultivated Strawberry (Fragaria × ananassa) Transcriptome
Tissue-specific cDNA pools obtained from strawberry plants grown under a variety of growth conditions (Table 1; Supplementary Tables S1-S3) were tagged with separate MIDs and sequenced on the 454-FLX platform. A total of 1,179,876 reads with an average length of 204 bp (∼240.6 Mb) were obtained. The total number of reads per tissue pool is presented in Supplementary Table S4. The reads were assembled by means of two strategies. Newbler-based assembly resulted in 26,403 contigs with an average size of 390 bp and an N50 contig length of 484, while a CAP3 (Huang and Madan, 1999) based assembly resulted in 43,732 contigs with an average size of 372 bp and an N50 length of 390 bp (Supplementary Table S5). Contigs resulted from both assemblers were screened for plastid, bacterial, fungal, and insect contaminants (Supplementary Table S7).
A comparison of assembly methods shows that Newbler assembly produced a lower number of contigs and longer average lengths compared with TGICL, suggesting that Newbler was able to better collapse variations of potential alleles arising from homeologous loci in the octoploid genome. In addition, Blastn alignment of the TGICL contigs to the Newbler contigs revealed several instances where multiple TGICL assembled contigs align across the length of a single Newbler contig. When the Newbler contig collection was aligned against itself with WU-BLASTN (Materials and Methods), 1439 clusters containing two or more similar contigs were identified (Table S11) The contig clusters likely represent instances where transcripts from multiple homeologous alleles fail to collapse into a single consensus, or they may in fact represent assemblies of transcripts from paralogous loci. A complete description of how contigs and Virtual Contigs were derived is presented in Materials and Methods. A total of 32,228 contigs were defined during this process and are described in Table S8.
Gene Ontology (GO) Mapping and Annotation
Gene ontology (GO) was used to assign function and general classification to the assembled contigs. GO mapping was performed by Blast2GO, permitting annotation of the sequences based on the major Gene Ontology vocabularies. Of the collection of over 32,000 contigs, 12,728 presented high certainty BLASTX hits (E-value < e−20 Supplementary Table S13). Of these, 10,164 matched a GO record, while 2564 did not. Mapping by major biological processes is shown in Fig. 8A. Panel A illustrates the suite of contigs to be described mainly as metabolic (35.8%), cellular (31.5%), localization (7.1%), and biological regulation (7.1%). These broad categories were further broken down into 81 additional categories, where most contigs (58.5%) were predicted to have roles in metabolism. High resolution annotation of the contigs demonstrates a diversity of transcript classes represented in the remaining set, with substantial numbers of contigs encoding proteins required for redox functions (4.9%), transport (4.5%), response to stress or chemicals (3.7%), or cellular communication (1.9%).
Analysis of molecular function (Fig. 8, panel B) shows that 4217 (41.4%) of contigs were annotated with catalytic activity. Further refinement of categories indicates that these are mostly transferases (15.2% of all contigs) and hydrolases (13.8%). The 3691 contigs described as “binding” (36.3%) may be broken down predominantly into nucleotide binding (12.31%), ion binding (11.00%), nucleoside binding (9.7%), and protein binding (9.3%) proteins. A finer description of all GO categories based on molecular function vocabularies shows that ATP binding proteins are the most represented class, followed by protein binding, and zinc-ion binding. A substantial number related to stress were also present.
Associated cellular locations were also determined by Blast2GO (Fig. 8C). A deeper analysis of these data places the majority as generically “intracellular” (46.0%) or intracellular structures (26.7%), as well as intraorganellar space (22.6%). When the constituents of individual cellular contexts are totaled, a substantial number of contigs encode proteins destined for the organelle (28.0%), with a large fraction associated with membranes (15.8%). Approximately 1 to 5% is involved in protein complexes, ribonucleoprotein complexes, the organellar lumen, vesicles, endomembranes, or apoplast. A substantial set is predicted to be resident to the nucleus or chloroplast.
Tissue-Specific Patterns of Gene Ontology
The experimental design facilitated GO characterization of the contigs enriched in a given tissue type. The most exclusive transcript content was found in leaves, fruits, and crowns, with few in runner tips and flowers. The hierarchical DAG (diacyclic graph) for leaf-enriched contigs shows a bias toward transcripts associated with response to biotic stimulus (GO: 0009607, p = 2.9e−3 based on Fisher's Exact Test) and response to stress (GO: 0006950, p = 9.9e−6). Petioles and crowns showed a significant enrichment of contigs associated with carbohydrate metabolic process (GO:0005975, p = 1.4e−4 and 4.5e−7, respectively). Fruit-enriched transcripts bias toward secondary metabolic processes (GO: 0019748, p = 1.6e−5) and amino acid and derivative metabolic processes (GO: 006519, p = 1.1e−3).
Expression of High-Read-Number Contigs
RNA was isolated from eight tissue types from mature plants and from a mixture of many minor tissues dissected from treated materials (“Medley”). The 32 contigs composed of the highest numbers of reads were examined for patterns of expression, and the data are shown in Fig. 2. The data indicate high levels of expression of metabolic, structural, and regulatory genes with no particular preference toward one category or another. Clear examples of tissue enrichment are observed, such as the presence of homocysteine methyltransferase in the stolon and xyloglucan endotransglycosylase in the petiole.
Evidence of Tissue-Specific Expression
Table 2 depicts the frequency of tissue-specific transcript occurrence. Figure 3 presents the likely identity of the tissue-specific contigs. All singleton reads were removed from the table, along with the Medley pool that, as a byproduct of the experimental design, carried transcripts redundant to the major tissue pools as well as the targeted rare transcripts. On the basis of the major tissue types alone, tissue-enriched transcripts were identified in all contigs composed of >19 reads, where at least 75% are detected in a single tissue pool. Figure 3 shows that specialized tissues such as roots and flowers contain the most unique contigs. Figure 4 presents the large suite of transcripts that are fruit specific.
Constitutive Expression Candidates
To identify candidates with evidence of even expression in all tissues, the reads were sorted by tissue of origin. The reads arising from major tissue types were normalized to an equal number of reads and then converted to a percentage, taking the number of reads in each tissue and dividing by the total number of reads. The results of these analyses are presented in Fig. 5, shown here ranked by number of reads (∼expression level). Shown here are the top 45 contigs composed of at least 30 reads, corresponding to nuclear genes and featuring approximately equal read distribution among the nine pools. Among the highest relative expression is GAPDH, a transcript frequently used as a loading control for RNA gel electrophoresis. Other contigs in this set include an aspartic protease, a ribosomal protein, and a conserved hypothetical protein. The other groups are composed of fewer reads and include familiar housekeeping genes like histone H1, ubiquitin conjugating enzyme, and multiple contigs representing proteasome subunits. A set of those contigs composed of reads from all tissues (SD ≤ 0.05; 693 contigs) is presented in Table S15.
Metagenomic Analysis of the Greenhouse Strawberry Plant Community as Evidenced by Contaminating Sequences
Large-scale sequencing of the transcripts of any given plant grown ex vitro provides an opportunity to simultaneously capture sequence information of the resident microbial (and in some cases pest and arthropod) communities. Twenty-five nonstrawberry contigs were identified, representing organisms that grow in close association with strawberry. Most foreign sequences were detected in root tissue. A list of the nonstrawberry contigs is provided in Fig. 7. Four contigs, representing two different species of bdelloid rotifers [Adineta vaga (Davis) and Rotaria rotatoria (Pallas)], micro-invertebrates commonly found in fresh water and soils (Robeson et al., 2009, Fontaneto et al., 2007), were detected in root tissue and two were found at low frequency in the petiole.
Sequences resembling those of at least four strawberry pathogens [Strawberry necrotic shock virus (SNSV), Glomerella acutata Guerber & Correll, Verticillium dahliae Kleb., and Rhizopus stolonifer (Ehrenb.: Fr.) Vuill.], and one pest species [Western flower thrips, Frankliniella occidentalis (Pergande)] were found. SNSV sequences occurred at high levels in fruit. Verticillium-like sequences were found among petiole, and in small amounts among crown and root sequences. Sequences of Glomerella acutata, a teleomorph of Colletotrichum acutatum J.H. Simmonds, the causal agent of anthracnose fruit rot in strawberry, were found in fruit, whereas sequences of G. cingulata (Stoneman) Spauld. & H. Schrenk, the teleomorph of C. gloeosporioides (Penz) Penz. & Sacc. in Penz., the causal agent of anthracnose crown rot, were identified in petioles and crowns. Several distinct contigs were similar to sequences of the mycorrhizal fungus Glomus intraradices Schenck & Sm.
Other sequences identified on crowns were those resembling Metarhizium anisopliae (Metchnikoff) Sorokin (=Entomophthora anisopliae Metschn.), a naturally occurring soilborne fungus. Aspergillus nidulans Winters, another common fungus with a worldwide distribution, was found associated with petioles. In addition, a few contigs were similar to sequences of uncultured bacterium or fungus. All BLAST searches (Fig. 7) identified sequences with relatively high similarities (E-values < e−20).
There is growing interest in strawberry (genus Fragaria) as a system to study plant biology, comparative genomics, gene function, and polyploid evolution. To support these efforts 1% of a genome has been sampled through random (Pontaroli et al., 2009) and targeted (Davis et al., 2010) Sanger sequencing. The diploid genome of F. vesca is being sequenced through a community consortium and many genetic and genomics tools are being generated (Shulaev et al., 2008). The clearly missing component is comprehensive coverage of a transcriptome. Several efforts have been undertaken to remedy this, including an assessment of stress-related transcripts in developing diploid F. vesca seedlings. The current work represents the most extensive transcript assessment from various tissue types, developmental states, and treatments thus far described. The work was performed in a popular octoploid cultivar with varied lineage to generate substantial allelic variation on top of differences in the raw gene content possibly present in its reticulate genome. The goal is to produce a comprehensive set of transcripts that define a substantial representation of the messages that underlie strawberry biology. In addition, the transcriptome would be expected to contain substantial depth in allelic variation that could further inform application of genomics tools to breeding scenarios.
In these analyses, over 240 Mb of transcriptome sequence was obtained, representing 1.18 million reads, assembling into 32,228 contigs. Two independent assembly methodologies were employed, which provided an opportunity to compare and contrast assembly software for this purpose. In general, the Newbler assembler provided less contigs with larger mean lengths than a CAP3-based assembly. Fidelity of the assemblies was evaluated by BLAST aligning all Newbler contigs to each other, all CAP3 contigs to each other, and the Newbler contigs to the CAP3 contigs. Consistent with the large contig lengths produced by Newbler, this process revealed several instances where two or more CAP3-based contigs tiled across the larger Newbler assembled contigs. Special attention was paid to the potentially challenging aspects of assembling an octoploid organism's transcriptome. It is possible that transcripts may arise from parologous (within one subgenome) or homeologous (between subgenomes) loci. It is important to differentiate between these cases such that homeologs could be assembled into a common “virtual contig” that could be used in annotation and expression estimates independently from parologous sequences that constitute separate genes. The BLAST analysis described above identified pairs or groups of assemblies that shared significant sequence identity. In some cases, these represent failure to assemble allelic variants of homeologous loci. However, it is also possible that some of these cases represent assemblies of distinct loci. The assemblies were aligned to the draft genome sequence of the diploid strawberry, F. vesca, by means of GMAP. The GMAP step allowed the subtle sequence differences resident to the contigs to associate the contigs with paralogs in the genome, if present. The extra step provides improved allele accounting and identification of highly similar sequences that may arise from separate loci in a single subgenome, leading to more accurate resources for downstream analyses. In addition, mapping the Newbler and CAP3-based assemblies to the F. vesca genome enabled the identification of transcript assemblies from loci that were produced by one assembler, but not by the other. Since our analysis suggested that the Newbler assembly was of a higher quality, we chose to use the Newbler assemblies for further analysis, but we supplemented this set with CAP3 contig assemblies that represented loci missed during the Newbler assembly. A total of 7073 CAP3-based assemblies were added to the 25,155 Newbler assemblies representing unique loci, which produced a set of 32,228 transcript assemblies and represent distinct strawberry genes.
Comparison to Apple and Peach
In comparison with other species in the Rosaceae, 14,869 (46%) of the strawberry 454 transcript assemblies matched previously characterized EST assemblies (Blastn) from apple (Malus domestica Borkh.) and peach [Prunus persica (L.) Batsch] with p-values of e−20 or better (Table S12). This low percentage may reflect the low coverage depth in many of our 454 contigs or significant divergence among the species. The fact that the remaining 54% of our assemblies do not hit the available Rosaceae sequences suggests that some of the transcript assemblies represent novel strawberry sequences. The proportion of nonunique contigs may be due to paralogous sequences represented within the strawberry transcript assemblies or to nonoverlapping assemblies of strawberry sequence from the same cDNA template, since the “shotgun” nature of 454 sequencing enables simultaneous sampling of discrete template regions.
Annotation of the contigs provided cursory assignment of function. Fundamental accounting of transcript homologs in this study indicates that 12,728 (Supplementary Fig. S13) of the 32,228 contigs find homology with the nr protein database (E-value < e−20). At first glance, this number appears low, but it can be explained with little speculation. With the conservative cutoff of e−20, many sequences are excluded from further analysis. It was previously noted that the Rosaceae have a disproportionate number of transcripts without matches. In addition, the mean length of our assemblies is relatively short, mostly as a function of the breadth of tissues and treatments surveyed. It is possible that many are too short to align with confidence or represent 3′ UTR regions of transcripts, which are not well conserved between species.
Figure 6 shows that the majority of the strong matches find homology to Vitis vinifera L. sequences followed by Arabidopsis thaliana (L.) Heynh. and Populus trichocarpa Torr. & A. Gray. These three are not surprising because all three species maintain massive transcriptome coverage and associated protein database sequence. What is surprising is that a greater number of translated transcripts that exceed the threshold are identified in Physcomitrella patens (Hedw.) Bruch & Schimp. (approximately 5000) than in Malus × domestica (approximately 1000). At first this difference is not intuitive, since both Malus and Physcomitrella share similar EST representation. However, examination of the databases shows that the latter organism simply maintains significantly more translated sequence based on a greater amount of genomic sequence. These analyses describe that among organisms with substantial protein predictions in GenBank, translated strawberry sequences find the most similar sequences in grape. The relatively low representation of strawberry protein sequences in the database explains why strawberry shows such low agreement with the translated EST set.
Examination of the identity of specific transcripts shows that the greatest homology does not occur within Fragaria but typically to Prunus, Vitis, Ricinis, Poplar, or without a significant match. These analyses illustrate the value of the deep sequencing effort over these tissues of ranging treatments. Clearly, transcripts not previously reported for Fragaria have been identified. The presence of so many unknown genes that are indeed represented by genomic sequence indicates that our data set encompasses many new transcripts that are strawberry specific. The transcriptome sequencing effort therefore expands the depth of both the strawberry and the eukaryotic unigene set.
Gene Ontology Mapping and Annotation
Predictions of contig function were made by means of the Blast2GO program. The results are consistent with the diverse tissues and treatments supplied, as annotation presented a wide set of putative gene functions. The graphs in Fig. 8 illustrate the broad categories, and further breakdown of these data into higher resolution parcels describes that most contigs (58.5%) encode proteins germane to metabolism. Examination of Blast2GO level 3 vocabularies indicates that a substantial set of cellular functions are described, yet are all represented by small numbers of contigs. This is expected, since fine regulators may be expected to be expressed at low levels.
Additional analysis was made possible by examining how inferred GO function patterns shifted between MID-tagged tissue pools. For instance, it was shown that leaves exhibited significant enrichment for transcripts associated with response to environmental or biotic stress. Crowns and petioles presented a bias toward contigs associated with carbohydrate metabolism suggesting a possible role in controlling assimilate partitioning between leaves and sink tissues. Fruits offered only significant elevations in contigs associated with amino acid metabolism or secondary metabolite production, consistent with their role in generating a plethora of compounds that define the fruit.
Highly Expressed Transcripts
Contigs composed of a high number of reads were examined for tissue distribution and overall relationships with one another. The data presented in Fig. 2 indicate that the most prevalent nuclear transcripts from strawberry arise from all tissue types and do not present conspicuous groupings or patterns. Genes associated with amino acid synthesis and transport form one potential category. Several contigs corresponding to glutathione-S-transferases were detected as were regulatory proteases. The abundant number of reads used in defining these contigs suggests a potential for high-activity promoters. This hypothesis can be tested and will ultimately lead to the discovery of regulatory sequences that may be useful in the development of cis-genic plants.
The data provide a glimpse of gene expression patterns, especially for abundant transcripts. The MID-tagging of tissue populations and subsequent sequencing provides a means to estimate the expression of any gene across major organ types as well as identify potential instances of organ-specific gene expression. Care should be exercised in the interpretations because of the sweeping array of treatments and conditions used to generate each tissue population. Genes normally not expressed in the tissue may be detectable because of the stringency and scope of treatments that may induce atypical patterns. The same caveat provides great confidence in the cases where tissue-specific expression has been noted (Table 3 and Fig. 3 and 4). At the very least, the data indicate that the transcript is present or not detected in a given tissue type.
The assessment of tissue specificity provides information that can be used to design the next generation of tissue-specific tools that may enhance applications to biotechnology. Specialized tissues like a fruit or flower would be expected to contain a number of unique transcripts, which certainly is the case for fruits. Figure 4 shows that fruits contain (conservatively) 48 specific transcripts. Abundant transcripts enriched in crowns, roots, petioles, and stolons were also detected. With regard to the other MID pools, it is quite notable when a contig contains almost exclusive reads (>95%) from a single MID pool, since the wide array of treatments and crude dissections (for instance crowns and petioles or crowns and roots may share some common tissues) would suggest it unlikely that there would be such hard delineation. Figure 3 shows that high-abundance root, crown, runner-tip, and stolon-specific transcripts can be resolved. Of particular note, for instance, are contigs that are present in stolons but not petioles (contig02965). The contig encodes a SAUR-like protein. SAUR (small auxin up-regulated) transcripts increase rapidly on auxin treatment, and while implicated in auxin response, their exact functions remain elusive. It is interesting to speculate that the SAUR protein identified here might change the sensitivity or response of the stolon to an auxin signal altering its elongation. On the other hand, other contigs are enriched in both stolon and petioles. One such case, contig00627 encodes a likely glucosyltransferase. Because of its limited expression to petioles and stolons, it could possibly be central to material transport.
Many other potentially significant examples of tissue-specific expression were conservatively discounted from this table because they were detected in two MID populations, such as fruit and flower (VIRTUAL_contig23187_contig24197; 50 reads in fruit, 10 in flower). The collection of materials certainly obscured the interface of senescent flowers and early fruit, possibly contributing to this situation. Additionally, runner tips certainly would be expected to contain some overlap with all groups because they were harvested from new tips to newly established plantlets. This instance has been identified in some assembled transcripts (contig19415, 27 reads from runner tip and 4 from crown). These findings underscore the potential importance and utility of the transcripts that truly are composed from one MID population.
The importance of the Medley population is illustrated with analysis of contig10695, contig16545, and contig22041, three metallothionein-like contigs. These contigs are expressed exclusively in roots when major tissue pools are considered. However, almost three times the number of reads is found in the Medley pool. These findings suggest that these transcripts are enriched, if not exclusively expressed in the root tips, since they are a component of both the Medley and root pools.
A large number of transcripts are specific to the Medley population. The large number of contigs assembled from this specific pool validates its utility, since these transcripts would have otherwise not been likely identified in a simple breakdown of major tissues. While adding richness to the diversity of the EST set, the information is not as meaningful because of the diverse tissues in the Medley population. For instance, an abundant transcript from etiolated seedlings or imbibed achenes would likely be found in this group, yet the means is lacking to assign unique specificity.
Forty-eight contigs are composed of reads primarily originating from fruit (at least 75% of the reads were expressed in fruit). Their occurrence coincides with known genes contributing to fruit flavor, firmness, and nutriceutical compounds is important, because it implies potentially useful roles for unknowns unveiled in this work. Genes with demonstrated roles in flavor production such as the strawberry alcohol acyl transferase (Aharoni et al., 2000; Souleyre et al., 2005), quinone oxido-reductase (Raab et al., 2006), and O-methyl transferase (Lavid et al., 2002) are represented by contigs assembled from almost exclusively fruit-originating reads. Other genes with known roles in fruit softening, such as the expansins (contig18993, contig20399) and pectate lyase (see Fig. 4) are also present. Specific alleles have been detected, providing targets for gene variants that may contribute disproportionately to the softening process in mature fruits. Two distinct flavonol synthases were detected. Flavonol synthases are enzymes central to the production of flavonoids in strawberry (Halbwirth et al., 2006). Malonyl CoA decarboxlyase was also found enriched in fruit tissue. These results agree well with microarray studies in fruits, where these same transcripts were also detected (Aharoni and O'Connell, 2002).
The trends revealed in this study agree well with previous reports in that they identify contigs specifically in fruits that have demonstrated roles in fruits, primarily in softening, pigmentation, and flavor. This foundation suggests that examination of the hypothetical proteins (contig22709, contig22860), the proline rich protein (contig23596, contig19079), or the plethora of apparently strawberry-specific contigs with no significant similarity to other genes (Fig. 4) may provide a means to better understand, if not manipulate, fruit qualities.
Fruit-derived contigs contain transcripts that can occur in the fruit, in achenes, or both. Contigs that were fruit-enriched and also found in the Medley pool are candidates for further study of the possibility that they are achene-specific or enriched. Ten such fruit-enriched contigs were also detected in the Medley pool (Fig. 4) and warrant further investigation.
Examination of the fruit-enriched transcripts also illuminates some fundamental issues with sequence assembly, footnoted in Fig. 4). As mentioned previously, the strawberry acyl transferase has a central role in strawberry flavor (Aharoni et al., 2000; Souleyre et al., 2005). In the present analysis, three contigs were identified with the designation of alcohol acyl transferase or AAT and must represent errors in assembly or gene variants. Individual analysis shows that two of the contigs most related to strawberry (contigs 21030 and 24343) actually correspond to two different regions of the actual strawberry AAT transcript. This is an instance of incomplete assembly, because the reads that would span the two (that are literally separated by 30 bases) are not present or were potentially absorbed by assembly of similar contigs. Such a contig may be the other AAT (contig16032). This contig also matches an AAT gene, but from Malus, a sequence with significant sequence divergence yet clearly belonging to the transferase superfamily of proteins. Similar results were observed with expansins, endoglucanase, and proline-rich proteins, as footnoted in Fig. 4.
Constitutive Contig Candidates
One of the useful applications of the MID-pooling strategy is to identify transcripts that are present in all tissue types. Figure 5 presents 43 of these candidates. In these cases, the reads comprising the contig showed an approximately equal distribution across all tissue types, suggesting constitutive expression. The list includes a number of sequences that have been previously described as useful housekeeping genes for normalization. GAPDH, histone subunits, specific ribosomal proteins, the proteasome subunits, and ubiquitination-related proteins (Ub-conjugating enzyme, Atpob1, and SKP1) were among the constitutive candidates. Ub-conjugating enzyme and the proteasome subunits are specific members of apparent multigene families, many of which showed nearly equal reads across tissue types. It is interesting to note that while the reads related to ubiqutin-mediated proteolysis are present across tissues in approximately equal amounts, ubiquitin (or polyubiquitin) itself is not detected among the top candidates. This transcript is a frequent standard, and various alleles do appear frequently throughout the top 200 constitutive candidates that are highly expressed (not shown). This finding suggests that the suite of identified contigs may be exceptional standards for downstream normalization.
Grouping by normalized read number also allows for an approximation of relative expression level, since those composed of high numbers of reads likely arise from higher steady state transcript levels than those represented by fewer reads. Of course, the estimate of frequency from sequencing may not be an accurate reflection of actual steady state transcript levels, so results absolutely require further experimental validation. A subset of these candidates is currently being tested across tissue types by both hybridization and quantitative PCR. The goal is to generate a set of strawberry resources that could provide improved standards for expression normalization. The putative constitutive promoters of these genes may be useful in cis-genic engineering of strawberry plants. Again, the clear caveats are that transcript accumulation is not necessarily related to transcription rate, so these candidates strictly serve as starting points that will seed further analyses.
Metagenomic Analyses of Microbial Strawberry Communities
When whole soil-borne plants are analyzed by high throughput sequencing, a complement of transcripts inevitably detects nonstrawberry transcripts indicative of potential pests, pathogens, or other cohabitating organisms. This information provided an unbiased accounting of the organisms that physically interact with the plant. More importantly, the MID-tagging approach provided evidence of where the organisms are present.
The locations where sequences from strawberry pathogens and pests were found are generally consistent with the biology of the strawberry–nonstrawberry organism interactions (Table 7). SNSV infects strawberry and Rubus species (Converse, 1972; Frazier, 1966). The infection is often symptomless but can cause significant reductions in yield (Johnson et al., 1984). Sequences similar to SNSV were found at very high levels in the fruit, even though fruit did not show symptoms before RNA extraction. Interestingly, the evidence of the viral transcriptome is not found in the Medley pool, a pool that is rich in achene-derived RNA. These data indicate that the SNSV must be confined to the strawberry flesh. Screening for SNSV by polymerase chain reaction on strawberry fruit would be advisable, even on symptomless plants, since this evaluation clearly indicates latent infection. SNSV is typically transmitted through seed, pollen, and thrips.
Western flower thrips are found in Florida (Osekre et al., 2009b) and attracted to flowers (Osekre et al., 2009a), hence their tissue specificity in this study. Contigs similar to the arbuscular mycorrhiza G. intraradices were found mainly associated with root tissue. The strawberry pathogen Colletotrichum acutatum (Glomerella acutata), the causal agent of one the most important fruit rot diseases in Florida, anthracnose fruit rot, was found in the fruit, whereas Colletotrichum gloeosporioides (Glomerella cingulata), the causal agent of Colletotrichum crown rot, was found on petiole and crown. Verticillium was detected in the petioles, consistent with its pathogenic location. It is also interesting to note the common pathogens that were not found. There was no evidence of any pathogens on leaves and no evidence of powdery mildew [Sphaerotheca macularis (Ehrh.) Magnus], a pathogenic fungus commonly found on strawberry. Since bacterial sequences were screened out before the analysis, bacterial pathogens such as the causal agent of angular leaf spot (Xanthomonas fragariae Kennedy & King) were not annotated.
Screening pathogens present on strawberry plants at a glance could be of great value, especially to strawberry nurseries which need to provide disease-free certified plants to production growers. In addition, by knowing in advance what pathogens are infecting the plants even before symptoms are present can help growers to better target their disease control measures. This type of sequenced-based analysis is likely the method that will be used in the future, since it identifies new organisms before they become problematic. This type of approach could have profound dividends in food security issues.
The anticipated genome sequence of diploid strawberry will provide a new resource for studies in Fragaria and the Rosaceae in general. In addition, it will be necessary to have abundant archives of expressed transcripts to guide downstream efforts and better understand how various paralogous and homeologous genes are active in commercial materials. Despite the copious information present herein, there are limitations to this study. Ideally, additional sequencing coverage would provide deeper coverage and almost certainly better assembly of rare transcripts. However, the present work provides a novel set of data that will complement the genome sequence and permit valuable comparisons to the increasing amount of diploid EST information.