The Plant Genome - Original Research

Comparative Analysis and Functional Annotation of a Large Expressed Sequence Tag Collection of Apple


This article in TPG

  1. Vol. 2 No. 1, p. 23-38
    OPEN ACCESS
    Received: Nov 10, 2008
    Accepted: Feb 13, 2009

  1. Ksenija Gasic,
  2. Delkin O. Gonzalez,
  3. Jyothi Thimmapuram,
  4. Lei Liu,
  5. Mickael Malnoy,
  6. George Gong,
  7. Yuepeng Han,
  8. Lila O. Vodkin,
  9. Herb S. Aldwinckle,
  10. Natalie J. Carroll,
  11. Kathryn S. Orvis,
  12. Peter Goldsbrough,
  13. Sandra Clifton,
  14. Deana Pape,
  15. Lucinda Fulton,
  16. John Martin,
  17. Brenda Theising,
  18. Michael E. Wisniewski,
  19. Gennaro Fazio,
  20. Frank A. Feltus and
  21. Schuyler S. Korban 
  1. K. Gasic, Y. Han, and S.S. Korban, Dep. of Natural Resources and Environmental Sciences, Univ. of Illinois, 310 E.R. Madigan Bldg., 1201 W. Gregory Dr., Urbana, IL 61801; D.O. Gonzalez and L.O. Vodkin, Dep. of Crop Sciences, Univ. of Illinois, 384 E.R. Madigan Bldg., 1201 W. Gregory Dr., Urbana, IL 61801; J. Thimmapuram, L. Liu, and G. Gong, Biotechnology W.M. Keck Center, Univ. of Illinois, 354 E.R. Madigan Bldg., 1201 W. Gregory Dr., Urbana, IL 61801; M. Malnoy and H.S. Aldwinckle, Dep. of Plant Pathology, New York State Agricultural Experiment Station, Cornell Univ., 630 W. North St., Geneva, NY 14456; N.J. Carroll and K.S. Orvis, Dep. of Youth Development and Agricultural Education, Purdue Univ., West Lafayette, IN 47907; P. Goldsbrough, Dep. of Botany and Plant Pathology, Purdue Univ., West Lafayette, IN 47907; S. Clifton, D. Pape, L. Fulton, J. Martin, and B. Theising, Genome Sequencing Center, Washington Univ., School of Medicine, 4444 Forest Park Blvd., St. Louis, MO 63108; M.E. Wisniewski, USDA-ARS, Appalachian Fruit Research Station, 2217 Wiltshire Rd., Kearneysville, WV 25430; G. Fazio, USDA-ARS, Plant Genetic Resources Unit, 630 W. North St., Geneva, NY 14456; F.A. Feltus, Dep. of Genetics and Biochemistry, Clemson Univ., 302-C Biosystems Research Complex, 51 New Cherry Rd., Clemson, SC 29634. K. Gasic, present address: Dep. of Horticulture, Clemson Univ., 112BRC, 51 New Cherry Rd., Clemson, SC, 29634. M. Malnoy, present address: IASMA Research Center, Via E. Mach 1, San Michele all'Adige (TN), 38010, Italy. This work was funded by National Science Foundation Grant No. DBI-03-21701.


A total of 34 apple (Malus × domestica Borkh.) cDNA libraries were constructed from root, leaf, bud, shoot, flower, and fruit tissues, at various developmental stages and/or under biotic or abiotic stress conditions, and of several genotypes. From these libraries, 190,425 clones were partially sequenced from the 5′ end and 42,619 clones were sequenced from the 3′ end, and a total of 182,241 high-quality expressed sequence tags (ESTs) were obtained. These coalesced into 23,442 tentative contigs and 9843 singletons, for a total of 33,285 apple unigenes. Functional annotation of this unigene set revealed an even distribution of apple sequences among the three main gene ontology categories. Of ∼33,000 apple unigenes, 8437 (25%) had no detectable homologs (E >0.1) in the Arabidopsis genome. When the entire apple unigene set was compared with the entire citrus [Citrus sinensis (L.) Osbeck] unigene set and the poplar (Populus trichocarpa Torr. & Gray) predicted proteome, both members of the core eudicot and rosids clade, 13,521 apple unigenes matched one or more sequences in citrus, while 25,817 had counterparts in the poplar protein database. Apple–Arabidopsis–citrus–poplar comparisons revealed closer evolutionary relationships between apple and poplar than with the other two species. Genes involved in basic metabolic pathways appear to be largely conserved among apple, citrus, poplar, and Arabidopsis.


    BAC, bacterial artificial chromosome; EST, expressed sequence tag; GO, gene ontology; LRR, leucine-rich repeat; MIPS, Munich Information Center for Protein Sequences; NCBI, National Center for Biotechnology Information; nr, nonredundant; PCR, polymerase chain reaction; SNP, single nucleotide polymorphism; SSR, simple sequence repeat; ToL, Tree of Life

Apple (Malus × domestica Borkh.) is the most important deciduous tree fruit crop grown in the United States and around the world (Morgan and Richards, 1993). Apple production figures in the United States have surpassed 5.6 million t at an estimated value of over $1.7 billion. The apple is consumed fresh, and in multiple processed and cooked forms including juice, sauce, and canned fruit cocktails, among various other uses. The importance of apple in a balanced human diet is well known, and pertains to its fiber, vitamins, and antioxidant contents (Boyer and Liu, 2004).

The genus Malus belongs to the family Rosaceae. This family includes several important genera that account for most of our important deciduous fruit crops including apple (Malus), pear (Pyrus), and stone fruits (Prunus) such as peach [Prunus persica (L.) Batsch], cherry [Prunus avium (L.) L.], plum (Prunus domestica L.), apricot (Prunus armeniaca L.), almond [Prunus dulcis (Mill.) D.A. Webb], as well as other valuable ornamental plants including rose (Rosa), medlar (Mespilus), and hawthorn (Crataegus), among others (Challice, 1974). Among these various genera, Malus serves as the most commercially valuable.

Most cultivated apples are diploids (2n = 34), self-incompatible, and display juvenile periods of 6 to 10 yr or more. The apple has a relatively small genome, 1.54 pg DNA/2C or 750 Mb per haploid genome, which is similar to that of the sorghum [Sorghum bicolor (L.) Moench] genome and about the same size as the tomato (Solanum lycopersicum L.) genome (Arumuganathan and Earle, 1991; Tatum et al., 2005). Molecular mapping studies of the apple have been underway with over 1200 isozymes, random amplified polymorphic DNAs, restriction fragment length polymorphisms, amplified fragment length polymorphisms, simple sequence repeats (SSRs), expressed sequence tag (EST)–SSRs, and single nucleotide polymorphisms (SNPs) mapped to different linkage groups (Maliepaard et al., 1998; Liebhard et al., 2003; Naik et al., 2006; Chagné et al., 2008; Celton et al., 2009). At least three bacterial artificial chromosome (BAC) libraries have been constructed for the apple from different genotypes (Vinatzer et al., 2001; Xu et al., 2001, 2002). These BAC libraries have been successfully used for cloning genes of interest (Xu and Korban, 2002; Han et al., 2007). Several apple cultivars have been transformed via Agrobacterium-mediated transformation (James et al., 1989; Yao et al., 1995), and promising transgenic lines with enhanced resistance to important diseases such as apple scab [Venturia inaequalis (Cke.) Wint.] and fire blight [Erwinia amylovora (Burrill) Winslow] have been developed (Aldwinckle et al., 2003; Belfanti et al., 2004; Malnoy et al., 2008). Thus, the apple serves as an ideal model system for all members of the Rosaceae family and is primed to benefit from research efforts in functional genomics.

Large-scale single-pass sequencing of cDNA clones, randomly picked from cDNA libraries, is a very powerful approach for gene discovery and provides an overview of transcriptional activities within tissues (Adams et al., 1993). Expressed sequence tags are the most widely sequenced nucleotide commodities from plant genomes, as they provide robust sequence resources that can be exploited for gene discovery, genome annotation, and comparative genomics (Arabidopsis Genome Initiative, 2000). Since the completed sequencing of the Arabidopsis genome (Arabidopsis Genome Initiative, 2000), widespread major genomic efforts have been underway for various other plant species (Shoemaker et al., 2002; Van der Hoeven et al., 2002; Forment et al., 2005; Horn et al., 2005; International Rice Genome Project, 2005; Moser et al., 2005; Newcomb et al., 2006; Tuskan et al., 2006). The number of publicly available ESTs from plant species is growing, and many of the sequencing projects are focused on crop species. Recent reports describe EST sequencing of fruit species, such as tomato (Van der Hoeven et al., 2002), citrus (Forment et al., 2005), peach (Horn et al., 2005), and apple (Newcomb et al., 2006), as well as the use of EST data for functional (Sterky et al., 2004; Park et al., 2006) and comparative (Fulton et al., 2002; Albert et al., 2005; Brenner et al., 2005) genomics studies. Additionally, EST data have demonstrated their usefulness in enhancing the utility of genetic maps (Naik et al., 2006; Chagné et al., 2008; Shulaev et al., 2008).

In this study, we report on collection and analysis of 182,241 high-quality apple ESTs from different tissues, under different conditions, and from different genotypes. This has vastly expanded on the apple EST database previously reported by Newcomb et al. (2006) with additional tissues, treatments, and genotypes. We have used computational comparisons against Arabidopsis genomic sequences to functionally annotate apple sequences. Comparisons of apple to Arabidopsis and to other tree species, such as citrus and poplar, have revealed a set of genes most likely associated with tree formation. Moreover, this has provided a global overview of the extent to which genes have diverged since apple, Arabidopsis, citrus, and poplar have undergone divergence from their last common ancestor.

Materials and Methods

cDNA libraries were constructed from different tissues, both vegetative and reproductive tissues, and under different biotic and abiotic stresses, of nine apple cultivars, including GoldRush, Royal Gala, Fuji, Braeburn, Suncrisp, Granny Smith, Red Delicious, Jonagold, and Wijcik; three apple rootstocks, including M.9, M.111, and Geneva 3041; and one interspecific hybrid, M. × domestica cv. Geneva 3041 × M. sieversii (Ledeb.) M. Roem. (Table 1). Tissues used for library construction were collected from trees grown at the University of Illinois, Urbana, or from greenhouse-grown plants subjected to various biotic (Cornell University and USDA-ARS, Geneva, NY) or abiotic (USDA-ARS, Kearneysville, WV) stresses.

Table 1.

Apple cDNA libraries.

Library code | Source | Library strategy | Apple cultivar
Mdas | Leaf tissue challenged with Venturia inaequalis | Primary | M. × domestica cv. GoldRush
Mdbd | Mixed-bud stages | Normalized | M. × domestica cv. GoldRush
Mdfb | Leaf tissue challenged with Erwinia amylovora | Primary | M. × domestica cv. Red Delicious
Mdfr | Fruit – 9 DAP | Primary | M. × domestica cv. GoldRush
Mdfbg | Leaf tissue challenged with Erwinia amylovora | Primary | Apple rootstock Geneva 3041
Mdfrb | Fruit – 36 DAP | Primary | M. × domestica cv. Braeburn
Mdfrf | Fruit – 36 DAP | Primary | M. × domestica cv. Fuji
Mdfrg | Fruit – 36 DAP | Primary | M. × domestica cv. Granny Smith
Mdfrj | Fruit – 36 DAP | Primary | M. × domestica cv. Jonagold
Mdfrs | Fruit – 36 DAP | Primary | M. × domestica cv. Suncrisp
Mdfrt | Mixed-fruit stages | Normalized | M. × domestica cv. GoldRush
Mdfw | Mixed-floral stages | Normalized | M. × domestica cv. GoldRush
Mdfwb | Flower balloon stage | Primary | M. × domestica cv. Braeburn
Mdfwf | Flower balloon stage | Primary | M. × domestica cv. Fuji
Mdfwg | Flower balloon stage | Primary | M. × domestica cv. Granny Smith
Mdfwj | Flower balloon stage | Primary | M. × domestica cv. Jonagold
Mdfws | Flower balloon stage | Primary | M. × domestica cv. Suncrisp
Mdlr | Leaf challenged with leaf roller insect | Primary | M. × domestica cv. GoldRush
Mdltb | Bud tissue exposed to low temperature | Primary | M. × domestica cv. Royal Gala
Mdltl | Leaf tissue exposed to low temperature | Primary | M. × domestica cv. Royal Gala
Mdltx | Xylem exposed to low temperature | Primary | M. × domestica cv. Royal Gala
Mdlv | Leaf – Stage I | Primary | M. × domestica cv. GoldRush
Mdlv2 | Leaf – Stage II | Primary | M. × domestica cv. GoldRush
Mdlv3 | Leaf – Stage III | Primary | M. × domestica cv. GoldRush
Mdlv4 | Leaf – Stage IV | Primary | M. × domestica cv. GoldRush
Mdrta | Root tissue | Primary | Apple rootstock M.9
Mdrtb | Root tissue | Primary | Apple rootstock M.111
Mdrtc | Root tissue | Primary | Apple rootstock Geneva 3041
Mdrtp | Root tissue challenged with Phytophthora cactorum | Primary | M. sieversii × Geneva 3041
Mdst | Mixed-shoot stages | Normalized | M. × domestica cv. GoldRush
Mdstw | Actively growing shoot | Primary | M. × domestica cv. Wijcik
Mdwdb | Bud tissue exposed to water deficit | Primary | M. × domestica cv. Royal Gala
Mdwdl | Leaf tissue exposed to water deficit | Primary | M. × domestica cv. Royal Gala
Mdwdr | Root tissue exposed to water deficit | Primary | M. × domestica cv. Royal Gala
Primary library construction using Approach 2—see Materials and Methods.
Normalized cDNA libraries were constructed from several developmental stages: bud—three stages (dormant terminal and lateral, and active lateral); flower—four stages (bud—pink stage; balloon—full pink stage; full bloom and petal fall—after pollination); fruit—six stages (young fruitlets 9, 16, and 44 d after pollination [DAP]; maturing fruit 104 and 145 DAP; and ripe fruit 166 DAP); shoot—three stages (dormant, active, and actively growing).
§Days after pollination.
Primary library construction using Approach 1—see Materials and Methods.

Synthesis of cDNA Libraries

Apple primary cDNA libraries were constructed using two different approaches, and normalization was conducted for four of these libraries.

For the first approach, used for constructing 23 libraries (Table 1), the following steps were used. Total RNA was extracted from frozen tissues using a modified cetyltrimethyl ammonium bromide method (Gasic et al., 2005). Poly(A)+ mRNA was isolated twice from total RNA from each stage using the Oligotex Direct mRNA kit (Qiagen, Valencia, CA). mRNA was reverse-transcribed into double-stranded cDNA using a modified oligo18(dT) primer with an identifying tag sequence (Table 2). For those four libraries that were normalized, cDNAs from different stages were pooled in equal amounts before adaptor ligation. cDNA libraries were constructed following procedures described in Bonaldo et al. (1996). Double-stranded cDNAs were size-selected to enrich for molecules >500 bp, EcoRI adapters (Promega, Madison, WI) were ligated at both ends, and the cDNAs were then digested with NotI. The cDNAs were then directionally cloned into EcoRI (5′)–NotI (3′) digested pBluescript II SK(+) phagemid vector (Stratagene, Cedar Creek, TX), and electroporated into ElectroMax DH10B cells (Invitrogen Life Technologies, Carlsbad, CA) to generate the primary library. Four libraries were further normalized following the procedure described by Soares et al. (1994). Purified plasmid DNA from the primary library was converted to single-stranded circles and used as a template for polymerase chain reaction (PCR) amplification using the T7 and T3 priming sites flanking cloned cDNA inserts. Purified PCR products, representing the entire cloned cDNA population, were used as drivers for normalization. Hybridization between a single-stranded library and PCR products was performed for 44 h at 30°C. Unhybridized single-stranded DNA circles were hydroxyapatite-purified from hybridized DNA rendered partially double-stranded, converted to double-stranded DNA, and electroporated into ElectroMax DH10B cells (Invitrogen) to generate the normalized library.

Table 2.

Modified oligo18(dT) primers with identifying tag sequence.

Tag sequence | Tag identification from 5′ end | Tag identification from 3′ end
A | Insert 18(A)TCGTG | CACGA18(T) insert
B | Insert 18(A)TGCTG | CAGCA18(T) insert
I | Insert 18(A)TCGGT | ACCGA18(T) insert
J | Insert 18(A)TGCGA | TCGCA18(T) insert
K | Insert 18(A)TCGGA | TCCGA18(T) insert
H | Insert 18(A)TGCGT | ACGCA18(T) insert
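
The tag layout in Table 2 suggests how a read from a pooled, normalized library could be traced back to its tag. The following Python sketch is illustrative only and is not the authors' pipeline. It assumes that the tag sits directly next to an 18-base poly(A) stretch in a 5′ read, or next to an 18-base poly(T) stretch (as its reverse complement) in a 3′ read, exactly as written in the table. The function names `reverse_complement` and `identify_tag` are hypothetical.

```python
# Tag sequences as listed in Table 2, in 5'-read orientation (after the poly(A) stretch).
TAGS = {"A": "TCGTG", "B": "TGCTG", "I": "TCGGT", "J": "TGCGA", "K": "TCGGA", "H": "TGCGT"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def identify_tag(read: str, poly_len: int = 18):
    """Return the Table 2 tag label found in a read, or None if no tag is present.

    Assumption: the tag is searched for as an exact match, with no tolerance
    for sequencing errors or for poly(A) stretches shorter than poly_len.
    """
    read = read.upper()
    for label, tag in TAGS.items():
        five_prime = "A" * poly_len + tag                        # e.g. 18(A)TCGTG
        three_prime = reverse_complement(tag) + "T" * poly_len   # e.g. CACGA18(T)
        if five_prime in read or three_prime in read:
            return label
    return None
```

Note that the 3′ column of Table 2 is simply the reverse complement of the 5′ column, which is why a single tag dictionary is enough for both read directions.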

A second approach was used for constructing an additional 11 primary cDNA libraries (Table 1). Total RNA was extracted from freeze-dried tissue using either a standard phenol–chloroform extraction method (leaf tissue stage 1) or a modified method (all other tissues) as described by Wang and Vodkin (1994). Poly(A)+ mRNA (4–6 μg) was isolated from 1.0 mg of total RNA using the PolyATtract mRNA Isolation System III (Promega).

Libraries from leaf tissue stage 1 and fruit tissue 9 DAP were constructed using the following methodology: complementary DNA was synthesized from mRNA using an anchored Poly (dT) sequence with a NotI restriction site. SalI linker adapters were ligated to blunt-ended cDNA fragments followed by restriction with NotI. The cDNA fragments were directionally cloned into the NotI–SalI restriction site of the pSPORT 1 vector (Invitrogen). Libraries from leaf tissue stages 2 and 3 were prepared by synthesizing complementary DNA using a hybrid oligo (dT) linker-primer containing an XhoI restriction site. EcoRI adapters were ligated to blunt-ended cDNA fragments, followed by restriction with XhoI. cDNA inserts were protected from XhoI digestion via methylation during the first-strand cDNA synthesis. cDNA fragments were directionally cloned into the EcoRI–XhoI restriction site of the pBluescript II SK(+) XR vector (Stratagene). Ligated cDNA fragments from the two different methodologies were transformed into Escherichia coli ElectroMax DH10B host cells.

Clone Preparation and Sequencing

To confirm presence and size of inserts in both primary and normalized libraries, 192 white colonies per library were randomly hand-picked and grown overnight in 200 μL of YT media supplemented with ampicillin and glycerol in two 96-well plates. A PCR reaction was performed with M13 universal forward and reverse primers to amplify cloned inserts. Agarose gel analysis (1%) of PCR products confirmed presence and average size of cloned fragments (Table 3). Plates were submitted to the W.M. Keck Sequencing Center (University of Illinois at Urbana-Champaign) to verify the quality of cloned fragments by sequencing.

Table 3.

Sequence assembly results.

Sequence assembly No. of sequences Avg. length
Total sequences 190,425
Total high-quality sequences 182,241
ESTs in contigs 172,398 ND
Total no. of contigs 23,442 865
Singleton ESTs 9,843 441
Total no. of apple unique sequences (unigenes) 33,285
Number of assembled sequences matching known genes 26,333 ND
Number of sequences specific to apple 6,952 ND
Avg. insert size 1500
Avg. sequence size 465
Clean length.
ESTs, expressed sequence tags.
§Not determined.

All cDNA libraries were then plated, and individual colonies were picked robotically and assigned unique identifiers. Glycerol stocks of cDNA clones for 5′ end-sequencing were sent in 384-well format to the Genome Sequencing Center (Washington University, St. Louis, MO). Clones were then transferred into 96-well blocks and incubated at 37°C for 24 h while shaking at 25 × g (297 rpm) in an incubator shaker. Clones were processed according to Marra et al. (1999) using a high-throughput 96-well microwave protocol. Dideoxy terminator sequencing reactions were conducted as described by Hillier et al. (2006).

Contig Assembly

Contig assembly was done on “clean” sequences having minimum lengths of 100 nucleotides and minimum quality scores of 20, following vector trimming and discarding of low-quality sequences. The final clean sequences were used for clustering and assembly using Paracel Transcript Assembler (Paracel, Inc., Pasadena, CA). Contaminant sequences like E. coli, mitochondrial, chloroplast, cloning vector, and RNA were filtered during the cleanup stage. Repeat sequences were masked and annotated. The EST sequences were then clustered based on local similarity scores of pairwise comparisons using 88% similarity over 100 nucleotides. Clusters containing only a single sequence were grouped as singlets. The EST clusters were assembled into tentative contigs (contiguous sequence) by multiple-sequence alignment, generating a consensus sequence for each cluster, with criteria of 95% identity and 30-nucleotide overlap. As EST clusters might not share enough similarity over their entire length to be assembled into a single contig, multiple contigs might be generated per cluster. Moreover, multiple contigs might also be generated when ESTs within a cluster represent an alternative splice form of the gene. Those ESTs remaining in a cluster following formation of contigs were designated as cluster_singlets. Unique sequences for each library included contigs, cluster_singlets, and singlets.
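
The two "clean" sequence thresholds above (at least 100 nucleotides, quality score of at least 20) can be illustrated with a short Python sketch. This is not the Paracel Transcript Assembler procedure, whose internals are not described here. It assumes that the quality threshold is applied per base and that the longest uninterrupted high-quality stretch of a read is kept. The function names are hypothetical.

```python
MIN_LENGTH = 100   # minimum clean length in nucleotides, as stated in the text
MIN_QUALITY = 20   # minimum quality score, as stated in the text


def longest_high_quality_run(qualities, min_quality=MIN_QUALITY):
    """Return (start, end) of the longest run of bases with quality >= min_quality."""
    best = (0, 0)
    start = None
    # A sentinel score below any real quality value closes a run that reaches the end.
    for i, q in enumerate(list(qualities) + [-1]):
        if q >= min_quality:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def clean_read(sequence, qualities):
    """Trim a read to its best high-quality stretch; return None if it is too short.

    Assumption: vector and contaminant screening happen elsewhere; this step
    only applies the length and quality thresholds.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lists must have the same length")
    start, end = longest_high_quality_run(qualities)
    trimmed = sequence[start:end]
    return trimmed if len(trimmed) >= MIN_LENGTH else None
```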

Sequence Analysis and Annotation

Putative functions of the apple unique ESTs were classified according to the Gene Ontology (GO) Consortium (2001) scheme. The representation of protein families, domains, and functional sites within apple unique sequences was determined using InterProScan (EMBL, Cambridge, UK). Subsets of apple unique sequences, of at least 300 bp in length, were additionally cataloged into 22 functional categories based on similarity to Arabidopsis [Arabidopsis thaliana (L.) Heynh.] proteins and functional annotation available for Arabidopsis proteins following Munich Information Center for Protein Sequences (MIPS) (http://mips.gsf.de [verified 22 Jan. 2009]) Functional Catalogue (FunCat) schema (Ruepp et al., 2004).

Data Sets Used for Analyses

The apple (M. × domestica Borkh.) unigene set, and the ESTs used for this unigene set build are available on the Apple EST project Web site (http://titan.biotec.uiuc.edu/apple/apple [verified 22 Jan. 2009]) under the Search ESTs (Final Assembly) link. The Arabidopsis thaliana protein sequences are available through the Arabidopsis Information Resource (http://www.arabidopsis.org [verified 22 Jan. 2009]). All UniGene data sets for citrus, grape (Vitis vinifera L.), pine (Pinus taeda L.), poplar, soybean (Glycine max L.), and tomato are available on National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene [verified 22 Jan. 2009]). Oryza sativa L. protein sequences and the nonredundant protein database are also available on NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&cmd [verified 22 Jan. 2009]) and the poplar protein sequences are available at JGI Populus trichocarpa v1.1 (http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.home.html [verified 22 Jan. 2009]).


Synthesis and Sequencing from Normalized and Primary Apple cDNA Libraries

A total of 34 directionally cloned apple cDNA libraries were constructed from six different tissues (shoot, bud, leaf, flower, fruit, and root); six treatments (biotic and abiotic stresses); and 13 genotypes, including nine cultivars (Braeburn, Fuji, GoldRush, Granny Smith, Jonagold, Red Delicious, Royal Gala, Suncrisp, and Wijcik); three rootstocks (M.9, M.111, and Geneva 3041); and one interspecific hybrid (M. × domestica cv. Geneva 3041 × M. sieversii) (Table 1). The largest group of libraries (11) was constructed from different tissues of the cultivar GoldRush, representing 67% of the total ESTs. Both primary and normalized cDNA libraries were constructed from GoldRush. Seven primary libraries, corresponding to four developmental stages of apple leaf, first developmental stage of apple fruit, and two pest-challenged (apple scab and obliquebanded leafroller [Choristoneura rosaceana (Harris)]) leaf tissues, were constructed. Furthermore, four normalized libraries from flower, fruit, bud, and stem apple tissues were constructed. Each normalized library, encompassing three to six developmental stages, was evenly represented, and 3′-labeled with an identifying tag sequence (Tables 1 and 2). To increase the likelihood of SNP detection, 10 additional primary libraries from flower (balloon stage) and fruit (9 d after pollination) were constructed from five apple cultivars, including Braeburn, Fuji, Granny Smith, Jonagold, and Suncrisp, and ∼6,000 clones were sequenced from each library (Table 1). Sequences from root tissue of apple rootstocks and tissues subjected to the two treatments, mentioned above, contributed 6% each to the total apple EST pool (Table 1).

From these libraries, 190,425 clones were partially sequenced from the 5′ end and 42,619 clones were sequenced from the 3′ end. A total of 182,241 high-quality ESTs, of at least 100 bp in size and with an average clean sequence length of 465 bp, were obtained (Table 3). These ESTs assembled into 23,442 tentative consensus sequences and 9843 singletons, thus representing 33,285 apple unique sequences (Table 3). The contig sequence length ranged from 101 to 4043 bp with an average of 865 bp, while the singleton length ranged from 100 to 1344 bp with an average of 441 bp (Fig. 1).

Figure 1.

Distribution of sequence length (bp) of consensus sequences (TCs) and singletons that constitute the apple unigene set.


The largest number of ESTs sequenced from a single library originated from the normalized fruit library (26,917), representing cDNAs from six stages of fruit development. This was followed by the normalized flower library (22,237), representing cDNAs from four stages of flower development.

Functional Annotation of an Apple Unigene Set

Annotation of the EST-derived apple unigene set was performed on the basis of the existing annotation available for the proteome of Arabidopsis. tBLASTX was used to screen the entire apple unigene set against a subset of the Arabidopsis proteome to which functional categories have been assigned (Arabidopsis Genome Initiative, 2000) (http://www.arabidopsis.org). Apple unigenes with expected values (E-values) of E ≤1.0 × 10⁻⁵ were assigned the corresponding Arabidopsis annotation. This approach assumes that function can be transferred on the basis of sequence conservation, an assumption to which there are many exceptions. These annotations followed the gene ontology (GO) vocabularies (Gene Ontology Consortium, 2001) [www.geneontology.org (verified 22 Jan. 2009)] as well as The Arabidopsis Information Resource Arabidopsis anatomy and developmental stage ontologies (Berardini et al., 2004). The GO terms were organized into three categories representing molecular functions, biological processes, and cellular components (Gene Ontology Consortium, 2001). The sum of apple unigenes per category did not add up to 100%, as some apple unigenes were classified into more than one category.
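
The E-value cutoff described above can be sketched in Python as a filter over BLAST results. This is an illustration rather than the authors' code. It assumes that hits have been exported in the standard 12-column tab-separated BLAST tabular format, in which the query ID, subject ID, and E-value are columns 1, 2, and 11, and that each unigene inherits the annotation of its single lowest-E-value hit. The function name `best_hits` is hypothetical.

```python
E_VALUE_CUTOFF = 1.0e-5  # threshold stated in the text


def best_hits(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each query ID to its (subject ID, E-value) hit with the lowest E-value.

    Hits with an E-value above the cutoff are ignored, so queries without any
    qualifying hit are absent from the result and remain unannotated.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

The returned dictionary could then be joined to a table of Arabidopsis GO annotations keyed by subject ID.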

Evaluation of the primary BLAST matches revealed two major groups of apple unigene sequences that differ in how well their cellular functions can be predicted. The first group consisted of 26,333 apple unigene sequences (79% of the total unigene set) that matched sequences of known proteins (E <1.0 × 10⁻⁵) and were therefore likely to be transcripts of genes having similar functions (Table 3). The GO vocabulary was used to assign functions to this group. The second group consisted of 6952 apple unigene sequences, accounting for 21% of the total apple unigene set, with no matches in the GenBank database. These were deemed to be apple specific or genes most likely associated with tree formation (Table 3).

Of the total apple unigene set, 13,980 (42%) unigenes were annotated into the Molecular Function GO category (describing the biochemical activity performed by the gene product), 10,694 (32%) unigenes into the Biological Process GO category (describing the ordered assembly of more than one molecular function), and 13,338 (40%) into the Cellular Component GO category (describing subcellular compartments of a cell) (Fig. 2A). Among the molecular functions, the most highly represented categories were catalytic activity (GO:0003824) (53%), binding (GO:0005488) (49%), transporter activity (GO:0005215) (9%), and transcription regulator activity (GO:0030528) (9%) (Fig. 2B). Among the biological processes, the largest proportion of functionally assigned unigenes fell into the metabolic (GO:0008152) and cellular processes (GO:0009987), 75% each, while localization (GO:0051179), biological regulation (GO:0065007), and response to stimulus (GO:0050896) comprised 14, 12, and 12% of the unigenes, respectively (Fig. 2C). For the cell component category, almost all unigene sequences were annotated into the cell (GO:0005623)–cell part (GO:0044464) subcategory (99%), and 38% of these were annotated to membrane (GO:0016020) and 64% to organelle (GO:0043226) subcategories (Fig. 2D). Together, all three GO categories accounted for ∼79% of the assigned apple unigene set.
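
As a quick consistency check, the counts and percentages reported above and in Table 3 can be recomputed. The Python sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Counts taken from the text and Table 3.
contigs = 23_442
singletons = 9_843
unigenes = contigs + singletons          # total apple unique sequences

matched_known = 26_333                   # unigenes matching known genes
apple_specific = 6_952                   # unigenes with no match

go_counts = {
    "Molecular Function": 13_980,
    "Biological Process": 10_694,
    "Cellular Component": 13_338,
}


def percent(part, whole):
    """Percentage of whole, rounded to the nearest integer."""
    return round(100 * part / whole)


print(unigenes)                                   # 33285
print(percent(matched_known, unigenes))           # 79
print(percent(apple_specific, unigenes))          # 21
for category, count in go_counts.items():
    print(category, percent(count, unigenes))     # 42, 32, 40
```

The three GO percentages sum to more than 79% because, as noted above, a unigene can be assigned to more than one category.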

Figure 2.

Distribution of apple unigenes whose putative functions could be assigned through annotation. Role categories according to gene ontology (Gene Ontology Consortium, 2001) were derived from BLASTX matches of apple unigenes against the annotated Arabidopsis proteome and are as follows: (A) Distribution of annotated apple unigene set to three GO functional groups; (B) molecular function; (C) biological process; (D) cell component.


In addition, a subset of 1982 apple unique sequences (of at least 300 bp in length) was translated into 23,076 open reading frames ≥50 amino acids and assigned to 22 functional categories based on functional annotations available for Arabidopsis proteins following the MIPS (http://mips.gsf.de) FunCat schema (Ruepp et al., 2004). Only 10% of apple sequences from the subset did not have a match in Arabidopsis. Nearly half (44.79%) of those sequences that have matches are similar to unclassified proteins in Arabidopsis. Of classified proteins in the apple subset, the Metabolism category contained the highest number of genes (7.61%), and this concurred with previous observations for apple and Arabidopsis (Newcomb et al., 2006) (Table 4).
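
The translation step mentioned above (open reading frames of at least 50 amino acids) can be illustrated with a six-frame translation in Python. The text does not state how open reading frames were defined, so this sketch makes an explicit assumption: an open reading frame is any stretch between two stop codons, with no start codon required, since ESTs are often partial. It uses the standard genetic code, and the function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def open_reading_frames(seq, min_aa=50):
    """Return all stop-free peptide stretches of at least min_aa residues in six frames."""
    seq = seq.upper()
    strands = (seq, seq.translate(_COMPLEMENT)[::-1])  # forward and reverse complement
    orfs = []
    for strand in strands:
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_aa:
                    orfs.append(segment)
    return orfs
```

Under this definition a single long sequence can yield several open reading frames, one or more per frame.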

Table 4.

Munich Information Center for Protein Sequences (http://mips.gsf.de) Functional Catalogue analysis of subset of apple unique sequences.

No. Functional category Apple unique sequences
01 Metabolism 7.61
02 Energy 0.13
10 Cell cycle and DNA processing 1.34
11 Transcription 2.64
12 Protein synthesis 1.95
14 Protein fate 5.02
16 Protein with binding function or cofactor requirement 2.77
18 Protein activity regulation 0.22
20 Cellular transport 4.76
30 Cellular communication/signal transduction mechanism 4.11
32 Cell rescue, defense, and virulence 2.77
34 Interaction with the cellular environment 0.43
36 Interaction with the environment 0.82
40 Cell fate 1.86
41 Development 1.90
42 Biogenesis of cellular components 4.15
70 Subcellular localization 6.53
73 Cell type localization 0.09
75 Tissue localization 0.13
77 Organ localization 0.30
98 Classification not yet clear-cut 5.71
99 Unclassified proteins 44.79

Comparison of apple unique sequences with InterPro protein family database (Zdobnov and Apweiler, 2001; Mulder et al., 2007), performed to determine representation of protein families, domains, and functional sites, revealed matches to 2425 InterPro families. The InterPro families with the most frequent representation in the apple unique sequences data set are presented in Table 5.

Table 5.

Fifty most common InterPro families represented within the apple unique sequences.

InterPro no. Description Frequency
IPR000719 Protein kinase 1040
IPR001611 Leucine-rich repeat 380
IPR001245 Tyr protein kinase 297
IPR002290 Ser-Thr protein kinase 257
IPR000504 RNA recognition motif 250
IPR008271 Ser-Thr protein kinase, active site 227
IPR001680 G-protein β WD-40 repeat 196
IPR000504 RNA-binding region RNP-1 (RNA recognition motif) 186
IPR001841 Zinc finger, RING 173
IPR001128 Cytochrome P450 167
IPR002048 Calcium-binding EF-hand 132
IPR002885 PPR repeat 119
IPR014778 MYB DNA-binding domain 104
IPR000608 Ubiquitin-conjugating enzymes 92
IPR001471 Pathogenesis-related transcriptional factor and ERF 87
IPR001344 Chlorophyll a/b-binding protein 86
IPR013753 Ras GTPase superfamily 86
IPR001440 TPR repeat 85
IPR000571 Zinc finger, C-x8-C-x5-C-x3-H type 80
IPR005123 2OG-Fe(II) oxygenase 80
IPR002110 Ankyrin 77
IPR001810 Cyclin-like F-box 76
IPR001993 Mitochondrial substrate carrier 74
IPR001471 Epimerase–NAD-dependent epimerase/dehydratase 70
IPR001878 Zinc finger, CCHC type 70
IPR013766 Thioredoxin-type domain 66
IPR007125 Histone core 62
IPR014045 Protein phosphatase 2C,N-terminal 62
IPR001087 Lipolytic enzyme, G-D-S-L 59
IPR002016 Haem peroxidase, plant/fungal/bacterial 59
IPR001092 Basic helix-loop-helix (bHLH) dimerization domain bHLH 57
IPR001623 Heat-shock protein DnaJ, N terminus 57
IPR000008 C2 domain 56
IPR006121 Heavy metal transport/detoxification protein 56
IPR002198 Short-chain dehydrogenase/reductase 55
IPR003439 ABC transporter 53
IPR013057 Amino acid transporter, transmembrane 53
IPR000626 Ubiquitin 52
IPR007087 Zinc finger, C2H2 type 49
IPR010847 Harpin-induced I 49
IPR000916 Bet v I allergen 46
IPR001356 Homeobox 44
IPR013126 Heat-shock protein 70 40
IPR002182 NB-ARC 39
IPR000425 Major intrinsic protein 38
IPR001395 Aldo/keto reductase 37
IPR002130 Peptidyl-prolyl cis-trans isomerase, cyclophilin type 33
IPR000157 TIR 30
IPR007493 Protein of unknown function DUF538 28
IPR001938 Thaumatin, pathogenesis related 24
2OG-Fe(II), 2-oxoglutarate and Fe (II)-dependent oxygenase; ABC, adenosine triphosphate–binding cassette; CCHC, CysCysHisCys; ERF, ethylene-responsive-element-binding factor; NAD, nicotinamide adenine dinucleotide; NB-ARC, nucleotide binding domain shared by Apaf-1, certain R (resistance) gene products, and CED-4; PPR, pentatricopeptide repeat; TIR, Toll/interleukin-1 receptor; TPR, tetratrico peptide repeat.

Protein kinases (IPR000719), with 1040 apple unigenes, are the most abundant family. Detailed analysis of transcription factors, via automated predictions based on comparisons to the InterPro database, identified the MYB transcription factor family as the most common in apple sequences (Table 6).

Table 6.

The 10 most common transcription factor (TF) families in apple identified by searches of automated predictions using InterPro.

Top 10 TF family descriptions No. apple unigene sequences InterPro accession nos. TF family rank
Apple Arabidopsis Rice
MYB 228 IPR014778 1 (1) 1, 11, 14 1, 9
Pathogenesis related 87 IPR001471 2 (2) 2 2
C2H2 Zn finger 52 IPR007087 3 (3) 6 7, 8, 10
Homeobox 65 IPR001356 4 (4) 7 ND
C2C2 Zn finger 62 IPR000315 5 (5) 5 3
Basic helix-loop-helix 61 IPR001092 6 (7) 3 ND
C3H-type 1 Zn finger 43 IPR000571 7 (8) 18 ND
NAC 39 IPR008917 8 (6) 4 ND
WRKY 38 IPR003657 9 (9) 10 4
bZip 35 IPR004827 10 (10) 9 5
Total no. TFs 1091 1470 1306
Apple ranks in parentheses are based on data from Newcomb et al. (2006).
Arabidopsis ranks are based on data from Riechmann et al. (2000).
Rice ranks are based on data from Goff et al. (2002).
ND, family not determined by Goff et al. (2002).

Comparisons of the Apple Unigene Set with Those of Other Plant Species

Large-scale EST and genomic sequence collections from multiple plant species are now available in public databases. With Unigene or Proteome databases available for multiple plant species, it is possible to explore relationships and differences among these species. Therefore, the apple unigene set was compared with available Unigene or Proteome collections.

Unigene or proteome collections from seven angiosperm species (Tree of Life [ToL]; http://tolweb.org/tree/phylogeny.html [verified 22 Jan. 2009]) were subjected to tBLASTX to identify sequences encoding similar proteins (Table 7). Six of these species are eudicots (clade rosids [soybean, poplar, Arabidopsis, and citrus], clade Vitaceae [grape], and clade asterids [tomato]), and one is a monocot (rice). In addition, apple sequences were compared to those of pine, the tree species most distantly related to apple among those compared, and to the nonredundant (nr) proteome database. As database collections for most of the selected plant species are continuously expanding, estimations of similarities and/or differences presented herein should be considered tentative.

Table 7.

Comparisons of apple unique expressed sequence tags (ESTs) (33,285) with unigenes or proteomes from other plant species.

Species Database No. of hits Similarity (%)
Arabidopsis Proteome 24,848 75
Citrus UniGene 13,521 40
Grape UniGene 18,691 56
Pine UniGene 14,693 44
Poplar UniGene 11,061 33
Poplar Proteome 25,817 77
Rice Proteome 23,768 71
Soybean UniGene 18,690 56
Tomato UniGene 16,871 50
Nonredundant Proteome 26,333 79
No. of hits is the total number of apple unique ESTs matching sequences in a given database.
Similarity (%) was calculated as no. of hits/no. of apple unique ESTs.
The poplar proteome is the Populus trichocarpa v1.1 protein database.
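The similarity column can be recomputed from the hit counts. The following is a minimal sketch, assuming only the footnote's definition (hits divided by the 33,285 apple unique ESTs, expressed as a percentage); the function and variable names are illustrative and are not taken from the original analysis pipeline.

```python
# Total number of apple unique ESTs, as stated in the Table 7 caption.
TOTAL_APPLE_UNIQUE_ESTS = 33_285

# (species, database) -> number of apple unique ESTs with a hit (from Table 7).
HITS = {
    ("Arabidopsis", "Proteome"): 24_848,
    ("Citrus", "UniGene"): 13_521,
    ("Poplar", "Proteome"): 25_817,
    ("Rice", "Proteome"): 23_768,
}


def similarity_percent(hits, total=TOTAL_APPLE_UNIQUE_ESTS):
    """Return the share of apple unique ESTs with a hit, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * hits / total


def unmatched_percent(hits, total=TOTAL_APPLE_UNIQUE_ESTS):
    """Return the share of apple unique ESTs without a hit, in percent."""
    return 100.0 - similarity_percent(hits, total)


for (species, database), hits in HITS.items():
    print(f"{species} ({database}): "
          f"{similarity_percent(hits):.1f}% matched, "
          f"{unmatched_percent(hits):.1f}% unmatched")
```

The recomputed values agree with the tabulated percentages to within about one percentage point; the small differences depend on whether a value is rounded or truncated.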

The percentages of apple sequences that did not show any similarity to sequences in other databases varied between 21% in the nr and 60% in the citrus database. The poplar predicted proteome, along with the Arabidopsis and rice proteome databases, produced the highest numbers of matches with the apple unigene set, 77, 75, and 71%, respectively (Table 7). The observed high level of sequence similarity between apple and each of poplar and Arabidopsis is in agreement with their position on the ToL. Interestingly, comparable levels of similarity, 40 to 56%, were observed between apple sequences and those of other plant species, such as citrus, tomato, pine, soybean, and grape, irrespective of their phylogenetic relationships to apple (Albert et al., 2005). This observation was greatly influenced by the size of the database as well as the type of sampled tissue for the given plant species at the time of comparison.

Apple–Arabidopsis Comparisons

With the availability of the complete genome sequence of Arabidopsis along with other ongoing major genomic efforts for various plant species (Arabidopsis Genome Initiative, 2000; Van der Hoeven et al., 2002; Brenner et al., 2005; Forment et al., 2005; International Rice Genome Project, 2005; Moser et al., 2005; Velasco et al., 2008), it is now feasible to perform comparative studies between highly divergent genomes by screening large EST databases of different plant species against the Arabidopsis genomic sequence (Fulton et al., 2002; Albert et al., 2005).

To evaluate the relationship between apple and Arabidopsis, a computational approach was used as described by Van der Hoeven et al. (2002). General trends in gene conservation and functionality between apple and Arabidopsis were evaluated by screening all apple unigenes in all possible translational frames (tBLASTX) against the complete Arabidopsis genomic sequence (http://www.arabidopsis.org), and resultant E-values of BLAST similarity searches were used as estimates of sequence conservation. As pointed out by Van der Hoeven et al. (2002), two factors could have effects on such an assumption: sequence length and type of analysis performed. Many of the unigene sequences were not full-length sequences, thereby reducing the significance (i.e., raising the E-values) of potential matches. BLAST analysis conducts local alignments, resulting in highly significant (i.e., low) E-values over short stretches of sequence conservation, thus favoring conservation of domains but not complete genes. Given our primary intent to use E-values to reveal general trends in the conservation of sequences and their functionality, we think that such drawbacks are unlikely to affect our overall conclusions.

Approximately 75% of apple unigene sequences had significant matches at the amino acid level (E <1.0 × 10−5) to one or more translated portions of the Arabidopsis genomic sequence. The highest proportion of apple unigenes (30%) with matches to the Arabidopsis genome fell into categories with strong homology to their Arabidopsis counterparts (E <1.0 × 10−50 to >1.0 × 10−100) (Fig. 3A).
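The matched fraction reported above can be derived from BLAST results by keeping the best (smallest) E-value per query and counting queries below the cutoff. The sketch below assumes the search results were saved in the standard 12-column BLAST tabular format, in which the first column is the query ID and the eleventh is the E-value; the example rows, sequence identifiers, and function names are hypothetical and do not come from the study.

```python
import csv
import io

E_VALUE_CUTOFF = 1.0e-5  # significance threshold quoted in the text


def best_evalues(tabular_text):
    """Map each query ID to its smallest (most significant) E-value.

    Assumes 12-column BLAST tabular output: column 1 is the query ID
    and column 11 is the E-value.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if query_id not in best or evalue < best[query_id]:
            best[query_id] = evalue
    return best


def fraction_matched(best, n_queries, cutoff=E_VALUE_CUTOFF):
    """Fraction of all queries whose best hit is below the cutoff."""
    matched = sum(1 for evalue in best.values() if evalue < cutoff)
    return matched / n_queries


# Hypothetical hits for four queries; apple_004 has no hit at all.
EXAMPLE_ROWS = [
    ["apple_001", "subj_A", "60.0", "100", "40", "0", "1", "300", "1", "300", "2e-30", "120"],
    ["apple_001", "subj_B", "85.0", "200", "30", "0", "1", "600", "1", "600", "1e-60", "230"],
    ["apple_002", "subj_C", "45.0", "60", "33", "0", "1", "180", "1", "180", "3e-7", "50"],
    ["apple_003", "subj_D", "30.0", "40", "28", "0", "1", "120", "1", "120", "0.5", "25"],
]
EXAMPLE_TEXT = "\n".join("\t".join(fields) for fields in EXAMPLE_ROWS)

best = best_evalues(EXAMPLE_TEXT)
print(best["apple_001"], fraction_matched(best, n_queries=4))
```

Dividing by the total number of queries, rather than by the number of queries with any hit, matters here because sequences with no hit at all do not appear in the tabular output.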

Figure 3.

Distribution of conservation between apple unigenes and genes in the Arabidopsis genome based on tBLASTX scores. Tentative contigs and singletons for whole apple unigene set. E ≥0.1, apple sequences with no match to Arabidopsis; E <1.0 ×10−50, “slow-evolving” genes; E ≤1.0 × 10−15 to E ≥1.0 × 10−50, “intermediate-evolving” genes; E >1.0 × 10−15, “fast-evolving” genes. (A) All apple unique sequences. (B) Only those apple sequences for which putative function could be established through annotation.


A closer examination of the putative functional role and the degree of sequence similarity (to its closest Arabidopsis counterpart) of each apple unigene was performed to further analyze the nature of both fast- and slow-evolving genes identified by apple–Arabidopsis comparisons (Fig. 3B). This would likely provide further insight into the types of genes and gene functions that were either more stable across plant taxa or likely to have evolved more rapidly as species evolved (Van der Hoeven et al., 2002). Of the 24,848 apple unigenes with matches to Arabidopsis, ∼50% showed high (E <1.0 × 10−50), ∼40% intermediate (E ≤1.0 × 10−15 to E ≥1.0 × 10−50), and ∼10% low (E >1.0 × 10−15) levels of conservation with Arabidopsis genes (Fig. 3A).
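The E-value bins used for Fig. 3 can be expressed as a small classifier. This is a minimal sketch, assuming the thresholds given in the Fig. 3 caption (E ≥0.1, no match; E <1.0 × 10−50, slow; 1.0 × 10−50 to 1.0 × 10−15 inclusive, intermediate; above 1.0 × 10−15, fast); the function name and the sample E-values are illustrative only.

```python
from collections import Counter

NO_MATCH_THRESHOLD = 0.1        # E >= 0.1: no match to Arabidopsis
SLOW_THRESHOLD = 1.0e-50        # E < 1e-50: "slow-evolving"
INTERMEDIATE_UPPER = 1.0e-15    # 1e-50 <= E <= 1e-15: "intermediate-evolving"


def conservation_category(evalue):
    """Assign a best-hit E-value to one of the conservation categories."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue >= NO_MATCH_THRESHOLD:
        return "no match"
    if evalue < SLOW_THRESHOLD:
        return "slow-evolving"
    if evalue <= INTERMEDIATE_UPPER:
        return "intermediate-evolving"
    return "fast-evolving"


def category_shares(evalues):
    """Return the share of sequences in each category, in percent."""
    counts = Counter(conservation_category(e) for e in evalues)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


# Illustrative best-hit E-values, not data from the study.
sample = [1e-120, 1e-60, 1e-50, 1e-30, 1e-15, 1e-8, 0.5]
print(category_shares(sample))
```

Both boundary values, 1.0 × 10−50 and 1.0 × 10−15, fall in the intermediate category under this reading of the caption.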

Detailed analysis of assigned putative functional roles revealed that within the “slow-evolving” category, the highest proportion of genes were annotated in metabolism and biosynthesis categories, 61 and 60%, respectively (Fig. 3B). The frequency of genes assigned to these two categories decreased as we moved to the “intermediate-evolving” (E ≤1.0× 10−15 to E ≥1.0 × 10−50; 31 and 32%, respectively) and the “fast-evolving” (E > 1.0 × 10−15; 8 and 9%, respectively) categories, thus suggesting that genes involved in metabolism and biosynthesis remained highly conserved during plant evolution between apple and Arabidopsis (Fig. 3B). A similar trend was observed for genes involved in transport, signaling, intracellular, and membrane functions. However, genes encoding transcription and transcription factor activity appeared to be in transition toward faster evolving categories, changing from 30% in the slow-evolving category to 50 and 19% in the intermediate- and fast-evolving categories (Fig. 3B). Interestingly, genes annotated in catalytic activity appeared only in the fast-evolving category, suggesting the furthest divergence between apple and Arabidopsis. Genes involved in binding and organelle functions (e.g., chloroplast and mitochondria), exhibited similar frequencies in slow- and intermediate-evolving categories, with a slight decline in the fast-evolving category (Fig. 3B).

Identification of Apple-Specific Unigenes

Of 33,285 apple unigenes, 8437 (25%) had no detectable homologs (E >0.1) in the Arabidopsis genome (Fig. 3A). This set of unigenes was further searched against the GenBank protein database to identify putative matches. A small proportion of these unigenes (816, 10%; E <1.0 × 10−15) showed similarities to protein sequences in GenBank. Of those showing homology with one or more GenBank entries, a subset of 457 (56%) unigenes with the most significant matches (E <1.0 × 10−30) were annotated for putative gene functions.

A large proportion (60%) of these 457 unigene sequences revealed perfect matches with bacterial, viral, fungal, or other nonplant sequences. Only 183 of the original 457 unigenes had no detectable counterparts in the Arabidopsis genome but matched genes from other plant species (those available in the GenBank protein database). Twenty-seven of these unigenes (15%) corresponded to 10 gene families that appeared to be specific to Rosaceae, having matches with other rosaceous plants but not with other plant families (Table 8). Six of these gene families were Malus specific: Mal d 1 (nine unigenes assigned), a major food allergen in apple (Son et al., 1999); polyphenol oxidase (three unigenes assigned), known to be involved in browning of damaged fruits (Haruta et al., 1999) and in herbivory resistance (Murata et al., 1997); a MADS box protein (a single unigene assigned), a developmental transcription factor (Leland and Podila, 2004); a fruit acidity–related protein (Mal-DDNA-DQ417661) (a single unigene assigned) from M. × domestica (Yao et al., 2007); a dehydrin (a single unigene assigned), a member of a class of plant proteins related to drought and cold stress responses (Garcia-Bañuelos et al., 2006; Yao et al., 2007); and the AHAP2 transcription factor (a single unigene assigned).

Table 8.

Putative functions of Malus-specific genes not conserved between apple and Arabidopsis.

Match description, species (GenBank no.) E-value Length query (amino acids) No. of isoforms
Mal d 1-like, Malus × domestica (AAS00042–AAD00053) 1.0 × 10−160 101–163 9
Polyphenol oxidase 2 precursor, Malus × domestica (AAK56323) 1.0 × 10−194 191–587 3
Polyphenol oxidase precursor, Prunus armeniaca (AAC28935) 1.0 × 10−108 99–245 2
Polyphenol oxidase, Prunus salicina var. cordata (AAW58109) 1.0 × 10−184 385 2
Transcription factor AHAP2, Malus × domestica (AAL57045) 1.0 × 10−70 180 1
Dehydrin, Malus × domestica 1.0 × 10−59 172 1
MADS box protein, Malus × domestica (CAC86183) 1.0 × 10−41 84 1
Fruit acidity–related protein, Malus × domestica (Mal-DDNA–DQ417661) 1.0 × 10−37 112 1

The majority of apple unigenes (90%), with no matches to the Arabidopsis proteome, had matches to species belonging to two clades of angiosperms, eudicots (94%) and monocots (6%) (data not shown).

Comparisons of Apple Unigenes with Those in Citrus and Poplar

Comparison of the apple unigene set with the Arabidopsis gene repertoire provides an overview of gene evolution between these two species since the time of their first divergence from their common ancestor. However, this does not provide insight into the broader context of genes that have differentiated since then and that may hold clue(s) for tree-specific gene evolution. With the advent of genomics efforts in other plant species, especially tree species, such as poplar (Tuskan et al., 2006) and citrus (Forment et al., 2005), both members of the core eudicot clade rosids (Albert et al., 2005), it is now possible to use computational comparisons between poplar and citrus and the apple unigene set to search for genes that may be linked to tree-specific characters.

In an attempt to identify such likely tree-specific genes, the apple unigene set was computationally compared with the citrus unigene data set, available in the NCBI UniGene database, and with the poplar predicted proteome, available at the JGI Populus trichocarpa v1.1 genome site (http://genome.jgi-psf.org/Poptr1_1/Poptr1_1.home.html). The entire apple unigene set was compared with the entire citrus unigene set and the poplar predicted proteome at the amino acid level using tBLASTX (Fig. 4). Almost 40% (13,521) of apple unigenes matched one or more sequences in the citrus UniGene database (Table 7), and ∼20% (2667) of those had a highly significant counterpart in citrus (tBLASTX E <1.0 × 10−50) (Fig. 4A). The proportion of less conserved sequences between apple and citrus (tBLASTX E ≤1.0 × 10−15 to ≥1.0 × 10−50) increased to 53% (7190), while the category of the “fastest evolving” genes comprised 27% (4115) of apple sequences (Fig. 4A). Apple–Arabidopsis–citrus comparisons revealed 189 apple sequences that had a match in the citrus unigene set but not in the Arabidopsis proteome (Fig. 4B). Most of these sequences, 83% (157), belonged to the fast-evolving gene category (tBLASTX E >1.0 × 10−20).

Figure 4.

Distribution of conservation between apple unique sequences and citrus and poplar genes based on tBLASTX scores. (A) Distribution of all apple sequences that had a match in citrus–poplar database. E <1.0 × 10−50, “slow-evolving” genes; E ≤1.0 × 10−15 to E ≥1.0 × 10−50, “intermediate-evolving” genes; E >1.0 × 10−15, “fast-evolving” genes. (B) Only those apple sequences with citrus–poplar but no Arabidopsis match. E <1.0 × 10−20, “slow-evolving” genes; E >1.0 × 10−20, “fast-evolving” genes.


Most of the apple unigenes (98%) that had counterparts in the citrus UniGene database could be functionally annotated using the Arabidopsis genomic sequence (http://arabidopsis.org), and they were evenly distributed among the three GO categories.

Out of 33,285 apple unigenes, 77% (25,817; tBLASTX E <1.0 × 10−5) had counterparts in the predicted proteome of Populus trichocarpa (Table 7), and ∼50% of those (13,091) had highly significant counterparts in poplar (E <1.0 × 10−50) (Fig. 4A). The proportion of intermediate-evolving sequences between apple and poplar decreased to 37%, while the fast-evolving genes encompassed only 13% of apple sequences (Fig. 4A). Functional annotation of apple–poplar matches successfully assigned putative functional roles to 60% (15,411) of apple–poplar homologs. However, 40% (10,406) of apple sequences with matches in the Poplar JGI protein database could not be annotated through GO (Fig. 4B). Most of these sequences, 77% (8033), are classified in the slow-evolving gene category (tBLASTX E <1.0 × 10−20). Further comparisons of apple sequences that had counterparts in poplar but not in the Arabidopsis protein database with the nr protein database available at NCBI revealed that 3428 (13%) apple–poplar matches had no similarities to any other available protein or nucleotide database. These apple sequences were most likely to contain tree-specific genes because the majority (60%) exhibited high conservation (E <1.0 × 10−20) with poplar proteins (Fig. 4B). In addition, 21% of the proposed tentative tree-specific genes have at least 100 amino acid matches with poplar counterparts, ranging from 900 to 2796 bp in length.

To uncover patterns of conservation and divergence among apple, citrus, and poplar, we computationally compared the apple unigene set with those of both citrus and poplar, and used E-values of putative functional annotations based on the Arabidopsis–apple comparison (Fig. 5). Detailed analyses of assigned putative functional roles for apple–citrus and apple–poplar matches revealed that for almost all categories, at least 50% of apple–citrus homologs belonged to the intermediate-evolving category, whereas at least 50% of apple–poplar homologs belonged to the slow-evolving category, suggesting a higher conservation between apple and poplar than between apple and citrus (Fig. 5). For example, the proportion of genes involved in metabolism and catalytic activity between apple and poplar was highest in the slow-evolving category, 64 and 62%, respectively, and decreased when moving to the intermediate-evolving (E ≤1.0 × 10−15 to E ≥1.0 × 10−50, 28 and 29% of unigenes, respectively) and the fast-evolving (E > 1.0 × 10−15; 8 and 10% of unigenes, respectively) categories, suggesting that metabolism and catalytic activity remained highly conserved in plant evolution between apple and poplar (Fig. 5). However, genes encoding biosynthesis activity appeared to be transitioning to faster evolving groups between apple and poplar, changing from 19% in the slow-evolving category to 62 and 19% in the intermediate- and fast-evolving categories, respectively (Fig. 5). As for the apple–citrus evolutionary divergence, most genes that are homologous between those two species belong to the intermediate-evolving category, suggesting greater divergence between apple and citrus than between apple and poplar (Fig. 5).

Figure 5.

Distribution of conservation between apple sequences and citrus and poplar genes based on tBLASTX scores. E <1.0 × 10−50, “slow-evolving” genes; E ≤1.0 × 10−15 to E ≥1.0 × 10−50, “intermediate-evolving” genes; E >1.0 × 10−15, “fast-evolving” genes. AC, apple–citrus comparison; AP, apple–poplar comparison.


Comparisons of the apple–citrus and apple–poplar matches revealed 40% of apple unigenes with matches in both databases. Approximately 99% of apple–citrus homologs also have counterparts in the poplar proteome, while 48% (12,402 apple unigenes) of apple–poplar homologs do not have counterparts in citrus.
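The overlap figures in this paragraph can be cross-checked against the hit counts in Table 7. The sketch below assumes that the 12,402 poplar-only unigenes and the Table 7 hit counts refer to the same match sets; the function and key names are illustrative.

```python
APPLE_UNIGENES = 33_285   # apple unique sequences (Table 7 caption)
CITRUS_MATCHES = 13_521   # apple unigenes with a citrus UniGene hit (Table 7)
POPLAR_MATCHES = 25_817   # apple unigenes with a poplar proteome hit (Table 7)
POPLAR_ONLY = 12_402      # apple-poplar homologs without a citrus counterpart


def overlap_summary(total, citrus, poplar, poplar_only):
    """Derive the citrus/poplar overlap from the poplar-only count."""
    both = poplar - poplar_only  # unigenes matched in both databases
    if both < 0 or both > citrus:
        raise ValueError("counts are mutually inconsistent")
    return {
        "both": both,
        "pct_of_all_unigenes": 100.0 * both / total,
        "pct_citrus_also_in_poplar": 100.0 * both / citrus,
        "pct_poplar_without_citrus": 100.0 * poplar_only / poplar,
    }


summary = overlap_summary(APPLE_UNIGENES, CITRUS_MATCHES,
                          POPLAR_MATCHES, POPLAR_ONLY)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption, the derived values round to the 40, 99, and 48% quoted in the text.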

Discussion

Large-scale single-pass sequencing of cDNA clones randomly picked from libraries is a very powerful approach for gene discovery and for providing a global profile of the transcriptional activity within tissues (Adams et al., 1993). However, both reliability of data and frequency of identifying novel sequences depend to a large extent on the quality of the constructed cDNA libraries (Bonaldo et al., 1996; Gasic et al., 2005). To increase the discovery of rare transcripts and to reduce the time involved in constructing cDNA libraries, we have combined two different approaches for cDNA library construction. The first approach involved pooling equimolar amounts of cDNAs from different developmental stages of the same apple tissues, followed by normalization. Using this strategy, we have developed four cDNA libraries from flower, fruit, bud, and stem tissues, comprising 14 developmental stages, from the apple cultivar GoldRush, and generated a total of 63,384 EST sequences. Each cDNA has been tagged with a different 6-nucleotide tag at the 3′ end, thus allowing us to identify the developmental stage of the target tissue. Additionally, 29 nonnormalized or primary libraries have been constructed from six different tissues (bud, shoot, leaf, flower, fruit, and root) and two treatments (biotic and abiotic stresses) from nine apple cultivars (Braeburn, Fuji, GoldRush, Granny Smith, Jonagold, Red Delicious, Royal Gala, and Suncrisp); three apple rootstocks (M.9, M.111, and Geneva 3041); and a single interspecific hybrid (M. × domestica cv. Geneva 3041 × M. sieversii), generating a total of 118,857 EST sequences. The majority of the ESTs (67%) originated from tissues collected from the late-ripening, yellow-fruited apple cv. GoldRush, which has excellent fruit quality and long storageability combined with field immunity to apple scab disease, a high level of resistance to apple powdery mildew [Podosphaera leucotricha (Ellis and Everh.)], and moderate resistance to the bacterial disease fire blight (Crosby et al., 1994).

Clustering of high-quality sequences reduced the number of ESTs to 33,285 apple unique sequences, comprising 23,442 tentative consensus sequences and 9843 singletons. Analysis of the overall contribution of the libraries to the data set showed that no single library contained >7% of the total number of singletons, suggesting that most of the diversity was derived by sequencing different sources of tissues, which was similar to recent findings by Newcomb et al. (2006). The reproductive tissues, consisting of 13 libraries constructed from six different genotypes, comprised 28% of all apple ESTs and provided the highest contribution (40%) to the apple unigene set. However, a high redundancy was observed in the normalized fruit library, and this was attributed to tissue sampling. The normalized fruit library comprised all six stages of apple fruit development, while the nonnormalized fruit library was derived from the first developmental stage, young fruitlets (9 d after pollination). Additionally, during the last three stages of fruit tissue collection, maturity stages I and II as well as ripe fruit, the amount of RNA in these tissues was very low. This suggested that during this period of fruit development, gene expression was low as cell–tissue differentiation has been completed by then, while ongoing cell expansion was accompanied by sugar and starch accumulation.

Taking into account the number of apple sequences used for assembly, the total number of apple unigenes obtained in this study, 33,285, is comparable to recently published estimates by Newcomb et al. (2006) and Park et al. (2006). It is likely that this is an overestimate of the actual number of genes present in the apple genome. Using the analogy of Arabidopsis (Arabidopsis Genome Initiative, 2000), Newcomb et al. (2006) have estimated the total number of apple genes to be ∼27,000. However, a more accurate estimate of the total number of genes in the apple genome can be made by comparing the size of the EST-derived unigene set and the percentage of predicted genes in genomic DNA (e.g., BAC sequences) that are represented by a unigene match. Recently, a first draft of the physical map of the apple genome has been constructed in our laboratory (Han et al., 2007), and efforts are underway to anchor this physical map to the genetic map. In addition, sequencing of the whole apple genome is also underway (Shulaev et al., 2008; Velasco et al., 2008). Therefore, these new genomic resources would eventually provide a more accurate accounting of the number of genes present in the apple genome.
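One possible reading of the estimation approach mentioned above is a simple scaling: if a known fraction of genes predicted in sequenced genomic DNA is represented by a unigene, the number of distinct genes represented by the unigene set can be divided by that fraction. The sketch below is an illustration under stated assumptions, not the authors' method; the redundancy parameter (average number of unigenes per gene), the function name, and the toy input values are all hypothetical.

```python
def estimate_gene_count(n_unigenes, fraction_genes_matched,
                        unigenes_per_gene=1.0):
    """Scale a unigene count up to a genome-wide gene estimate.

    n_unigenes: size of the EST-derived unigene set.
    fraction_genes_matched: share (0-1] of predicted genes in sampled
        genomic DNA (e.g., BAC sequences) that have a unigene match.
    unigenes_per_gene: assumed average redundancy of the unigene set
        (hypothetical parameter; 1.0 means one unigene per gene).
    """
    if not 0.0 < fraction_genes_matched <= 1.0:
        raise ValueError("fraction_genes_matched must be in (0, 1]")
    if unigenes_per_gene < 1.0:
        raise ValueError("unigenes_per_gene must be at least 1")
    genes_represented = n_unigenes / unigenes_per_gene
    return genes_represented / fraction_genes_matched


# Toy numbers for illustration only, not apple data.
print(estimate_gene_count(1000, 0.5, unigenes_per_gene=2.0))
```

The two parameters pull in opposite directions: redundancy in the unigene set lowers the estimate, whereas incomplete EST coverage of predicted genes raises it.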

Computational comparisons of apple unigenes against the Arabidopsis and the nr proteome databases have allowed for identification of putative homologous protein sequences and assignment of putative functional roles to 75 and 79% of our transcripts, respectively. These proportions are similar to those found in grape (Moser et al., 2005) and in other woody perennial plant species, such as peach (Horn et al., 2005), citrus (Bausher et al., 2003; Forment et al., 2005), and poplar (Sterky et al., 2004). The remaining 21% of our sequences, having no matches to any sequences in public databases, may represent apple-specific genes. Using a similar approach, Van der Hoeven et al. (2002) have been able to assign putative functions to only 30% of tomato unigenes, while Newcomb et al. (2006) have reported that only 6% of apple nonredundant sequences do not have matches in Arabidopsis. These observed differences may be attributed to differences in E-value thresholds as well as the depth of EST sampling, whereby an E-value of <1.0 × 10−10 in tomato has been used compared to an E-value of <1.0 × 10−5 in apple; moreover, a sampling of ∼150,000 apple ESTs has been used by Newcomb et al. (2006) compared to ∼190,000 apple ESTs used in this study.

The GO classification of apple–Arabidopsis matches showed similar distribution of apple unigenes among the three categories, molecular function, biological process, and cellular component. In addition, representatives have been found in every major putative functional role, thus indicating that a genome-wide EST collection has been generated. Furthermore, distribution of functionally annotated apple unigenes resembles that of the full set of proteins in Arabidopsis. These findings are similar to those reported for citrus–Arabidopsis comparisons (Forment et al., 2005). Furthermore, methods of predictive bioinformatics, such as comparison with MIPS-based role classification of Arabidopsis, and matches to the InterPro protein family, have also been employed to elucidate the function of encoded proteins predicted from apple unigenes. Among the most frequently represented classes of genes were the protein kinases, followed by leucine-rich repeat (LRR) and RNA recognition motif proteins. In general, our findings are in agreement with the previously reported distribution of ∼43,000 apple nonredundant sequences to the InterPro protein families by Newcomb et al. (2006). However, small discrepancies attributed to differences in genotype–tissue–treatment sources for cDNA development were noted. For example, high numbers of sequences in InterPro classes that were potentially involved in disease resistance were detected in both studies; for the LRR class of proteins (IPR001611), 321 were found by Newcomb et al. (2006), and 380 were found in this study. However, we detected nearly twice as many sequences in the protein kinase (IPR000719) class as Newcomb et al. (2006) (1040 vs. 564), and failed to detect apple unigene sequences in either NBS (nucleotide binding sites)–LRR or plant-specific LRR protein classes. In addition, other functional classes of proteins, such as putative transcription factors, were identified in both databases. Comparisons of the frequency of the most common transcription factor families in both data sets and with Arabidopsis (Riechmann et al., 2000) and rice (Goff et al., 2002) revealed similar rankings.

Comparisons of the apple unigene set with those of other plant species, available in the NCBI UniGene database, have shown various levels of similarity (Table 7). As expected, the highest level of similarity is observed with the poplar and Arabidopsis proteomes as well as with the rice protein database. The observed high level of sequence similarity between apple and each of poplar and Arabidopsis is in agreement with their position on the ToL, with apple and poplar belonging to eurosid I and Arabidopsis to eurosid II clades. On the other hand, the observed high sequence similarity between apple and rice does not agree with their placement on the ToL, but this is attributed to the available amount of sequence data and genes involved in basic metabolic pathways that have remained conserved among plant species. However, significant levels of similarities are also observed with several other plant species having different phylogenetic relationships with apple, including soybean, citrus, grape, and tomato; although all are eudicots, they belong to different families, Fabaceae, Rutaceae, Vitaceae, and Solanaceae, respectively (Fulton et al., 2002; Albert et al., 2005). Phylogenetic trees of species of the Floral Genome Project (Albert et al., 2005) indicate that there is a relatively close evolutionary relationship between apple and Arabidopsis. Thus, an evaluation of the general trends in gene conservation and functionality between apple and Arabidopsis has been initiated to reveal trends in both gene and genome divergences between these two species. Significant matches to Arabidopsis genes, likely exhibiting conserved gene functions, have been identified for 30% of apple unigenes. Similar to findings in tomato (Van der Hoeven et al., 2002), the majority of apple unigenes (80%) with no matches in Arabidopsis have unknown functions and are without matches in other genome databases. Hence, these may represent fast-evolving genes that have acquired new functions in apple and related taxa. The majority of these novel genes, such as Mal d 1, are confined to apple and to other rosaceous species. Additionally, an assessment of the apple gene content provides evidence for selective gene loss in the Arabidopsis lineage, which has been previously identified from the tomato–Arabidopsis comparison (Van der Hoeven et al., 2002). Examples of such selective losses include polyphenol oxidases that are present in apple, tomato, and in many other plant species, but not in Arabidopsis. If Arabidopsis and tomato lineages diverged ∼100 to 150 million years ago during the evolution of flowering plants (Yang et al., 1999), we can speculate that apple and Arabidopsis lineages must have diverged from their common ancestor at a later date, ∼75 to 100 million years ago (Albert et al., 2005). This suggests that the loss of the polyphenol oxidase gene function in Arabidopsis must have occurred sometime following its divergence from apple.

Comparison of the apple unigene set with the Arabidopsis gene repertoire provides an overview of gene evolution between those two species since their early divergence from a common ancestor. However, this does not provide any insight into genes that have differentiated since then, particularly those that are lost from the Arabidopsis lineage and are likely to hold clues into tree-specific gene evolution. The majority of apple–citrus (13,352) and apple–poplar (25,817) matches also exhibit similarities with the Arabidopsis gene repertoire, 98 and 73%, respectively. Moreover, the majority of apple sequences present only in citrus (53%) but not in Arabidopsis belong to the fast-evolving gene category, and range in length from 168 to 2,058 bp. Conversely, 71% of apple sequences that have counterparts in the poplar predicted proteome but not in Arabidopsis belong to the slow-evolving category (E <1.0 × 10−20). Further search for functional annotation of apple–poplar matches against the nr protein database has revealed ∼3,000 apple sequences having counterparts only in poplar but not in any other available protein or nucleotide database. These apple unigenes may represent tree-specific genes, and their functional roles should be explored. Phylogenetically, apple and poplar belong to eurosid I, while citrus belongs to eurosid II, both clades of rosids. The rosids represent the largest of eight major clades of core eudicots, and include nearly one-third of all flowering plants. Single- and multigene phylogenies of rosids have identified seven major clades, and although relationships among these clades remain unresolved, DNA-based studies support its monophylogeny (Savolainen et al., 2000a, 2000b; Soltis et al., 2000, 2003; Hilu et al., 2003; Ravi et al., 2007). Thus, our “tree-specific gene set” assumption that the state of the “tree-form” is monophyletic within apple, poplar, and citrus species grouping is valid. 
Genes involved in basic metabolic pathways appear to be largely conserved among apple, citrus, poplar, and Arabidopsis. This finding is consistent with those for Arabidopsis, tomato, and Medicago truncatula Gaertn. (Van der Hoeven et al., 2002), and further supports the hypothesis that basic metabolic pathways remain conserved among plant species. However, genes encoding transcription factors in apple, citrus, poplar, and Arabidopsis fall largely into the less conserved categories, and their frequencies seem to double when moving from slow-evolving to fast-evolving categories (Fig. 3 and 5). Transcription factor genes thus appear to diverge more rapidly among plant species, suggesting that changes in gene regulation are a significant force in plant evolution (Doebley and Lukens, 1998; Stern, 2000). The observed evolutionary divergences among apple, poplar, citrus, and Arabidopsis correspond to their phylogenetic relationships (Albert et al., 2005), with higher conservation observed between apple and poplar than between apple and either citrus or Arabidopsis.

In summary, we present an extensive set of ESTs derived from various genotypes, tissues, and treatments, which adds to the overall value of the publicly available apple sequences. This data set has been used as a rich source for SSR and SNP marker development, for comparative genomics studies, and for creating an apple microarray useful for functional genomics studies and for characterizing genes involved in various biological processes.


The authors wish to thank Todd Wylie, Michael Dante, Candace Farmer, and William Courtney, all at the Genome Sequencing Center, Washington University, St. Louis, MO, for their valuable help.




