Rosaceae Family Unigene v3

 Analysis Name  Rosaceae Family Unigene v3
 Source  GenBank Rosaceae ESTs (Jun 1, 2006)
 Date performed  2006-06-01

Many sequencing projects around the world are depositing ESTs from the family Rosaceae in the NCBI dbEST database. However, not all of these ESTs are of high quality. To filter them, we cross-matched the public sequences against NCBI's UniVec database to screen for vector contamination and used BLAST sequence similarity searches to remove species-specific chloroplast, mitochondrial, tRNA, and rRNA sequences.

To reduce redundancy and create longer transcripts, we assembled the ESTs within each of the five genera (Fragaria, Malus, Prunus, Pyrus, and Rosa) using the CAP3 program [1]. We then took the contigs and singlets from these five assemblies and assembled them together, again using CAP3 with -p 90.
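The two-stage assembly described above could be scripted roughly as follows. This is a minimal sketch, not the pipeline actually used: the FASTA file names are hypothetical, and the only option taken from the text is `-p 90`. The sketch only builds the command lines; it does not run CAP3.

```python
# Sketch of the two-stage CAP3 plan: one assembly per genus, then one
# combined assembly of the resulting contigs and singlets.
# File names are hypothetical; "-p 90" is the only option given in the text.

GENERA = ["Fragaria", "Malus", "Prunus", "Pyrus", "Rosa"]


def cap3_command(fasta_path, percent_identity=90):
    """Return the argument list for one CAP3 run (cap3 <reads> -p <N>)."""
    return ["cap3", fasta_path, "-p", str(percent_identity)]


def assembly_plan(genera=GENERA, combined="rosaceae_combined.fasta"):
    """Stage 1: one command per genus. Stage 2: one command on the
    combined file, assumed to hold the concatenated contigs and singlets
    (CAP3 writes <reads>.cap.contigs and <reads>.cap.singlets)."""
    plan = [cap3_command(f"{genus}.fasta") for genus in genera]
    plan.append(cap3_command(combined))
    return plan


if __name__ == "__main__":
    for command in assembly_plan():
        print(" ".join(command))
```

Each command could then be passed to `subprocess.run` once the input files exist.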

For some sequences, we were able to obtain the original trace files and incorporate the phred quality values for each base into the initial genus-level assemblies. The final assembly has been annotated by BLAST sequence similarity searching [2] against Swiss-Prot [3], TrEMBL [3], and the Arabidopsis proteins from TAIR [4].

For more information on this project please contact the GDR development team.

All the Rosaceae ESTs available in GenBank on June 14, 2006 were included in this assembly. CAP3 was run with -p 90 and produces assembled contigs and singlets. The number of tentative unigenes for this assembly is the sum of the contigs and singlets from the final assembly.

Processing Summary
 Number of ESTs available  364105
 Number of ESTs available after filtering  359001
 Average Length  581
 Number of Contigs (CAP3 Assembly, -p 90)  13764
 Average Length of Contigs  1023
 Number of Singlets  76573
 Number of Putative Unigenes  90337
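The figures in the Processing Summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Cross-check of the Processing Summary, using the values from the table.
ests_available = 364105
ests_after_filtering = 359001
contigs = 13764
singlets = 76573

# Putative unigenes are the contigs plus the singlets of the final assembly.
unigenes = contigs + singlets

# ESTs removed by the vector and organelle/RNA filtering step.
removed = ests_available - ests_after_filtering

print(unigenes)  # 90337
print(removed)   # 5104
```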


  1. Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Research, 9, 868-877.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215(3):403-10.
  3. Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., and Schneider M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research. 31:365-370.
  4. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31(1):224-8.


Library Information
The Rosaceae ESTs used for this assembly were downloaded on June 14, 2006.


 EST Libraries
 Number of ESTs available  359001
 # of Species  17
 # of Libraries  151
 # of Tissues  23
 # of Development Stages  58



 Fragaria vesca  13453
 Fragaria x ananassa  5276
 Malus hybrid rootstock  320
 Malus sieboldii  1126
 Malus x domestica  245545
 Malus x domestica x Malus sieversii  3916
 Prunus armeniaca  14710
 Prunus avium  21
 Prunus avium x cerasus x canescens  84
 Prunus cerasus  12
 Prunus dulcis  3776
 Prunus persica  65148
 Pyrus communis  234
 Pyrus communis x Pyrus ussuriensis  81
 Pyrus pyrifolia  15
 Rosa chinensis  1790
 Rosa hybrid cultivar  3494
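Because the first assembly stage was run per genus, it can be useful to total the per-species EST counts by genus. The sketch below copies the counts from the list above and assumes the genus is the first word of each species name.

```python
from collections import Counter

# Per-species EST counts, copied from the library list.
EST_COUNTS = {
    "Fragaria vesca": 13453,
    "Fragaria x ananassa": 5276,
    "Malus hybrid rootstock": 320,
    "Malus sieboldii": 1126,
    "Malus x domestica": 245545,
    "Malus x domestica x Malus sieversii": 3916,
    "Prunus armeniaca": 14710,
    "Prunus avium": 21,
    "Prunus avium x cerasus x canescens": 84,
    "Prunus cerasus": 12,
    "Prunus dulcis": 3776,
    "Prunus persica": 65148,
    "Pyrus communis": 234,
    "Pyrus communis x Pyrus ussuriensis": 81,
    "Pyrus pyrifolia": 15,
    "Rosa chinensis": 1790,
    "Rosa hybrid cultivar": 3494,
}

# Group by genus (assumed to be the first word of the species name).
per_genus = Counter()
for species, count in EST_COUNTS.items():
    per_genus[species.split()[0]] += count

for genus, count in per_genus.items():
    print(genus, count)
print("Total", sum(EST_COUNTS.values()))
```

The total matches the 359001 ESTs available after filtering, and the 17 entries match the species count in the EST Libraries table.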



Homology was determined using the BLASTx algorithm for the Rosaceae contigs and singlets vs. the Swiss-Prot and TrEMBL databases. Only matches with an E-value of 1.0e-9 or better were recorded. Swiss-Prot is a curated protein database with a high level of annotation and a minimal level of redundancy, and TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
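The E-value cutoff could be applied to BLAST results along these lines. This is a sketch that assumes the hits are available in the standard 12-column tab-separated BLAST format, where the E-value is the 11th column; the query and subject identifiers in the sample are made up for illustration.

```python
E_VALUE_CUTOFF = 1e-9  # "1.0e-9 or better" means E-value <= 1e-9


def significant_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (query, subject, evalue) for hits at or below the cutoff.

    Assumes 12-column tab-separated BLAST output:
    query, subject, %identity, length, mismatches, gap opens,
    q.start, q.end, s.start, s.end, evalue, bit score.
    """
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= cutoff:
            hits.append((fields[0], fields[1], evalue))
    return hits


# Hypothetical sample rows (identifiers are not real accessions).
SAMPLE = "\n".join([
    "\t".join(["Contig1", "SP_A", "78.5", "200", "43", "0",
               "1", "600", "10", "209", "2e-35", "150"]),
    "\t".join(["Contig2", "SP_B", "40.0", "50", "30", "0",
               "1", "150", "5", "54", "1e-9", "60"]),
    "\t".join(["Contig3", "SP_C", "35.0", "40", "26", "0",
               "1", "120", "5", "44", "0.002", "35"]),
])

if __name__ == "__main__":
    for hit in significant_hits(SAMPLE):
        print(hit)
```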

 Homology of Rosaceae Contigs
 Number of Contigs  13764
 Number (%) of Contigs with a Match in Swiss-Prot Database  8326 (60.5%)
 Number (%) of Contigs with a Match in TrEMBL Database  11808 (85.8%)


 Homology of Rosaceae Singlets
 Number of Singlets  76573
 Number (%) of Singlets with a Match in Swiss-Prot Database  27948 (36.5%)
 Number (%) of Singlets with a Match in TrEMBL Database  48827 (63.8%)
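The match percentages in the two homology tables follow directly from the counts. A short sketch of the calculation, using only the numbers reported above:

```python
def percent(matches, total):
    """Share of sequences with a match, as a percentage with one decimal."""
    return round(100.0 * matches / total, 1)


# (matches, total) pairs taken from the homology tables.
REPORTED = {
    "contigs vs Swiss-Prot": (8326, 13764),
    "contigs vs TrEMBL": (11808, 13764),
    "singlets vs Swiss-Prot": (27948, 76573),
    "singlets vs TrEMBL": (48827, 76573),
}

for label, (matches, total) in REPORTED.items():
    print(label, percent(matches, total))
```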


Microsatellite Analysis

The type and frequency of simple sequence repeats in this unigene assembly (v3) were determined using the program. For these searches, SSRs are defined as dinucleotides repeated at least 5 times, trinucleotides repeated at least 4 times, tetranucleotides repeated at least 3 times, or pentanucleotides repeated at least 3 times.
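The SSR definition above can be expressed as a regular-expression search. The following is a minimal sketch of that definition, not the program used for this analysis. It additionally assumes that a motif which is itself a repeat of a shorter unit (for example ATAT, or a homopolymer such as AA) is not counted at the longer motif length.

```python
import re

# Minimum repeat counts from the text: motif length -> minimum repeats.
MIN_REPEATS = {2: 5, 3: 4, 4: 3, 5: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(sequence):
    """Return (motif, repeat_count, start) for each SSR in the sequence."""
    sequence = sequence.upper()
    found = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # A motif of motif_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # assumption: skip e.g. ATAT (really AT) and AA
            repeats = len(match.group(0)) // motif_len
            found.append((motif, repeats, match.start()))
    return found


if __name__ == "__main__":
    print(find_ssrs("CCC" + "AG" * 5 + "TTT"))  # [('AG', 5, 3)]
    print(find_ssrs("ATG" * 4))                 # [('ATG', 4, 0)]
```

Counting the results by motif length over all sequences would give tables of the same shape as the motif-type frequencies reported below.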

 Sequence information  ESTs  Contigs
 Number of Sequences  359001  13764
 Number of Sequences Having One Or More SSRs  71462  5899
 Percentage of Sequences Having One Or More SSRs  19.9%  42.9%
 Total Number of SSRs Found  91487  4104
 Number of Motifs  799  392

Frequency of Motif Type - ESTs

 Motif Length  Frequency  Percentage Frequency
 2bp  38302  41.9%
 3bp  38034  41.6%
 4bp  11785  12.9%
 5bp  3361  3.7%

Frequency of Motif Type - Contigs

 Motif Length  Frequency  Percentage Frequency
 2bp  2749  46.6%
 3bp  2115  35.9%
 4bp  781  13.2%
 5bp  254  4.3%




Contact Details
 Name  Main, Dorrie
 Lab  Department of Horticulture
 Organization  Washington State University
 Address  45 Johnson Hall, Pullman, WA 99164
 Telephone  509-335-2774
 Fax  509-335-8690



No publications are currently available.

Contig GO Terms

The GO terms were determined by comparing the contigs against Swiss-Prot using BLAST. The Sprot2GO annotation file was then used to map GO terms to the sequences using relevant matches (E-value of 1e-9 or better).
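The mapping step can be pictured as a lookup from each contig's Swiss-Prot match to the GO terms of that protein, followed by a count of contigs per GO category. The sketch below assumes the BLAST matches and the annotation file have already been loaded into dictionaries; the accession names are hypothetical, and the real layout of the Sprot2GO file is not described in the text.

```python
from collections import defaultdict


def contigs_per_aspect(best_hits, accession_to_go, cutoff=1e-9):
    """Count contigs that receive at least one GO term in each category.

    best_hits: contig -> (swissprot_accession, evalue)
    accession_to_go: accession -> list of (go_id, category)
    """
    per_aspect = defaultdict(set)
    for contig, (accession, evalue) in best_hits.items():
        if evalue > cutoff:
            continue  # keep only relevant matches (1e-9 or better)
        for _go_id, aspect in accession_to_go.get(accession, []):
            per_aspect[aspect].add(contig)
    return {aspect: len(contigs) for aspect, contigs in per_aspect.items()}


# Hypothetical inputs for illustration (ACC_A and ACC_B are not real accessions).
BEST_HITS = {
    "Contig1": ("ACC_A", 2e-35),
    "Contig2": ("ACC_B", 1e-12),
    "Contig3": ("ACC_A", 0.01),  # fails the cutoff, so it gets no terms
}
ACCESSION_TO_GO = {
    "ACC_A": [("GO:0005524", "molecular_function"),
              ("GO:0005634", "cellular_component")],
    "ACC_B": [("GO:0006412", "biological_process"),
              ("GO:0005524", "molecular_function")],
}

if __name__ == "__main__":
    print(contigs_per_aspect(BEST_HITS, ACCESSION_TO_GO))
```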

4054 Contigs have Biological Process annotation.

4656 Contigs have Cellular Component annotation.

5865 Contigs have Molecular Function annotation.


eSNP Summary

An eSNP summary is not available for Rosaceae Assembly v3.