Rosaceae Family Unigene v4

 Analysis Name  Rosaceae Family Unigene v4
 Source  GenBank Rosaceae ESTs (Jan 1, 2008)
 Date performed  2008-01-01

Many sequencing projects around the world deposit Rosaceae ESTs in the NCBI dbEST database. The Rosaceae ESTs included in this assembly were downloaded on January 1, 2008. Not all of these ESTs are of high quality. To filter them, we cross-matched the public sequences against NCBI's UniVec database and used BLAST sequence similarity searches to remove species-specific chloroplast, mitochondrial, tRNA, and rRNA sequences.

To reduce redundancy and create longer transcripts, we assembled these ESTs using the CAP3 program [1]. For some sequences, we were able to obtain the original trace files and incorporate the phred quality values for each base into the assembly.

The final assembly has been annotated by BLAST sequence similarity searching [2] against Swiss-Prot [3] and TrEMBL [3]. We will also provide homology information for TAIR [4] Arabidopsis proteins, as well as Poplar and Vitis proteins, in the near future.

For more information on this project please contact the GDR development team.


Processing Summary
 Number of ESTs available  407742
 Number of ESTs available after filtering  397223
 Average Length  584
 Number of Contigs (CAP3 Assembly, -p 90)  13549
 Average Length of Contigs  1036
 Number of Singlets  82368
 Number of Putative Unigenes  95917
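
The totals in this table can be cross-checked with simple arithmetic: the putative unigene count is the number of contigs plus the number of singlets. The short sketch below only re-derives figures from the table; the variable names are illustrative and are not part of the original pipeline.

```python
# Figures copied from the Processing Summary table.
ests_available = 407742
ests_after_filtering = 397223
contigs = 13549
singlets = 82368

# Putative unigenes = contigs + singlets.
putative_unigenes = contigs + singlets

# ESTs removed by the vector / organelle / tRNA / rRNA filtering step.
removed_by_filtering = ests_available - ests_after_filtering
percent_removed = 100 * removed_by_filtering / ests_available

print(putative_unigenes)          # 95917
print(removed_by_filtering)       # 10519
print(round(percent_removed, 1))  # 2.6
```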


  1. Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Research, 9, 868-877.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215(3):403-10.
  3. Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., and Schneider M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research. 31:365-370.
  4. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31(1):224-8.


Library Information
The Rosaceae ESTs used for this assembly were downloaded on January 1, 2008.


 EST Libraries
 Number of ESTs available  413117
 # of Species  23
 # of Libraries  189
 # of Tissues  91
 # of Development Stages  68



 Fragaria vesca  42736
 Fragaria vesca subsp. vesca  2716
 Fragaria x ananassa  5430
 Malus hybrid rootstock  321
 Malus sieboldii  1210
 Malus pumila  8
 Malus x domestica  255447
 Malus x domestica x Malus sieversii  3944
 Prunus armeniaca  15105
 Prunus avium  21
 Prunus avium x cerasus x canescens  89
 Prunus cerasus  1255
 Prunus dulcis  3864
 Prunus domestica  54
 Prunus persica  70939
 Pyrus communis  244
 Pyrus communis x Pyrus ussuriensis  82
 Pyrus pyrifolia  15
 Rosa chinensis  1794
 Rosa hybrid cultivar  5563
 Rosa wichurana  1932
 Rubus idaeus  327
 Rubus idaeus subsp. strigosus  21



Homology was determined by running BLASTx with the Rosaceae contigs and singlets as queries against the Swiss-Prot and TrEMBL databases. Only matches with an E-value of 1.0e-6 or better were recorded. Swiss-Prot is a curated protein database with a high level of annotation and minimal redundancy. TrEMBL is a computer-annotated supplement to Swiss-Prot that contains the translations of all EMBL nucleotide sequence entries not yet integrated into Swiss-Prot.
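
The E-value cutoff described above can be illustrated with a small filter over tabular BLAST output. This is a minimal sketch, not the pipeline actually used: it assumes the common 12-column tab-separated layout with the E-value in column 11, and the query names, accessions, and scores in `SAMPLE` are hypothetical.

```python
E_VALUE_CUTOFF = 1.0e-6  # threshold stated in the text

def filter_hits(lines, cutoff=E_VALUE_CUTOFF):
    """Yield (query, subject, evalue) for hits at or below the cutoff.

    Assumes 12-column tab-separated BLAST output with the E-value
    in column 11; the format used for this assembly is not documented.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= cutoff:
            yield fields[0], fields[1], evalue

# Hypothetical example rows (IDs and values invented for illustration).
SAMPLE = [
    "Contig1\tP12345\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180",
    "Contig2\tQ99999\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.002\t35",
    "Singlet7\tP67890\t60.0\t120\t48\t0\t1\t360\t1\t120\t1e-06\t55",
]

kept = list(filter_hits(SAMPLE))
print([query for query, _, _ in kept])  # ['Contig1', 'Singlet7']
```

A hit exactly at the cutoff is kept here, matching the "1.0e-6 or better" wording.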

 Homology of Rosaceae Contigs
 Number of Contigs  13549
 Number (%) of Contigs with a Match in Swiss-Prot Database  8992 (66.4%)
 Number (%) of Contigs with a Match in TrEMBL Database  12045 (88.9%)


 Homology of Rosaceae Singlets
 Number of Singlets  82368
 Number (%) of Singlets with a Match in Swiss-Prot Database  33128 (40.2%)
 Number (%) of Singlets with a Match in TrEMBL Database  50052 (60.8%)


Microsatellite Analysis

The type and frequency of simple sequence repeats (SSRs) in this unigene assembly (v4) were determined using the program. For these searches, SSRs are defined as dinucleotides repeated at least 5 times, trinucleotides repeated at least 4 times, tetranucleotides repeated at least 3 times, or pentanucleotides repeated at least 3 times.
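
The SSR definition above can be expressed as a regular-expression search. The following is a minimal sketch under stated assumptions, not the program used for this analysis: `find_ssrs` is a hypothetical helper, and skipping motifs that are themselves repeats of a shorter unit (for example "AA" or "ATAT") is an assumption made here to avoid counting the same repeat twice.

```python
import re

# Minimum repeat counts per motif length, as defined in the text.
MIN_REPEATS = {2: 5, 3: 4, 4: 3, 5: 3}

def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) tuples."""
    seq = seq.upper()
    found = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # ([ACGT]{n}) captures one motif; \1{m,} requires m or more further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Assumption: skip motifs built from a shorter repeating unit
            # (e.g. "AA", "ATAT"), which belong to a shorter motif class.
            if motif in (motif * 2)[1:-1]:
                continue
            copies = len(match.group(0)) // motif_len
            found.append((match.start(), motif, copies))
    return sorted(found)

example = "GGCATATATATATGGCCAGCAGCAGCAGTT"
print(find_ssrs(example))  # [(3, 'AT', 5), (16, 'CAG', 4)]
```

Because skipped matches still consume sequence during the scan, this sketch can miss a valid repeat that directly abuts a mononucleotide run; a production tool would need to handle that case.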

 Sequence information
 Number of Sequences  397223
 Number of Sequences Having One Or More SSRs  77538
 Percentage of Sequences Having One Or More SSRs  19.5%
 Total Number of SSRs Found  99104
 Number of Motifs  818


Frequency of Motif Type
 Motif Length  Frequency  Percentage Frequency
 2bp  39680  40.0%
 3bp  42540  42.9%
 4bp  13000  13.1%
 5bp  3884  3.9%



Contact Details
 Name  Main, Dorrie
 Lab  Department of Horticulture
 Organization  Washington State University
 Address  45 Johnson Hall, Pullman, WA 99164
 Telephone  509-335-2774
 Fax  509-335-8690



No publications are currently available.

Contig GO Terms

The GO terms were determined by comparing the contigs against Swiss-Prot using BLAST. The Sprot2GO annotation file was then used to map GO terms to the sequences using relevant matches (1e-9).
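
The mapping step can be sketched as a lookup from Swiss-Prot accession to GO terms. This is an illustration only: the layout of the Sprot2GO file is not described here, so the mapping is represented as a plain dictionary, the accessions and their GO assignments are hypothetical, and reading "1e-9" as an E-value cutoff is an assumption.

```python
from collections import defaultdict

GO_E_VALUE_CUTOFF = 1e-9  # assumed to be an E-value cutoff

def map_go_terms(hits, accession_to_go, cutoff=GO_E_VALUE_CUTOFF):
    """Collect GO terms per contig from Swiss-Prot hits at or below the cutoff.

    hits: iterable of (contig, swissprot_accession, evalue)
    accession_to_go: dict mapping accession -> list of GO term IDs
    """
    contig_go = defaultdict(set)
    for contig, accession, evalue in hits:
        if evalue <= cutoff:
            contig_go[contig].update(accession_to_go.get(accession, ()))
    # Drop contigs that ended up with no GO terms.
    return {c: sorted(terms) for c, terms in contig_go.items() if terms}

# Hypothetical inputs for illustration.
hits = [
    ("Contig1", "P12345", 1e-50),
    ("Contig1", "P67890", 1e-20),
    ("Contig2", "Q99999", 1e-7),   # fails the 1e-9 cutoff
]
accession_to_go = {
    "P12345": ["GO:0005524"],
    "P67890": ["GO:0016301", "GO:0005524"],
    "Q99999": ["GO:0003677"],
}

contig_terms = map_go_terms(hits, accession_to_go)
print(contig_terms)  # {'Contig1': ['GO:0005524', 'GO:0016301']}
```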

eSNP Summary

The eSNP summary is not available for Rosaceae Assembly v4.