Rosaceae Family Unigene v5.0

Analysis Name Rosaceae Family Unigene v5.0
Source Genbank Rosaceae ESTs (July 1, 2012)
Date performed 2012-12-19

This is the fifth version of the Rosaceae unigene. Many sequencing projects around the world are depositing ESTs from Rosaceae in the NCBI dbEST database. The Rosaceae ESTs included in this assembly were downloaded on July 1, 2012.

Not all of the Rosaceae ESTs are of high quality. To filter them, we crossmatched the public sequences against NCBI's UniVec database and used the BLAST sequence similarity algorithm to remove species-specific chloroplast, mitochondrial, tRNA, and rRNA sequences. To reduce redundancy and create longer transcripts, we assembled these ESTs using the CAP3 [1] program. The final assembly has been annotated by BLAST sequence similarity searching against Swiss-Prot [2], TrEMBL [3], TAIR [4] Arabidopsis proteins, Prunus persica [5], Populus trichocarpa [6], and Vitis vinifera [7].

Processing Summary
Number of ESTs available 518,586
Number of ESTs available after filtering 503,851
Average Length 526
Number of Contigs (CAP3 Assembly, -p 90) 33,916
Average Length of Contigs 908
Number of Singlets 166,551
Number of Putative Unigenes 200,467
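
The figures in the summary appear to follow the usual EST-assembly convention in which the putative unigene count is the number of contigs plus the number of singlets. The short check below uses only the numbers from the table; the formula itself is an assumption, since the page does not state it explicitly.

```python
# Figures copied from the Processing Summary table.
ests_available = 518_586
ests_after_filtering = 503_851
contigs = 33_916
singlets = 166_551

# Assumption: putative unigenes = contigs + singlets
# (the common convention; not stated explicitly on this page).
putative_unigenes = contigs + singlets

# ESTs discarded by the vector/organelle/rRNA/tRNA filtering step.
ests_removed = ests_available - ests_after_filtering

print(f"Putative unigenes: {putative_unigenes:,}")
print(f"ESTs removed by filtering: {ests_removed:,}")
```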


  1. Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Research, 9, 868-877.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215(3):403-10.
  3. Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., and Schneider M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research. 31:365-370.
  4. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research. 31(1):224-8.
  6. Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A, Schein J, Sterck L, Aerts A, Bhalerao RR, Bhalerao RP, Blaudez D, Boerjan W, Brun A, Brunner A, Busov V, Campbell M, Carlson J, Chalot M, Chapman J, Chen GL, Cooper D, Coutinho PM, Couturier J, Covert S, Cronk Q, Cunningham R, Davis J, Degroeve S, Déjardin A, Depamphilis C, Detter J, Dirks B, Dubchak I, Duplessis S, Ehlting J, Ellis B, Gendler K, Goodstein D, Gribskov M, Grimwood J, Groover A, Gunter L, Hamberger B, Heinze B, Helariutta Y, Henrissat B, Holligan D, Holt R, Huang W, Islam-Faridi N, Jones S, Jones-Rhoades M, Jorgensen R, Joshi C, Kangasjärvi J, Karlsson J, Kelleher C, Kirkpatrick R, Kirst M, Kohler A, Kalluri U, Larimer F, Leebens-Mack J, Leplé JC, Locascio P, Lou Y, Lucas S, Martin F, Montanini B, Napoli C, Nelson DR, Nelson C, Nieminen K, Nilsson O, Pereda V, Peter G, Philippe R, Pilate G, Poliakov A, Razumovskaya J, Richardson P, Rinaldi C, Ritland K, Rouzé P, Ryaboy D, Schmutz J, Schrader J, Segerman B, Shin H, Siddiqui A, Sterky F, Terry A, Tsai CJ, Uberbacher E, Unneberg P, Vahala J, Wall K, Wessler S, Yang G, Yin T, Douglas C, Marra M, Sandberg G, Van de Peer Y, Rokhsar D. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). (2006) Science. Sep 15; 313(5793):1596-604
  7. French-Italian Public Consortium for Grapevine Genome Characterization. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. Sep 27; 449(7161):463-7.

Library Information
The Rosaceae ESTs used for this assembly were downloaded on July 1, 2012.


EST Libraries
Number of ESTs available 518,586
# of Species 34
# of Libraries 326
# of Tissues 102
# of Development Stages 99



Cydonia oblonga    3
Drymocallis fissa    6
Eriobotrya japonica    4
Fragaria chiloensis    137
Fragaria vesca    44,979
Fragaria vesca subsp. vesca    2,824
Fragaria x ananassa    10,855
Malus hybrid rootstock    4,804
Malus prunifolia    150
Malus pumila    51
Malus sieboldii    1,210
Malus x domestica    323,531
Malus x domestica x Malus sieversii    4,944
Photinia serratifolia    44
Potentilla indica    1
Prunus armeniaca    15,105
Prunus avium    6,035
Prunus avium x P. cerasus x P. canescens    89
Prunus cerasus    1,255
Prunus domestica    54
Prunus dulcis    3,864
Prunus mume    4,589
Prunus persica    79,815
Pyrus communis    450
Pyrus communis x Pyrus ussuriensis    82
Pyrus pyrifolia    699
Pyrus pyrifolia var. culta    1
Pyrus x bretschneideri    636
Rosa chinensis    1,794
Rosa hybrid cultivar    5,578
Rosa lucieae    1,936
Rubus idaeus    327
Rubus idaeus subsp. strigosus    56
Rubus ulmifolius var. inermis x Rubus thyrsiger    2,678



Homology was determined using the BLASTx algorithm for the Rosaceae contigs and singlets vs. the Swiss-Prot, TrEMBL, TAIR Arabidopsis, Prunus persica, Populus trichocarpa, and Vitis vinifera proteins. Only matches with an E-value of 1.0e-6 or better were recorded. Swiss-Prot is a curated protein database with a high level of annotation and a minimal level of redundancy, and TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot. The homology results for the Rosaceae unigene are available as Excel spreadsheets from the Downloads.
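
The E-value cutoff described above can be illustrated with a small filter over BLAST tabular output. This is only a sketch: it assumes the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), and it picks the "best hit" as the lowest E-value with ties broken by bit score, which the page does not specify. The function name `best_hits` and the sequence IDs in the sample are hypothetical.

```python
import csv
import io

MAX_EVALUE = 1e-6  # cutoff stated in the text: 1.0e-6 or better


def best_hits(tabular_text, max_evalue=MAX_EVALUE):
    """Return {query_id: (subject_id, evalue, bitscore)} with one best hit per query.

    Assumes 12-column BLAST tabular output; the best-hit rule
    (lowest E-value, then highest bit score) is an assumption.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # weaker than the recorded threshold
        current = best.get(query)
        if current is None or (evalue, -bitscore) < (current[1], -current[2]):
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical sample rows for illustration only.
sample = "\n".join([
    "Contig1\tP12345\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180",
    "Contig1\tQ99999\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-20\t90",
    "Contig2\tP54321\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.001\t35",
])
hits = best_hits(sample)
print(hits)
```

In this sample, Contig2 is dropped because its only hit (E-value 0.001) does not meet the cutoff, and Contig1 keeps the hit with the lower E-value.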

Microsatellite Analysis

The type and frequency of simple sequence repeats (SSRs) in the Rosaceae unigene v5.0 contigs were determined using the program. For these searches, SSRs are defined as dinucleotides repeated at least 5 times, trinucleotides repeated at least 4 times, tetranucleotides repeated at least 3 times, or pentanucleotides repeated at least 3 times. The SSRs of the Rosaceae unigene v5.0 contigs can be downloaded from the Downloads.
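
The repeat thresholds above can be expressed as a simple scanner. The sketch below is not the program used for this analysis; it is a minimal illustration of the stated SSR definition. Two choices are assumptions: motifs that are themselves repeats of a shorter unit (for example "AA" or "ATAT") are skipped so that homopolymer runs and dinucleotide repeats are not double-counted, and only the bases A, C, G, and T are considered. The names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical.

```python
# Minimum repeat counts per motif length, as defined in the text.
MIN_REPEATS = {2: 5, 3: 4, 4: 3, 5: 3}


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. 'AA', 'ATAT')."""
    n = len(unit)
    return not any(n % k == 0 and unit[:k] * (n // k) == unit for k in range(1, n))


def find_ssrs(seq):
    """Return a list of (start, motif, repeat_count) tuples; start is 0-based."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in MIN_REPEATS.items():
        i = 0
        while i + unit_len <= len(seq):
            unit = seq[i:i + unit_len]
            # Skip motifs with ambiguous bases or that reduce to a shorter motif.
            if set(unit) - set("ACGT") or not is_primitive(unit):
                i += 1
                continue
            # Count how many times the unit repeats in tandem from position i.
            reps = 1
            while seq.startswith(unit, i + reps * unit_len):
                reps += 1
            if reps >= min_rep:
                found.append((i, unit, reps))
                i += reps * unit_len  # continue after the repeat tract
            else:
                i += 1
    return found


# Hypothetical test sequence: (AT)6, (AAG)4, and a homopolymer run that is ignored.
example = "GGC" + "AT" * 6 + "CCT" + "AAG" * 4 + "T" + "A" * 12
ssrs = find_ssrs(example)
print(ssrs)
```

Results are grouped by motif length first and by position within each length, because the scan runs once per motif length.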

Sequence information
Number of Sequences 33,916
Number of Sequences Having One Or More SSRs 9,190
Percentage of Sequences Having One Or More SSRs 27.10%
Total Number of SSRs Found 12,682
Number of Motifs 543


Frequency of Motif Type

Motif Length    Frequency    Percentage Frequency
2bp    5166    40.73%
3bp    5208    41.07%
4bp    1786    14.08%
5bp    522    4.12%
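
The percentages in the two SSR tables can be reproduced from the raw counts. The sketch below assumes that "Percentage Frequency" is each motif-length count divided by the total number of SSRs, and that the percentage of sequences with SSRs is taken over all 33,916 contigs; both readings are consistent with the reported values.

```python
# Counts copied from the tables above.
motif_counts = {"2bp": 5166, "3bp": 5208, "4bp": 1786, "5bp": 522}
sequences_total = 33_916
sequences_with_ssr = 9_190

# Total SSRs is the sum over motif lengths.
total_ssrs = sum(motif_counts.values())

# Share of each motif length among all SSRs, rounded to two decimals.
percentages = {
    motif: round(100 * count / total_ssrs, 2)
    for motif, count in motif_counts.items()
}

# Share of contigs carrying at least one SSR.
seqs_with_ssr_pct = round(100 * sequences_with_ssr / sequences_total, 2)

for motif, pct in percentages.items():
    print(f"{motif}: {pct:.2f}%")
print(f"Sequences with one or more SSRs: {seqs_with_ssr_pct:.2f}%")
```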


Contact Details
Name Main, Dorrie
Lab Department of Horticulture
Organization Washington State University
Address 45 Johnson Hall, Pullman, WA 99164
Telephone 509-335-2774
Fax 509-335-8690


Original EST sequences from NCBI (518,586 sequences) Rosaceae_est_NCBI_091212.fasta
Filtered and trimmed EST sequences (503,851 sequences) Rosaceae.trimmedESTs.fasta
Contigs from CAP3 assembly (33,916 contig sequences) Rosaceae.cap.contigs
Ace file from CAP3 assembly Rosaceae.cap.ace
SSRs found in contigs (with primer predictions) Rosaceae_ssrReport.xls

Rosaceae unigene v5.0 contigs blastx vs protein databases. Best hit reports in Excel

BLAST of contigs to UniProtKB/Swiss-Prot (2.8MB) Rosaceae_blastx_Swiss-Prot.xlsx
BLAST of contigs to UniProtKB/TrEMBL (3.6MB) Rosaceae_blastx_TrEMBL.xlsx
BLAST of contigs to TAIR10 Arabidopsis proteins (3.6MB) Rosaceae_blastx_TAIR10.xlsx
BLAST of contigs to Prunus persica (peach) v1.0 proteins (3.2MB) Rosaceae_blastx_peach.xlsx
BLAST of contigs to Vitis vinifera (grape) proteins (3.1MB) Rosaceae_blastx_grape.xlsx
BLAST of contigs to Populus trichocarpa (poplar) v2.0 proteins (3.2MB) Rosaceae_blastx_poplar.xlsx



KEGG analysis of Rosaceae unigene v5.0 contigs

All Rosaceae unigene v5.0 contigs were uploaded to the KEGG / KAAS server. The SBH (single-directional best hit) method was selected under the category "Assignment method". All other settings were left at their defaults. Results were downloaded in the heir.tar.gz hierarchy file and uploaded to the website.