Malus Unigene v3

Analysis Name  Malus Unigene v3
Source  GenBank Malus ESTs (Aug 1, 2006)
Date performed  2006-08-01

Many sequencing projects around the world are depositing ESTs from the genus Malus in the NCBI dbEST database. All the Malus ESTs from GenBank on July 14, 2006 were included in this assembly. However, not all of these ESTs are of high quality. To filter them, we cross-matched the public sequences against NCBI's UniVec database to screen for vector contamination, and used the BLAST sequence similarity algorithm to remove species-specific chloroplast, mitochondrial, tRNA, and rRNA sequences.

To reduce redundancy and create longer transcripts, we assembled these ESTs using the CAP3 [1] program with the parameter -p 90. CAP3 outputs assembled contigs and singlets; the set of tentative unigenes for this assembly is the combination of the two. For some sequences, we were able to obtain the original trace files and incorporate the phred quality values for each base into the assembly. The final assembly has been annotated by BLAST sequence similarity searching [2] against Swiss-Prot [3], TrEMBL [3], and TAIR's [4] Arabidopsis proteins.
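As a rough illustration of how the unigene total follows from the CAP3 output, the sketch below counts FASTA records in a contigs file and a singlets file and adds them. The file names are hypothetical; it assumes both outputs are plain FASTA files in which each sequence starts with a `>` header line.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_unigenes(contigs_path, singlets_path):
    """Return (contigs, singlets, putative unigenes) for one assembly.

    Putative unigenes are taken to be contigs plus singlets, as described
    in the text above.
    """
    contigs = count_fasta_records(contigs_path)
    singlets = count_fasta_records(singlets_path)
    return contigs, singlets, contigs + singlets
```

Applied to the figures reported for this assembly, 23,868 contigs plus 58,982 singlets gives 82,850 putative unigenes.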

For more information on this project please contact the GDR development team.

 Processing Summary
 Number of ESTs available  254,087
 Number of ESTs available after filtering  250,907
 Average Length  583.3
 Number of Contigs (CAP3 Assembly, -p 90)  23,868
 Average Length of Contigs  850
 Number of Singlets  58,982
 Number of Putative Unigenes  82,850



  1. Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Research, 9, 868-877.
  2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215(3):403-10.
  3. Boeckmann B., Bairoch A., Apweiler R., Blatter M.-C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., Pilbout S., and Schneider M. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research. 31:365-370.
  4. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31(1):224-8.


Library Information
The Malus ESTs used for this assembly were downloaded on June 14th, 2006


 EST Libraries
 Number of ESTs available  250907
 # of Species  4
 # of Libraries  97
 # of Tissues  18
 # of Development Stages  33


 Malus hybrid rootstock  320
 Malus sieboldii  1126
 Malus x domestica  245545
 Malus x domestica x Malus sieversii  3916



Homology was determined using the BLASTx algorithm for the Malus contigs and singlets vs. the Swiss-Prot and TrEMBL databases. Only matches with an E-value of 1.0e-9 or better were recorded. Swiss-Prot is a curated protein database with a high level of annotation and a minimal level of redundancy. TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.
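The E-value cutoff can be illustrated with a short sketch. It assumes the BLASTx results are available in the standard 12-column tab-delimited BLAST format, where the first column is the query ID and the eleventh is the E-value; the function names are illustrative and not part of the original pipeline.

```python
E_VALUE_CUTOFF = 1e-9  # "1.0e-9 or better" means E-value <= 1e-9


def queries_with_match(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return the set of query IDs with at least one hit at or below cutoff.

    Assumes 12-column tab-delimited BLAST output (query ID in column 1,
    E-value in column 11). Comment and blank lines are skipped.
    """
    matched = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= cutoff:
            matched.add(query_id)
    return matched


def percent(count, total):
    """Percentage of sequences with a match, rounded to one decimal place."""
    return round(100.0 * count / total, 1)
```

With the contig figures reported for this assembly, `percent(13340, 23868)` gives 55.9, which agrees with the Swiss-Prot match rate listed for the contigs.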

 Homology of Malus Contigs
 Number of Contigs  23868
 Number (%) of Contigs with a Match in Swiss-Prot Database  13340 (55.9%)
 Number (%) of Contigs with a Match in TrEMBL Database  20564 (86.2%)


 Homology of Malus Singlets
 Number of Singlets  58982
 Number (%) of Singlets with a Match in Swiss-Prot Database  21842 (37.0%)
 Number (%) of Singlets with a Match in TrEMBL Database  37101 (62.9%)


Microsatellite Analysis

The type and frequency of simple sequence repeats in this unigene assembly (v3) were determined using the program. For these searches, SSRs are defined as dinucleotides repeated at least 5 times, trinucleotides repeated at least 4 times, tetranucleotides repeated at least 3 times, or pentanucleotides repeated at least 3 times.
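The SSR definition above can be expressed as a small regular-expression search. This is a minimal sketch, not the program used for the analysis. As an added assumption, it skips motifs that are themselves repeats of a shorter unit (for example `AA` or `ATAT`), so a homopolymer run is not reported as a dinucleotide repeat.

```python
import re

# (motif length, minimum number of repeats), as defined in the text above
SSR_RULES = [(2, 5), (3, 4), (4, 3), (5, 3)]


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit (our assumption)."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_repeats in SSR_RULES:
        # A motif of `size` bases followed by at least min_repeats - 1 copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

For example, a sequence containing `AC` repeated five times is reported as one dinucleotide SSR, while `AC` repeated only four times falls below the threshold and is not reported.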


 Sequence information
 Number of Sequences  250907
 Number of Sequences Having One Or More SSRs  46663
 Percentage of Sequences Having One Or More SSRs  18.6%
 Total Number of SSRs Found  58319
 Number of Motifs  657


Frequency of Motif Type


 Motif Length  Frequency  Percentage Frequency
 2bp  27350  46.9%
 3bp  21818  37.4%
 4bp  7437  12.8%
 5bp  1714  2.9%


 Contact Details
 Name  Main, Dorrie
 Lab  Department of Horticulture
 Organization  Washington State University
 Address  45 Johnson Hall, Pullman, WA 99164
 Telephone  509-335-2774
 Fax  509-335-8690



No publications are currently available.

Contig GO Terms

The GO terms were determined by comparing the contigs against Swiss-Prot using BLAST. The Sprot2GO annotation file was then used to map GO terms to the sequences using relevant matches (E-value of 1e-9 or better).
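The mapping step can be pictured as a lookup from each contig's Swiss-Prot match to the GO terms attached to that protein. The sketch below is illustrative only: the in-memory dictionaries stand in for the BLAST results and the Sprot2GO file, whose actual format is not described here, and all identifiers in the usage example are hypothetical.

```python
from collections import defaultdict


def contigs_per_aspect(best_hits, accession_to_go):
    """Count contigs that have at least one GO term in each GO aspect.

    best_hits: dict mapping contig ID -> Swiss-Prot accession (assumed to be
        already filtered at the E-value cutoff).
    accession_to_go: dict mapping accession -> list of (go_id, aspect) pairs.
    """
    contigs_by_aspect = defaultdict(set)
    for contig, accession in best_hits.items():
        for _go_id, aspect in accession_to_go.get(accession, []):
            contigs_by_aspect[aspect].add(contig)
    return {aspect: len(ids) for aspect, ids in contigs_by_aspect.items()}


# Hypothetical example data
hits = {"Contig1": "ACC1", "Contig2": "ACC2", "Contig3": "ACC9"}
go_map = {
    "ACC1": [("GO:0008150", "Biological Process"),
             ("GO:0003674", "Molecular Function")],
    "ACC2": [("GO:0003674", "Molecular Function")],
}
summary = contigs_per_aspect(hits, go_map)
```

A contig can be counted under more than one aspect, which is why the three per-aspect totals need not sum to the number of annotated contigs.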


6492 Contigs have Biological Process annotation:

7570 Contigs have Cellular Component annotation:

9322 Contigs have Molecular Function annotation:










ESNP Summary

The type and frequency of single nucleotide polymorphisms in this unigene assembly (v3) were determined using the autoSNP software package (Savage et al., 2005).



 SNP Summary
 Number of Contigs  23868
 Number of SNPs  14298
 Consensus Size  20360530 bp
 SNP Frequency  0.07/100 bp
 Total Transitions  7060
 Total Transversions  3836
 Total Indels  3402
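The summary figures above can be cross-checked with a few lines of arithmetic. The sketch below is not part of autoSNP; it shows how a substitution is conventionally classified as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion, and how the SNP frequency per 100 bp follows from the SNP count and the consensus size.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def snps_per_100bp(n_snps, consensus_bp):
    """SNP frequency expressed as SNPs per 100 bp of consensus sequence."""
    return 100.0 * n_snps / consensus_bp
```

With the values in the table, 7060 transitions, 3836 transversions, and 3402 indels sum to 14298 SNPs, and `snps_per_100bp(14298, 20360530)` rounds to 0.07.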

