Description of Sequence Datasets in GDR

Curated Genes:
- GDR Gene Database: A single non-redundant list of Rosaceae genes with gene symbols. Most entries are parsed from the NCBI nucleotide database, but the list also includes some user-contributed data. Multiple gene sequences from different sources are associated with their respective genes in the GDR Gene Database.
- NCBI Rosaceae gene and mRNA sequences: All gene and mRNA sequences parsed from the NCBI nucleotide database. Gene and mRNA sequences from NCBI for all species of Prunus, Malus, Fragaria, Pyrus and Rubus are anchored to the P. persica genome v1.0, Malus × domestica genome v1.0p, F. vesca genome v1.1, Pyrus communis genome v1.0 and Rubus occidentalis genome v1.0, respectively, using BLAT with thresholds of >98% percent identity (PID) and >95% aligned length.
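The anchoring thresholds can be illustrated with a small filter over BLAT's tab-separated PSL output. This is only a sketch: the PSL column order is standard, but the exact formulas GDR uses for PID and aligned length are not stated here, so the definitions below (matching bases over aligned bases, and aligned query span over query length) are assumptions, as are the names PslHit, parse_psl_line and passes_thresholds.

```python
from typing import NamedTuple


class PslHit(NamedTuple):
    """Hypothetical container for the fields needed by the filter."""
    q_name: str
    t_name: str
    pid: float          # percent identity (assumed definition, see below)
    aligned_pct: float  # percent of the query covered (assumed definition)


def parse_psl_line(line: str) -> PslHit:
    """Parse one data line of a PSL file (header lines must be skipped first)."""
    f = line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_name, q_size = f[9], int(f[10])
    q_start, q_end = int(f[11]), int(f[12])
    t_name = f[13]

    # Assumption: PID = matching bases / aligned bases, ignoring gaps.
    aligned_bases = matches + mismatches + rep_matches
    pid = 100.0 * (matches + rep_matches) / aligned_bases if aligned_bases else 0.0

    # Assumption: aligned length = aligned query span / full query length.
    aligned_pct = 100.0 * (q_end - q_start) / q_size if q_size else 0.0
    return PslHit(q_name, t_name, pid, aligned_pct)


def passes_thresholds(hit: PslHit, min_pid: float = 98.0,
                      min_aligned: float = 95.0) -> bool:
    """Keep hits strictly above both thresholds, as in '>98%' and '>95%'."""
    return hit.pid > min_pid and hit.aligned_pct > min_aligned
```

A line from a PSL file would be passed to parse_psl_line and kept only if passes_thresholds returns True; a sequence with several passing hits would still need a rule for choosing among them, which is not covered here.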
Predicted Genes: Genes and mRNAs from whole genome assemblies. The GDR team adds further annotation to these predicted genes: computational annotation by homology to genes of closely related or model plant species, and assignment of InterPro protein domains, GO terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and ortholog terms.
RefTrans: RefTrans combines published RNA-Seq and EST data sets to create a reference transcriptome for each genus and provides putative gene functions identified by homology to known proteins.
Unigene: EST contigs for the Rosaceae family and each genus, constructed from the publicly available Rosaceae ESTs downloaded from dbEST at NCBI. Unigene construction for the GDR occurs in four steps: (i) sequence filtering and trimming to obtain high-quality sequences, (ii) assembly into contigs to reduce the inherent redundancy, (iii) building unigene sets from the combined contigs and singlets and (iv) sequence annotation.
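Step (iii) above can be sketched as follows: once assembly has grouped some ESTs into contigs, every EST that did not join a contig is a singlet, and the unigene set is the contigs plus those singlets. This is a minimal illustration, not the GDR pipeline; the function name build_unigene_set, the input layout and the identifiers in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List


def build_unigene_set(est_ids: Iterable[str],
                      contigs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Combine contigs and singlets into one unigene set.

    est_ids: all high-quality ESTs that went into assembly.
    contigs: contig ID -> member EST IDs, as reported by the assembler.
    Returns: unigene ID -> member EST IDs (a singlet maps to itself).
    """
    assembled = {est for members in contigs.values() for est in members}

    # Start from the contigs, copying member lists so the input is untouched.
    unigenes = {cid: list(members) for cid, members in contigs.items()}

    # Every EST that was not assembled becomes a one-member unigene.
    for est in est_ids:
        if est not in assembled:
            unigenes[est] = [est]
    return unigenes


if __name__ == "__main__":
    ests = ["EST1", "EST2", "EST3", "EST4"]
    unigenes = build_unigene_set(ests, {"Contig1": ["EST1", "EST2"]})
    # Four input ESTs reduce to three unigenes: one contig and two singlets.
    print(len(ests), "ESTs ->", len(unigenes), "unigenes")
```

Comparing the number of input ESTs with the number of unigenes gives a rough measure of how much redundancy the assembly removed.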
Other