Prunus dulcis Texas Genome v2.0
Overview
Publication
Background
Almond is one of the oldest cultivated nut crops, with its origin in central and western Asia. The selection of the sweet type (Prunus dulcis) distinguishes the domesticated almond from its bitter wild relatives. It is economically important, especially in California, which has the highest worldwide production, followed by Spain and Australia. The almond belongs to the same subgenus as the peach, for which a reference genome already exists. However, to fully understand the genetic underpinnings of the key phenotypic differences between almond and peach, we have sequenced the genome of the 'Texas' almond, one of the traditional cultivars producing a sweet nut.
Genome facts and statistics
Texas Almond v2.0 was assembled using two different data types: ~285x coverage of Illumina 2x100 paired-end reads (317 and 354 nt fragment sizes) and 40x coverage of long reads (read N50 > 7 kb) generated by several runs of the Oxford Nanopore Technologies MinION sequencer. First, contaminants were removed from all reads, and the 25% longest nanopore reads (5.7 Gbp with N50 12.8 kb) were error-corrected with Racon and used for assembly with MaSuRCA. Haplotigs were removed using Redundans, and further scaffolding was performed with SSPACE-LR. Comparison to the genetic map (T1E) and synteny with peach were used to identify assembly errors. Mappings of long reads to the genome were used to guide manual revisions of the assembly. The scaffolds were anchored to the genetic map with ALLPATHS. More than 95% of the assembly is anchored. Texas Almond v2.0 currently consists of scaffolds (scaffold N50 381 kb, contig N50 103 kb) that have been placed into 8 pseudomolecules (superscaffolds) representing the 8 chromosomes of almond, numbered according to their corresponding linkage groups. Repeats make up 39.42% of the genome. BUSCO analysis of genome completeness indicates that the assembly is over 96% complete.
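The N50 values quoted above for reads, scaffolds, and contigs all follow the same definition: the length L such that sequences of length L or longer contain at least half of the total bases. The following is a minimal sketch of that calculation, not the pipeline used for this assembly. The function names `fasta_lengths` and `n50` and the toy lengths are illustrative assumptions.

```python
def fasta_lengths(lines):
    """Return the length of each record in FASTA-formatted text lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record (if any) and opens a new one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Toy example (hypothetical lengths, not taken from the almond assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

To apply this to a downloaded assembly, the lines of the FASTA file could be passed to `fasta_lengths` and the result to `n50`. Whether the figure matches the published scaffold N50 depends on which sequences (scaffolds, contigs, or pseudomolecules) the file contains.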
In total, we have annotated 27,042 protein-coding genes, which produce 33,119 transcripts (1.22 transcripts per gene) and encode 31,654 unique protein products. We have been able to assign functional labels to 66.35% of the annotated proteins. The annotated transcripts contain 4.37 exons on average, and 80% of them are multi-exonic. In addition, 6,800 non-coding transcripts have been annotated, of which 3,604 are long and 3,196 are short non-coding RNA genes.
Homology
Homology of the Prunus dulcis Texas Genome v2.0 transcripts was determined by pairwise sequence comparison using the blastx algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr (Release 2018-05) database, and a cutoff of less than 1e-6 for the Arabidopsis proteins (TAIR10), UniProtKB/SwissProt (Release 2018-04), and UniProtKB/TrEMBL (Release 2018-04) databases. The best hit reports are available for download in Excel format.
Protein Homologs
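The E-value cutoffs and best-hit selection described in this section can be sketched as follows. This is only an illustration under stated assumptions: it assumes BLAST results in the standard 12-column tabular layout (E-value in column 11, bit score in column 12), whereas the reports offered here are in Excel format, and the function name `best_hits` and the example identifiers are hypothetical.

```python
def best_hits(lines, max_evalue):
    """Keep, per query, the hit with the lowest E-value below max_evalue.

    Assumes tab-separated BLAST tabular rows:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Strict comparison: the cutoff is "less than" the threshold.
        if evalue >= max_evalue:
            continue
        # Lower E-value wins; ties are broken by higher bit score.
        key = (evalue, -bitscore)
        if query not in best or key < best[query][0]:
            best[query] = (key, subject)
    return {query: subject for query, (_, subject) in best.items()}


# Hypothetical rows; identifiers are placeholders, not real accessions.
rows = [
    "t1\tP1\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-30\t200",
    "t1\tP2\t80.0\t100\t10\t0\t1\t300\t1\t100\t1e-12\t90",
    "t2\tP3\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-7\t45",
]
print(best_hits(rows, 1e-9))  # stricter cutoff, as used for NCBI nr
print(best_hits(rows, 1e-6))  # looser cutoff, as used for TAIR10 and UniProtKB
```

With these rows, the weaker hit for the second query passes only the looser 1e-6 threshold, which shows how the choice of cutoff changes the set of transcripts that receive a best hit.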
Downloads
All annotation files are available for download by selecting the desired data type in the left-hand sidebar. Each data type page provides a description of the available files and links to download them.
Assembly
The Prunus dulcis Texas Genome v2.0 assembly files are available in FASTA and GFF3 formats.
Downloads
Gene Predictions
The Prunus dulcis Texas Genome v2.0 gene prediction files are available in FASTA and GFF3 formats.
Downloads
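As a rough illustration of working with a GFF3 gene prediction file, the sketch below tallies feature types from the third column and derives a transcripts-per-gene ratio. It assumes the file uses the common `gene` and `mRNA` feature types, which is not confirmed here. The function name `count_feature_types` and the example records are hypothetical.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by type (column 3), skipping comments and non-feature lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        # GFF3 feature lines have exactly nine tab-separated columns.
        if len(columns) != 9:
            continue
        counts[columns[2]] += 1
    return counts


# Hypothetical records; IDs and coordinates are placeholders.
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tsrc\tmRNA\t100\t700\t.\t+\t.\tID=g1.t2;Parent=g1",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=g1.t1",
]
counts = count_feature_types(example)
print(counts["gene"], counts["mRNA"], counts["mRNA"] / counts["gene"])
```

Run on the full annotation, the same ratio could be compared with the 1.22 transcripts per gene reported in the overview, provided the file encodes protein-coding transcripts as `mRNA` features.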
Functional Analysis
Functional annotation files for the Prunus dulcis Texas Genome v2.0 are available for download below. The Prunus dulcis Texas Genome proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).
Downloads