Prunus dulcis Texas Genome v2.0

Analysis Name: Prunus dulcis Texas Genome v2.0
Method: MaSuRCA (v3.2.3)
Source: Prunus dulcis Illumina PE library and Oxford Nanopore reads
Date performed: 2018-10-11


These data are released in accordance with the Fort Lauderdale and Toronto agreements by Pere Arús and Tyler Alioto. The producers of these data reserve the right to be the first to publish a genome-wide analysis of the data they have generated. The authors are a consortium of Spanish (CRG, IRTA, CRAG, CITA), Australian (University of Adelaide), US (Washington State University), and French (INRA) research organizations.


Almond is one of the oldest cultivated nut crops, with its origin in central and western Asia. The selection of the sweet type (Prunus dulcis) distinguishes the domesticated almond from its bitter wild relatives. The crop is economically important, especially in California, which has the highest production worldwide, followed by Spain and Australia. The almond belongs to the same subgenus as the peach, for which a reference genome already exists. However, to fully understand the genetic underpinnings of the key phenotypic differences between almond and peach, we have sequenced the genome of the 'Texas' almond, one of the traditional cultivars producing a sweet nut.

Genome facts and statistics

Texas Almond v2.0 was assembled from two data types: ~285x coverage of Illumina 2x100 paired-end reads (317 and 354 nt fragment sizes) and 40x coverage of long reads (read N50 > 7 kb) generated by several runs of the Oxford Nanopore Technologies MinION sequencer. First, contaminants were removed from all reads, and the longest 25% of the nanopore reads (5.7 Gbp with N50 12.8 kb) were error-corrected with Racon and used for assembly with MaSuRCA. Haplotigs were removed using Redundans, and further scaffolding was performed with SSPACE-LR. Comparison to the genetic map (T1E) and synteny with peach were used to identify assembly errors, and mappings of long reads to the genome were used to guide manual revisions of the assembly. The scaffolds were anchored to the genetic map with ALLPATHS; more than 95% of the assembly is anchored. Texas Almond v2.0 currently consists of scaffolds (scaffold N50 381 kb, contig N50 103 kb) that have been placed into 8 pseudomolecules (superscaffolds) representing the 8 chromosomes of almond, numbered according to their corresponding linkage groups. Repeats account for 39.42% of the genome. BUSCO analysis of genome completeness indicates that the assembly is over 96% complete.

In total, we have annotated 27,042 protein-coding genes, which produce 33,119 transcripts (1.22 transcripts per gene) and encode 31,654 unique protein products. We have been able to assign functional labels to 66.35% of the annotated proteins. The annotated transcripts contain 4.37 exons on average, and 80% of them are multi-exonic. In addition, 6,800 non-coding transcripts have been annotated, of which 3,604 and 3,196 are long and short non-coding RNA genes, respectively.
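For readers unfamiliar with the N50 statistic quoted above, the following is a minimal Python sketch of how it is commonly computed from a list of sequence lengths, together with a check of the transcripts-per-gene ratio. The function name n50 and the toy lengths are illustrative assumptions; this is not the pipeline used to produce the published figures.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy lengths for illustration only (not taken from the almond assembly).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))  # 80

# Transcripts per gene, from the counts quoted in the text.
print(round(33119 / 27042, 2))  # 1.22
```

Applied to the scaffold or contig lengths of an assembly, the same function would yield the scaffold N50 or contig N50, respectively.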


The Prunus dulcis Texas Genome v2.0 assembly files are available in FASTA and GFF3 formats.


Chromosome (FASTA file) P.dulcis_v2.0.fasta.gz
Chromosome masked (FASTA file) P.dulcis_v2.0.masked.fasta.gz
Chromosome Repeat Annotation (GFF3 file) P.dulcis_v2.0.repeat.annotation.gff3.gz


Gene Predictions

The Prunus dulcis Texas Genome v2.0 gene prediction files are available in FASTA and GFF3 formats.


Gene sequences (FASTA file) P.dulcis_v2.0.gene.fasta.gz
Gene models (GFF3 file) P.dulcis_v2.0_gene_models.gff3.gz
mRNA sequences (FASTA file) P.dulcis_v2.0_mRNA.fasta.gz
CDS sequences (FASTA file) P.dulcis_v2.0_CDs.fasta.gz
Protein sequences (FASTA file) P.dulcis_v2.0_proteins.fasta.gz
TE sequences (FASTA file) P.dulcis_v2.0_TEannotlib.fasta.gz
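As a quick sanity check after downloading the gene models, the feature types in a gzip-compressed GFF3 file can be tallied with the Python standard library. This is a minimal sketch: the function name count_feature_types is hypothetical, and the assumption that the file labels its features "gene" and "mRNA" has not been verified against the released file.

```python
import gzip
from collections import Counter


def count_feature_types(gff3_path):
    """Count GFF3 feature types (column 3) in a gzip-compressed file."""
    counts = Counter()
    with gzip.open(gff3_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:  # GFF3 records have nine tab-separated columns
                counts[fields[2]] += 1
    return counts


# Example usage (assumes the file has been downloaded to the working directory
# and that it uses the feature types "gene" and "mRNA"):
# counts = count_feature_types("P.dulcis_v2.0_gene_models.gff3.gz")
# print(counts["gene"], counts["mRNA"])
```

If those feature types are present, the two counts can be compared with the gene and transcript totals reported in the genome statistics.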



All annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can use the FTP repository for bulk download.


Homology of the Prunus dulcis Texas Genome v2.0 transcripts was determined by pairwise sequence comparison using the blastx algorithm against various protein databases. An expectation value cutoff of less than 1e-9 was used for NCBI nr (Release 2018-05), and a cutoff of less than 1e-6 was used for the Arabidopsis proteins (TAIR10), UniProtKB/SwissProt (Release 2018-04), and UniProtKB/TrEMBL (Release 2018-04) databases. The best-hit reports are available for download in Excel format.
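The cutoff-and-best-hit logic described above can be illustrated with a short Python sketch. It assumes hits have already been parsed into (query, subject, e-value) tuples and picks the lowest e-value per query; the function name best_hits and the toy rows are hypothetical, and the actual reports may have ranked hits differently (for example, by bit score).

```python
def best_hits(rows, evalue_cutoff):
    """Keep the lowest-e-value hit per query among hits below the cutoff.

    rows: iterable of (query_id, subject_id, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in rows:
        if evalue >= evalue_cutoff:
            continue  # the text specifies a cutoff of "less than" the threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical hits, for illustration only.
toy_rows = [
    ("t1", "protA", 1e-30),
    ("t1", "protB", 1e-12),
    ("t2", "protC", 1e-7),
]
nr_hits = best_hits(toy_rows, 1e-9)    # cutoff quoted for NCBI nr
tair_hits = best_hits(toy_rows, 1e-6)  # cutoff quoted for TAIR10, SwissProt, TrEMBL
print(sorted(nr_hits))    # ['t1']
print(sorted(tair_hits))  # ['t1', 't2']
```

Queries that retain no entry after filtering correspond to the "noHit" transcript sets listed for each database.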


Protein Homologs

Prunus dulcis v2.0 transcripts with NCBI nr homologs (EXCEL file) P.dulcis_v2.0_vs_nr.xlsx.gz
Prunus dulcis v2.0 transcripts with NCBI nr homologs (FASTA file) P.dulcis_v2.0_vs_nr_hit.fasta.gz
Prunus dulcis v2.0 transcripts without NCBI nr homologs (FASTA file) P.dulcis_v2.0_vs_nr_noHit.fasta.gz
Prunus dulcis v2.0 transcripts with Arabidopsis (TAIR10) homologs (EXCEL file) P.dulcis_v2.0_vs_tair.xlsx.gz
Prunus dulcis v2.0 transcripts with Arabidopsis (TAIR10) homologs (FASTA file) P.dulcis_v2.0_vs_tair_hit.fasta.gz
Prunus dulcis v2.0 transcripts without Arabidopsis (TAIR10) homologs (FASTA file) P.dulcis_v2.0_vs_tair_noHit.fasta.gz
Prunus dulcis v2.0 transcripts with SwissProt homologs (EXCEL file) P.dulcis_v2.0_vs_swissprot.xlsx.gz
Prunus dulcis v2.0 transcripts with SwissProt homologs (FASTA file) P.dulcis_v2.0_vs_swissprot_hit.fasta.gz
Prunus dulcis v2.0 transcripts without SwissProt homologs (FASTA file) P.dulcis_v2.0_vs_swissprot_noHit.fasta.gz
Prunus dulcis v2.0 transcripts with TrEMBL homologs (EXCEL file) P.dulcis_v2.0_vs_trembl.xlsx.gz
Prunus dulcis v2.0 transcripts with TrEMBL homologs (FASTA file) P.dulcis_v2.0_vs_trembl_hit.fasta.gz
Prunus dulcis v2.0 transcripts without TrEMBL homologs (FASTA file) P.dulcis_v2.0_vs_trembl_noHit.fasta.gz


Functional Analysis

Functional annotation files for the Prunus dulcis Texas Genome v2.0 are available for download below. The Prunus dulcis Texas Genome proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).


GO assignments from InterProScan P.pdulcis_v2.0_genes2GO.xlsx.gz
IPR assignments from InterProScan P.pdulcis_v2.0_genes2IPR.xlsx.gz
Proteins mapped to KEGG Orthologs P.pdulcis_v2.0_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Pathways P.pdulcis_v2.0_KEGG-pathways.xlsx.gz