Prunus persica Whole Genome Assembly v2.0 & Annotation v2.1 (v2.0.a1)

Overview
Analysis Name: Prunus persica Whole Genome Assembly v2.0 & Annotation v2.1 (v2.0.a1)
Method: Arachne
Source: Sanger reads
Date performed: 2015-01-15

For use in publications, please CITE the papers below:

Verde I, Jenkins J, Dondini L, Micali S, Pagliarani G, Vendramin E, Paris R, Aramini V, Gazza L, Rossini L, Bassi D, Troggio M, Shu S, Grimwood J, Tartarini S, Dettori MT, Schmutz J (2017) The Peach v2.0 release: high-resolution linkage mapping and deep resequencing improve chromosome-scale assembly and contiguity. BMC Genomics 18:225 DOI: 10.1186/s12864-017-3606-9

The International Peach Genome Initiative (2013). The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45, 487-494 (2013) doi:10.1038/ng.2586

 
Please also cite the version (Peach v2.0.a1 (v2.1)) and any URL below.
 

About the Assembly

Overview

The peach genome sequencing project was initiated in 2008 by the International Peach Genome Initiative, an international consortium led by Italian and US scientists (Ignazio Verde, Albert Abbott, Jeremy Schmutz, Michele Morgante and Daniel Rokhsar). The first version (Peach v1.0) was released under the Fort Lauderdale Agreement in April 2010, and the results were published in Nature Genetics in 2013.

Peach v2.0.a1 was generated from DNA of the doubled haploid cultivar 'Lovell' (PLOV2-2N), which means that the genes and intervening DNA are "fixed", or identical, for all alleles and both chromosomal copies of the genome. This doubled haploid nature facilitated a highly accurate and consistent assembly of the peach genome.

Peach v2.0.a1 currently consists of 8 pseudomolecules representing the 8 chromosomes of peach, numbered according to their corresponding linkage groups. The genome was sequenced to approximately 8.47-fold coverage by whole genome shotgun sequencing with the accurate Sanger methodology and was assembled using Arachne.

This new release (Peach v2.0.a1) aims to improve the chromosome-scale assembly and the annotation of repeated and gene sequences.

The peach v1.0 assembly was improved using large community molecular mapping data obtained from three linkage maps. 7.3 Mb of previously unmapped sequences (11 scaffolds) were integrated into the eight peach pseudomolecules, and nine randomly oriented scaffolds (20 Mb) were correctly oriented. The large mapping dataset also made it possible to fix seven regions (12.2 Mb) that were incorrectly positioned along the pseudomolecules due to misassembly. As a result of these mapping efforts, 99.2% of the Peach v2.0 sequence is now mapped and 97.9% is oriented.
The base accuracy and contiguity were improved using contigs generated by an ABySS assembly of WGS Illumina reads (42x coverage, 2x250 bp, 600 bp insert). Improvements include the correction of homozygous SNPs (859) and indels (1,347) as well as the closure of minor assembly gaps (212 gaps closed, with a gain of 25,199 bp). As a result, the contiguity of Peach v2.0 increased to a contig L50 of 255.4 kb (214.2 kb in Peach v1.0) and a contig N50 of 250 (294 in Peach v1.0).
The annotation of the repeated fraction was also enhanced, including low-copy repeats and the complete sequence and location of 1,157 non-autonomous Helitrons.
Gene prediction and annotation were upgraded using transcript assemblies obtained from 2.2 billion RNA-seq reads from different peach tissues and organs. In total, after masking with the improved repeat annotation, 26,873 protein-coding genes were predicted in the Peach v2.1 annotation, 991 fewer than in Peach v1.0. Gene annotation was greatly enhanced by the prediction of almost 20,000 new isoforms.
 
Statistics
This release of Phytozome includes the JGI v2.1 gene annotation of assembly v2.0. The assembly comprises 225.7 Mb arranged in 8 pseudomolecules, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
 
Genome Size
Approximately 227.4 Mb arranged in 191 scaffolds
Approximately 224.6 Mb arranged in 2,525 contigs (~ 1.2% gap)
Scaffold N50 (L50) = 4 (27.4 Mbp)
Contig N50 (L50) = 250 (255.4 Kbp)
11 scaffolds larger than 50 Kbp, with 99.4% of the genome in scaffolds larger than 50 Kbp
Loci
26,873 loci containing protein-coding genes
Transcripts
47,089 protein-coding transcripts
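
The N50 and L50 figures above can be reproduced from a list of sequence lengths. The following is a minimal sketch in Python, not the pipeline used for this release. It follows this page's labelling, where N50 is the number of sequences and L50 is the length at the 50% point (many other tools swap the two labels). The function name and the toy lengths are made up for illustration.

```python
def n50_stats(lengths):
    """Return (count, length) at the 50% point of an assembly.

    Following this page's convention, the count is reported as "N50"
    and the length as "L50".
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # The sequences seen so far hold at least half of all bases.
            return count, length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, not peach data).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_stats(toy_lengths)
print(f"N50 (count) = {n50}, L50 (length) = {l50}")
```

Applied to the real contig lengths, the same calculation would be expected to give the values listed under Genome Size.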
 
Sequencing, Assembly, and Annotation
 
Gene Prediction and Locus Naming
Short reads (~1B single-end and ~1.2B paired-end Illumina RNA-seq reads of lengths ranging from 75 bp to 100 bp, and 3M 454 reads) from various labs around the globe were used to construct transcript assemblies (TAs) (Shu et al., manuscript in preparation). 106,848 transcript assemblies were constructed using PASA (Haas, 2003) from 383,498 sequences in total, consisting of the TAs above as well as Sanger ESTs, and 23,448 transcript assemblies were constructed from related-species ESTs (424,656 sequences). Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis (Arabidopsis thaliana), rice, grape, soybean and Swiss-Prot eukaryote proteins to the Prunus persica genome, soft-masked for repeats using RepeatMasker (Smit, 1996-2012), with up to 2 kb extension on both ends unless extending into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).
 
The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to the MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript was selected if its Cscore is at least 0.5 and its protein coverage is at least 0.5, or if it has EST coverage, provided that less than 20% of its CDS overlaps with repeats. For gene models whose CDS overlaps with repeats by more than 20%, the Cscore must be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed.
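
The selection thresholds described above can be written as a small decision function. This is only an illustrative sketch of one reading of the text, not the JGI pipeline code. The function and parameter names are hypothetical, all values are fractions between 0 and 1, and a repeat overlap of exactly 20% is treated as high overlap because the text leaves that boundary open.

```python
def cscore(blastp_score, mbh_blastp_score):
    """Ratio of a model's BLASTP score to the mutual-best-hit BLASTP score."""
    return blastp_score / mbh_blastp_score


def keep_transcript(cscore_value, protein_coverage, has_est_support, repeat_overlap):
    """Apply the Cscore / coverage / repeat-overlap thresholds from the text.

    Assumption: overlap of exactly 0.20 falls into the stricter branch.
    """
    if repeat_overlap < 0.20:
        # Low repeat overlap: homology support or EST support is enough.
        homology_ok = cscore_value >= 0.5 and protein_coverage >= 0.5
        return homology_ok or has_est_support
    # High repeat overlap: strong homology support is required.
    return cscore_value >= 0.9 and protein_coverage >= 0.70


# A model with weak homology but EST support and little repeat overlap is kept.
print(keep_transcript(0.2, 0.2, True, 0.05))
# The same model inside a repeat-rich region is rejected.
print(keep_transcript(0.2, 0.2, True, 0.50))
```

The Pfam TE-domain filter (more than 30% of the protein) would be a separate step applied to the models that pass this function.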
 
References:
 
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654
 
Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker Open-3.0. 1996-2011.
Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
 
Salamov, A. A. and Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516-22.
 
Locus name and transcript name mapping from previous annotation version
The locus model name of a v1.0 gene is mapped to a corresponding v2.1 gene as an alias if 1) the v1.0 and v2.1 loci overlap uniquely and appear on the same chromosome, and 2) at least one pair of translated transcripts from the old and new loci are MBHs (mutual best hits) with at least 70% normalized identity in a BLASTP alignment (normalized identity is defined as the number of identical residues divided by the length of the longer sequence). 77.38% of v1.0 loci are mapped.
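
As a worked illustration of the normalized identity defined above, the sketch below divides the number of identical residues by the length of the longer of the two proteins and compares the result with the 70% threshold. The function names and the example numbers are hypothetical.

```python
def normalized_identity(identical_residues, old_length, new_length):
    """Identical residues divided by the length of the longer sequence."""
    return identical_residues / max(old_length, new_length)


def passes_identity_threshold(identical_residues, old_length, new_length,
                              threshold=0.70):
    """True if a mutual-best-hit pair reaches the normalized identity cutoff."""
    return normalized_identity(identical_residues, old_length, new_length) >= threshold


# Hypothetical pair: 300 identical residues, proteins of 400 and 350 residues.
print(normalized_identity(300, 400, 350))        # 300 / 400 = 0.75
print(passes_identity_threshold(300, 400, 350))  # True
```

Dividing by the longer sequence penalizes pairs where one protein is much shorter than the other, even if the aligned region itself is nearly identical.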
 
Contacts
Principal Collaborators:
  • Ignazio Verde, Consiglio per la Ricerca e la Sperimentazione in Agricoltura (email: ignazio DOT verde AT entecra DOT it)
JGI Contacts:
  • Daniel Rokhsar (email: dsrokhsar AT gmail.com)
  • Jeremy Schmutz (email: jschmutz AT hudsonalpha DOT org)
IGA Contacts:
  • Michele Morgante (email: michele DOT morgante AT uniud.it)
  • Simone Scalabrin (email: sscalabrin@igatechnology.com)
GDR contact: Dorrie Main (WSU) (email: dorrie AT wsu DOT edu)
 
Associated Publications
International Peach Genome Initiative, Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, Zuccolo A, Rossini L, Jenkins J, Vendramin E, Meisel LA, Decroocq V, Sosinski B, Prochnik S, Mitros T, Policriti A, Cipriani G, Dondini L, Ficklin S, Goodstein DM, Xuan P, Del Fabbro C, Aramini V, Copetti D, Gonzalez S, Horner DS, Falchi R, Lucas S, Mica E, Maldonado J, Lazzari B, Bielenberg D, Pirona R, Miculan M, Barakat A, Testolin R, Stella A, Tartarini S, Tonutti P, Arús P, Orellana A, Wells C, Main D, Vizzotto G, Silva H, Salamini F, Schmutz J, Morgante M, Rokhsar DS. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics. 2013 May; 45(5):487-94
 
Prunus persica annotation v2.1 on assembly v2.0 (v2.0.a1)
 
Transcripts
Primary transcripts (loci) 26,873
Alternative transcripts 20,216
Total transcripts 47,089

 For Primary transcripts:

Average number of exons 5.2
Median exon length 171
Median intron length 165

Gene model support (value is number of gene models):

Any EST support 21,956
EST support over 100% of their lengths 20,492
EST support over 95% of their lengths 20,841
EST support over 90% of their lengths 20,984
EST support over 75% of their lengths 21,220
EST support over 50% of their lengths 21,497
Peptide homology coverage of 100% 1,938
Peptide homology coverage of over 95% 16,687
Peptide homology coverage of over 90% 19,326
Peptide homology coverage of over 75%  21,790
Peptide homology coverage of over 50% 23,564
Pfam annotation 20,327
Panther annotation 19,938
KOG annotation 11,376
KEGG Orthology annotation 3,877
E.C. number annotation  2,063

 

Homology

Homology of the Prunus persica v2.0.a1 transcripts was determined by pairwise sequence comparison using the blastx algorithm against various protein databases. The results are available for download in Excel format. An expectation value cutoff of less than 1e-6 was used for Arabidopsis proteins and 1e-9 for the NCBI nr, UniProt Swiss-Prot, and UniProt TrEMBL databases.
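
The database-specific cutoffs can be expressed as a simple lookup. This is a minimal sketch for filtering hits from one's own BLAST runs in the same way; the database labels used as dictionary keys and the function name are assumptions, not identifiers from the downloadable files.

```python
# Expectation value cutoffs stated above; the keys are hypothetical labels.
EVALUE_CUTOFFS = {
    "arabidopsis": 1e-6,
    "nr": 1e-9,
    "swissprot": 1e-9,
    "trembl": 1e-9,
}


def passes_cutoff(database, evalue):
    """True if a hit's expectation value is below the cutoff for its database."""
    return evalue < EVALUE_CUTOFFS[database]


# The same hit can pass against Arabidopsis but fail against nr.
print(passes_cutoff("arabidopsis", 1e-7))
print(passes_cutoff("nr", 1e-7))
```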

Protein Homologs

predicted gene functions Prunus_persica_v2.0.a1_predicted_gene_functions.xlsx
44,533 peach gene transcripts with NCBI nr homologs Prunus_persica_v2.0.a1_vs_nr.xlsx
42,335 peach gene transcripts with Arabidopsis homologs Prunus_persica_v2.0.a1_vs_arabidopsis.xlsx
34,594 peach gene transcripts with Swiss-Prot homologs Prunus_persica_v2.0.a1_vs_swissprot.xlsx
44,499 peach gene transcripts with TrEMBL homologs Prunus_persica_v2.0.a1_vs_trembl.xlsx

 

Download
All assembly and annotation files are available for download by selecting the desired data type in the left-hand side bar.  Each data type page will provide a description of the available files and links to download.
Assembly

The Prunus persica v2.0.a1 genome assembly files are available in FASTA and GFF3 formats.  There are a total of 8 pseudomolecules and 183 scaffolds in this assembly of peach.

Downloads

Pseudomolecules & Scaffolds (FASTA file)  Prunus_persica_v2.0.a1_scaffolds.fasta.gz
Pseudomolecules & Scaffolds (GFF3 file)  Prunus_persica_v2.0.a1_scaffolds.gff3.gz
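
After download, the FASTA file can be checked against the counts stated above (8 pseudomolecules plus 183 scaffolds). The following is a minimal sketch using only the Python standard library. It assumes the file has been saved under its listed name in the working directory, and the record names in the demo are made up.

```python
import gzip
import io


def sequence_lengths(handle):
    """Map each FASTA record ID to its sequence length."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def lengths_from_gzip(path):
    """Read a gzip-compressed FASTA file and return its sequence lengths."""
    with gzip.open(path, "rt") as handle:
        return sequence_lengths(handle)


# Demo on an in-memory toy FASTA (hypothetical record names).
demo = io.StringIO(">seqA first record\nACGT\nAC\n>seqB\nGGG\n")
print(sequence_lengths(demo))

# For the real download (assumed to be in the working directory):
# lengths = lengths_from_gzip("Prunus_persica_v2.0.a1_scaffolds.fasta.gz")
# print(len(lengths), sum(lengths.values()))
```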

 

Gene Predictions

The Prunus persica v2.0.a1 genome gene prediction files are available in FASTA and GFF3 formats.

Downloads

All transcript CDS sequences (FASTA file) Prunus_persica_v2.0.a1.allTrs.cds.fa.gz
All peptide sequences  (FASTA file) Prunus_persica_v2.0.a1.allTrs.pep.fa.gz
Primary transcript sequences (FASTA file) Prunus_persica_v2.0.a1.primaryTrs.fa.gz
Primary transcripts (GFF3 file) Prunus_persica_v2.0.a1.primaryTrs.gff3.gz
Primary transcript CDS sequences (FASTA file) Prunus_persica_v2.0.a1.primaryTrs.cds.fa.gz
Primary transcript protein sequences (FASTA file) Prunus_persica_v2.0.a1.primaryTrs.pep.fa.gz
Alternative transcript sequences (FASTA file) Prunus_persica_v2.0.a1.altTrs.fa.gz
Alternative transcripts (GFF3 file) Prunus_persica_v2.0.a1.altTrs.gff3.gz
Genes, CDS, 5' UTR, 3'UTR locations (GFF3 file) Prunus_persica_v2.0.a1.gene.gff3.gz
Genes, CDS, 5' UTR, 3'UTR locations (GFF3 file) Prunus_persica_v2.0.a1.gene_2015Update.gff3.gz

Conversion table between v2.0.a1 genes and v1.0 genes

Functional Analysis

Functional annotations for the Prunus persica v2.0.a1 genome are available for download below. The peach proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

Gene functions annotated by Pfam2go/Interpro Prunus_persica_v2.0.a1_gene_functions.txt.gz
GO assignments from InterProScan Prunus_persica_v2.0.a1_genes2GO.txt.gz
IPR assignments from InterProScan Prunus_persica_v2.0.a1_genes2IPR.txt.gz
KEGG Hierarchy file (for viewing with KegHier) Prunus_persica_v2.0.a1_KEGG-hier.tar.gz
Proteins mapped to KEGG Orthologs Prunus_persica_v2.0.a1_KEGG-orthologis.txt.gz
Proteins mapped to KEGG Pathways Prunus_persica_v2.0.a1_KEGG-pathways.txt.gz

 

SNPs
IRSC 9K peach SNPs IRSC_9K_peach_SNP_array_Peach_v2.0.a1.xls
IRSC 6K cherry SNPs Campoy_2016_BMC_PB_TableS2.xlsx  (Campoy et al. 2016)

IRSC 9K peach SNPs anchored to Prunus persica whole genome v1.0 assembly

  • IRSC 16K SNP array and 18K candidate SNPs
IRSC 16K peach SNPs 16K_peach_array.xlsx
IRSC 18K peach candidate SNPs IRSC_18K_peach_candidate_SNP.xlsx

 

Markers

The alignment tool 'BLAT' was used to map Prunus genetic marker sequences to the Peach genome v2.0.a1. Markers required 90% identity over 97% of their length. For SSRs and RFLPs, the gap size was restricted to 1000 bp or less, with fewer than 2 gaps. The available files are in FASTA and GFF3 format. You can also find original information about these markers on the Prunus persica Whole Genome v1.0 Assembly & Annotation detail page.
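
The marker filtering criteria above can be sketched as a predicate over alignment hits. This is an illustration only, not the script used to build the files below. The MarkerHit record and its field names are hypothetical, and "90% identity over 97% of their length" is read here as: identity of at least 90% within the aligned portion, and an aligned portion covering at least 97% of the marker sequence.

```python
from dataclasses import dataclass


@dataclass
class MarkerHit:
    """Hypothetical summary of one marker-to-genome alignment."""
    marker_type: str       # e.g. "SSR", "RFLP", "SNP"
    query_length: int      # length of the marker sequence
    aligned_length: int    # number of marker bases in the alignment
    identical_bases: int   # matching bases within the alignment
    gap_sizes: tuple = ()  # sizes (bp) of gaps in the alignment


def keep_hit(hit):
    """Apply the identity, coverage, and gap criteria described in the text."""
    if hit.aligned_length <= 0 or hit.query_length <= 0:
        return False
    identity = hit.identical_bases / hit.aligned_length
    coverage = hit.aligned_length / hit.query_length
    if identity < 0.90 or coverage < 0.97:
        return False
    if hit.marker_type in {"SSR", "RFLP"}:
        # Fewer than 2 gaps, each 1000 bp or less.
        return len(hit.gap_sizes) < 2 and all(g <= 1000 for g in hit.gap_sizes)
    return True


good = MarkerHit("SSR", 200, 198, 190, (350,))
too_gappy = MarkerHit("SSR", 200, 198, 190, (350, 40))
print(keep_hit(good), keep_hit(too_gappy))
```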

Downloads

Prunus all genetic marker sequences (FASTA file) Prunus_persica_v2.0.a1_genetic_markers.fasta.gz
Prunus all genetic markers mapped to Peach v2.0.a1 (GFF3 file) Prunus_persica_v2.0.a1_genetic_markers.gff3.gz
Prunus SSR marker sequences (FASTA file) Prunus_persica_v2.0.a1_ssr_markers.fasta.gz
Prunus SSR markers mapped to Peach v2.0.a1 (GFF3 file) Prunus_persica_v2.0.a1_ssr_markers.gff3.gz