Prunus persica Whole Genome Assembly v2.0 & Annotation v2.1 (v2.0.a1)

Overview


Analysis Name	Prunus persica Whole Genome Assembly v2.0 & Annotation v2.1 (v2.0.a1)
Method	Arachne
Source	Sanger reads
Date performed	2015-01-15

For use in publications, please CITE the papers below:

Verde I, Jenkins J, Dondini L, Micali S, Pagliarani G, Vendramin E, Paris R, Aramini V, Gazza L, Rossini L, Bassi D, Troggio M, Shu S, Grimwood J, Tartarini S, Dettori MT, Schmutz J (2017) The Peach v2.0 release: high-resolution linkage mapping and deep resequencing improve chromosome-scale assembly and contiguity. BMC Genomics 18:225 DOI: 10.1186/s12864-017-3606-9

The International Peach Genome Initiative (2013). The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45, 487-494 (2013) doi:10.1038/ng.2586

Please also cite the version (Peach v2.0.a1, annotation v2.1) and any of the URLs below.

Phytozome (URL: phytozome.jgi.doe.gov)
GDR (URL: https://www.rosaceae.org/species/prunus_persica/genome_v2.0.a1)
IGA (URL: http://services.appliedgenomics.org/fgb2/iga/prunus_persica_v2/gbrowse/prunus_persica_v2/)

About the Assembly

Overview

The peach genome sequencing project was initiated in 2008 by the International Peach Genome Initiative, an international consortium led by Italian and US scientists (Ignazio Verde, Albert Abbott, Jeremy Schmutz, Michele Morgante and Daniel Rokhsar). The first version (Peach v1.0) was released under the Fort Lauderdale Agreement in April 2010, and the results were published in Nature Genetics in 2013.

Peach v2.0.a1 was generated from DNA of the doubled haploid cultivar 'Lovell' (PLOV2-2N), which means that the genes and intervening DNA are "fixed", i.e. identical for all alleles and both chromosomal copies of the genome. This doubled haploid nature has facilitated a highly accurate and consistent assembly of the peach genome.

Peach v2.0.a1 currently consists of 8 pseudomolecules representing the 8 chromosomes of peach, numbered according to their corresponding linkage groups. The genome was sequenced to approximately 8.47-fold coverage by whole genome shotgun sequencing using the accurate Sanger methodology and was assembled using Arachne.

This new release (Peach v2.0.a1) addresses several issues, including the chromosome-scale assembly and the annotation of repeated and gene sequences.

The Peach v1.0 assembly was improved using a large community molecular mapping dataset obtained from three linkage maps. 7.3 Mb of previously unmapped sequence (11 scaffolds) was integrated into the eight peach pseudomolecules, and nine randomly oriented scaffolds (20 Mb) were correctly oriented. The large mapping dataset also made it possible to correct seven regions (12.2 Mb) that had been incorrectly positioned along the pseudomolecules because of misassembly. As a result of these mapping efforts, 99.2% of the Peach v2.0 sequence is now mapped and 97.9% is oriented.

The base accuracy and contiguity were improved using contigs generated by an ABySS assembly of WGS Illumina reads (42x of 2x250 bp, 600 bp insert). Improvements include the correction of homozygous SNPs (859) and indels (1,347) as well as the closure of minor assembly gaps (212 gaps closed with a gain of 25,199 bp). As a result, the contiguity of Peach v2.0 increased to a contig L50 of 255.4 kb (214.2 kb in Peach v1.0) and a contig N50 of 250 (294 in Peach v1.0).

The annotation of the repeated fraction was also enhanced, and now includes low-copy repeats and the complete sequence and location of 1,157 non-autonomous Helitrons.

Gene prediction and annotation were upgraded using transcript assemblies obtained from 2.2 billion RNA-seq reads from different peach tissues and organs. In total, after masking with the improved repeat annotation, 26,873 protein-coding genes were predicted in the Peach v2.1 annotation, 991 fewer than in Peach v1.0. Gene annotation was substantially enhanced by the prediction of almost 20,000 new isoforms.

Statistics

This release of Phytozome includes the JGI v2.1 gene annotation of assembly v2.0. The assembly comprises 225.7 Mb arranged in 8 pseudomolecules, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.

Genome Size

Approximately 227.4 Mb arranged in 191 scaffolds

Approximately 224.6 Mb arranged in 2,525 contigs (~ 1.2% gap)

Scaffold N50 (L50) = 4 (27.4 Mbp)

Contig N50 (L50) = 250 (255.4 Kbp)

11 scaffolds larger than 50 Kbp, with 99.4% of the genome in scaffolds larger than 50 Kbp
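
As the figures above imply, this page reports N50 as a sequence count and L50 as a length (many other sources swap the two names). The following is a minimal sketch of how such statistics could be computed from a list of scaffold or contig lengths; the function names are illustrative and are not part of any JGI or GDR tool.

```python
def n50_l50(lengths):
    """Return (N50, L50) in the convention used on this page:
    N50 = smallest number of sequences covering half the total length,
    L50 = length of the shortest sequence in that set."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return count, length
    raise ValueError("lengths must not be empty")


def gap_fraction(scaffold_bp, contig_bp):
    """Fraction of the scaffolded assembly that is gap sequence."""
    return (scaffold_bp - contig_bp) / scaffold_bp


# Toy example: total 30, half 15; the two longest sequences (10 + 8) reach it.
print(n50_l50([10, 8, 6, 4, 2]))
# Gap percentage from the totals quoted above (227.4 Mb vs 224.6 Mb).
print(round(100 * gap_fraction(227.4e6, 224.6e6), 1))
```

With the totals quoted above, the gap fraction works out to about 1.2%, in line with the figure given for the contigs.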

Loci

26,873 loci containing protein-coding genes

Transcripts

47,089 protein-coding transcripts

Sequencing, Assembly, and Annotation

Gene Prediction and Locus Naming

Short reads (~1B single-end and ~1.2B paired-end Illumina RNA-seq reads of lengths ranging from 75 bp to 100 bp, and 3M 454 reads) from various labs around the globe were used to construct transcript assemblies (TAs) (Shu et al., manuscript in preparation). 106,848 transcript assemblies were constructed using PASA (Haas, 2003) from 383,498 sequences in total, consisting of the TAs above, as well as Sanger ESTs, and 23,448 transcript assemblies from ESTs of related species (424,656 sequences). Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis (Arabidopsis thaliana), rice, grape, soybean and Swiss-Prot eukaryote proteins to the Prunus persica genome, soft-masked for repeats using RepeatMasker (Smit, 1996-2012), with up to 2 kb of extension on both ends unless extending into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but using ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).

The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. The PASA-improved gene model proteins were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the fraction of the CDS overlapping repeats. A transcript was selected if its Cscore and protein coverage were both at least 0.5, or if it had EST coverage, provided that less than 20% of its CDS overlapped repeats. For gene models whose CDS overlapped repeats by more than 20%, the Cscore had to be at least 0.9 and the homology coverage at least 70% for the model to be selected. The selected gene models were subjected to Pfam analysis, and gene models whose protein was more than 30% in Pfam TE domains were removed.
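
The transcript selection rule described above can be expressed as a small predicate. This is only an illustrative sketch, not the JGI pipeline code: the `Transcript` class and its field names are hypothetical, and because the text does not say how an overlap of exactly 20% is handled, the sketch assumes it falls under the stricter rule.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical record; field names are assumptions for illustration.
    cscore: float              # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float    # fraction of protein aligned to best homolog
    has_est_support: bool      # any EST coverage
    cds_repeat_overlap: float  # fraction of CDS overlapping repeats


def is_selected(t: Transcript) -> bool:
    """Apply the selection thresholds quoted in the text."""
    if t.cds_repeat_overlap < 0.20:
        homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
        return homology_ok or t.has_est_support
    # Repeat-rich CDS (assumed to include exactly 20%): stricter homology.
    return t.cscore >= 0.9 and t.protein_coverage >= 0.70


# A repeat-free model with EST support but weak homology is kept,
# while the same model with 40% repeat overlap is rejected.
print(is_selected(Transcript(0.3, 0.4, True, 0.0)))
print(is_selected(Transcript(0.3, 0.4, True, 0.4)))
```

The Pfam TE-domain filter (removal of models whose protein is more than 30% in TE domains) is a separate, later step and is not modelled here.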

References:

Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666. http://nar.oupjournals.org/cgi/content/full/31/19/5654

Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker Open-3.0. 1996-2011.

Yeh, R.-F., Lim, L. P., and Burge, C. B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

Salamov, A. A. and Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516-22.

Locus name and transcript name mapping from previous annotation version

The locus model name of a v1.0 gene is mapped to the corresponding v2.1 gene as an alias if 1) the v1.0 and v2.1 loci overlap uniquely and appear on the same chromosome, and 2) at least one pair of translated transcripts from the old and new loci are MBHs (mutual best hits) with at least 70% normalized identity in a BLASTP alignment (normalized identity is defined as the number of identical residues divided by the length of the longer sequence). 77.38% of v1.0 loci are mapped.
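
The normalized identity used in criterion 2 can be written as a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it covers just the identity threshold, not the locus overlap or mutual-best-hit checks.

```python
def normalized_identity(identical_residues, len_old, len_new):
    """Identical residues divided by the length of the longer protein."""
    longer = max(len_old, len_new)
    if longer <= 0:
        raise ValueError("sequence lengths must be positive")
    return identical_residues / longer


def passes_identity_threshold(identical_residues, len_old, len_new,
                              threshold=0.70):
    """True if the pair meets the 70% normalized identity criterion."""
    return normalized_identity(identical_residues, len_old, len_new) >= threshold


# 70 identical residues between a 100-residue and an 80-residue protein:
# 70 / 100 = 0.70, which just meets the threshold.
print(passes_identity_threshold(70, 100, 80))
```

Dividing by the longer sequence penalizes pairs where one model is a fragment of the other, even if the aligned region itself is nearly identical.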

Contacts

Principal Collaborators:

Ignazio Verde, Consiglio per la Ricerca e la Sperimentazione in Agricoltura (email: ignazio DOT verde AT entecra DOT it)

JGI Contacts:

Daniel Rokhsar (email: dsrokhsar AT gmail.com)
Jeremy Schmutz (email: jschmutz AT hudsonalpha DOT org)

IGA Contacts:

Michele Morgante (email: michele DOT morgante AT uniud.it)
Simone Scalabrin (email: sscalabrin@igatechnology.com)

GDR contact: Dorrie Main (WSU) (email: dorrie AT wsu DOT edu)

Associated Publications

International Peach Genome Initiative, Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, Zuccolo A, Rossini L, Jenkins J, Vendramin E, Meisel LA, Decroocq V, Sosinski B, Prochnik S, Mitros T, Policriti A, Cipriani G, Dondini L, Ficklin S, Goodstein DM, Xuan P, Del Fabbro C, Aramini V, Copetti D, Gonzalez S, Horner DS, Falchi R, Lucas S, Mica E, Maldonado J, Lazzari B, Bielenberg D, Pirona R, Miculan M, Barakat A, Testolin R, Stella A, Tartarini S, Tonutti P, Arús P, Orellana A, Wells C, Main D, Vizzotto G, Silva H, Salamini F, Schmutz J, Morgante M, Rokhsar DS. The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nature Genetics. 2013 May; 45(5): 487-94

Prunus persica annotation v2.1 on assembly v2.0 (v2.0.a1)

Transcripts

Primary transcripts (loci)	26,873
Alternative transcripts	20,216
Total transcripts	47,089

For Primary transcripts:

Average number of exons	5.2
Median exon length	171
Median intron length	165

Gene model support (value is number of gene models):

Any EST support	21,956
EST support over 100% of their lengths	20,492
EST support over 95% of their lengths	20,841
EST support over 90% of their lengths	20,984
EST support over 75% of their lengths	21,220
EST support over 50% of their lengths	21,497
Peptide homology coverage of 100%	1,938
Peptide homology coverage of over 95%	16,687
Peptide homology coverage of over 90%	19,326
Peptide homology coverage of over 75%	21,790
Peptide homology coverage of over 50%	23,564
Pfam annotation	20,327
Panther annotation	19,938
KOG annotation	11,376
KEGG Orthology annotation	3,877
E.C. number annotation	2,063

Homology

Homology of the Prunus persica v2.0.a1 transcripts was determined by pairwise sequence comparison using the blastx algorithm against various protein databases. The results are available for download in Excel format. An expectation value cutoff of less than 1e-6 was used for Arabidopsis proteins and less than 1e-9 for the NCBI nr, UniProt Swiss-Prot, and UniProt TrEMBL databases.
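
For readers who want to apply the same cutoffs to their own blastx results, the per-database thresholds described above can be captured in a small lookup. This is a sketch under stated assumptions: the dictionary keys and function name are hypothetical labels, not identifiers used in the downloadable files.

```python
# E-value cutoffs quoted in the text; the keys are illustrative labels.
EVALUE_CUTOFFS = {
    "arabidopsis": 1e-6,
    "nr": 1e-9,
    "swissprot": 1e-9,
    "trembl": 1e-9,
}


def keep_hit(database, evalue):
    """Keep a hit only if its e-value is below the database's cutoff."""
    return evalue < EVALUE_CUTOFFS[database]


# The same hit can pass the Arabidopsis cutoff but fail the stricter nr one.
print(keep_hit("arabidopsis", 1e-7))
print(keep_hit("nr", 1e-7))
```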

Protein Homologs

predicted gene functions	Prunus_persica_v2.0.a1_predicted_gene_functions.xlsx
44,533 peach gene transcripts with NCBI nr homologs	Prunus_persica_v2.0.a1_vs_nr.xlsx
42,335 peach gene transcripts with Arabidopsis homologs	Prunus_persica_v2.0.a1_vs_arabidopsis.xlsx
34,594 peach gene transcripts with Swiss-Prot homologs	Prunus_persica_v2.0.a1_vs_swissprot.xlsx
44,499 peach gene transcripts with TrEMBL homologs	Prunus_persica_v2.0.a1_vs_trembl.xlsx

Download

All assembly and annotation files are available for download by selecting the desired data type in the left-hand side bar. Each data type page will provide a description of the available files and links to download.

Assembly

The Prunus persica v2.0.a1 genome assembly files are available in FASTA and GFF3 formats. There are a total of 8 pseudomolecules and 183 scaffolds in this assembly of peach.

Downloads

Pseudomolecules & Scaffolds (FASTA file)	Prunus_persica_v2.0.a1_scaffolds.fasta.gz
Pseudomolecules & Scaffolds (GFF3 file)	Prunus_persica_v2.0.a1_scaffolds.gff3.gz

Gene Predictions

The Prunus persica v2.0.a1 genome gene prediction files are available in FASTA and GFF3 formats.

Downloads

All transcript CDS sequences (FASTA file)	Prunus_persica_v2.0.a1.allTrs.cds.fa.gz
All peptide sequences (FASTA file)	Prunus_persica_v2.0.a1.allTrs.pep.fa.gz
Primary transcript sequences (FASTA file)	Prunus_persica_v2.0.a1.primaryTrs.fa.gz
Primary transcripts (GFF3 file)	Prunus_persica_v2.0.a1.primaryTrs.gff3.gz
Primary transcript CDS sequences (FASTA file)	Prunus_persica_v2.0.a1.primaryTrs.cds.fa.gz
Primary transcript protein sequences (FASTA file)	Prunus_persica_v2.0.a1.primaryTrs.pep.fa.gz
Alternative transcript sequences (FASTA file)	Prunus_persica_v2.0.a1.altTrs.fa.gz
Alternative transcripts (GFF3 file)	Prunus_persica_v2.0.a1.altTrs.gff3.gz
Genes, CDS, 5' UTR, 3'UTR locations (GFF3 file)	Prunus_persica_v2.0.a1.gene.gff3.gz
Genes, CDS, 5' UTR, 3'UTR locations (GFF3 file)	Prunus_persica_v2.0.a1.gene_2015Update.gff3.gz
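
As a quick sanity check after downloading one of the GFF3 files listed above, the feature types it contains can be tallied with the Python standard library. This is a minimal sketch that assumes a standard tab-delimited, gzip-compressed GFF3 file; the function name is illustrative.

```python
import gzip
from collections import Counter


def count_feature_types(gff3_gz_path):
    """Count GFF3 features by type (column 3) in a gzipped file."""
    counts = Counter()
    with gzip.open(gff3_gz_path, "rt") as handle:
        for line in handle:
            # Skip directives, comments, and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts


# Example usage with a file name from the table above:
# counts = count_feature_types("Prunus_persica_v2.0.a1.gene.gff3.gz")
# print(counts.most_common())
```

The resulting counts can be compared against the locus and transcript totals reported on this page, bearing in mind that the exact feature type names depend on how the file was produced.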

Conversion table between v2.0.a1 genes and v1.0 genes

Functional Analysis

Functional annotation files for the Prunus persica v2.0.a1 genome are available for download below. The peach proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

Gene functions annotated by Pfam2go/Interpro	Prunus_persica_v2.0.a1_gene_functions.txt.gz
GO assignments from InterProScan	Prunus_persica_v2.0.a1_genes2GO.txt.gz
IPR assignments from InterProScan	Prunus_persica_v2.0.a1_genes2IPR.txt.gz
KEGG Hierarchy file (for viewing with KegHier)	Prunus_persica_v2.0.a1_KEGG-hier.tar.gz
Proteins mapped to KEGG Orthologs	Prunus_persica_v2.0.a1_KEGG-orthologis.txt.gz
Proteins mapped to KEGG Pathways	Prunus_persica_v2.0.a1_KEGG-pathways.txt.gz

SNPs

The IRSC (International Rosaceae Sequencing Consortium) 9K SNPs were mapped to the Peach Genome v2.0.a1. (Refer to the Prunus persica Whole Genome v1.0 Assembly & Annotation detail page for original information about the SNPs.)

IRSC 9K peach SNPs	IRSC_9K_peach_SNP_array_Peach_v2.0.a1.xls
IRSC 6K cherry SNPs	Campoy_2016_BMC_PB_TableS2.xlsx (Campoy et al. 2016)

IRSC 9K peach SNPs anchored to Prunus persica whole genome v1.0 assembly

IRSC 16K SNP array and 18K candidate SNPs

IRSC 16K peach SNPs	16K_peach_array.xlsx
IRSC 18K peach candidate SNPs	IRSC_18K_peach_candidate_SNP.xlsx

Markers

The alignment tool 'BLAT' was used to map Prunus genetic marker sequences to the Peach genome v2.0.a1. Markers were required to align with 90% identity over 97% of their length. For SSRs and RFLPs, the gap size was restricted to 1000 bp or less, with fewer than 2 gaps. The available files are in FASTA and GFF3 format. You can also find original information about these markers on the Prunus persica Whole Genome v1.0 Assembly & Annotation detail page.
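
The marker mapping criteria above can be summarized as a filter over alignment hits. The sketch below is an assumption-laden illustration, not the script used by GDR: the function and parameter names are hypothetical, identity is taken as matches divided by aligned length, and the 90% and 97% figures are treated as "at least" thresholds.

```python
def marker_hit_passes(matches, aligned_len, marker_len, marker_type,
                      gap_sizes=()):
    """Check one alignment hit against the thresholds quoted in the text."""
    if aligned_len <= 0 or marker_len <= 0:
        return False
    identity = matches / aligned_len      # assumed definition of identity
    coverage = aligned_len / marker_len   # fraction of the marker aligned
    if identity < 0.90 or coverage < 0.97:
        return False
    if marker_type in {"SSR", "RFLP"}:
        # Fewer than 2 gaps, each no longer than 1000 bp.
        return len(gap_sizes) < 2 and all(g <= 1000 for g in gap_sizes)
    return True


# An SSR hit with a single 500 bp gap passes; one with a 1500 bp gap does not.
print(marker_hit_passes(95, 100, 100, "SSR", gap_sizes=(500,)))
print(marker_hit_passes(95, 100, 100, "SSR", gap_sizes=(1500,)))
```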

Downloads

Prunus all genetic marker sequences (FASTA file)	Prunus_persica_v2.0.a1_genetic_markers.fasta.gz
Prunus all genetic markers mapped to Peach v2.0.a1 (GFF3 file)	Prunus_persica_v2.0.a1_genetic_markers.gff3.gz
Prunus SSR marker sequences (FASTA file)	Prunus_persica_v2.0.a1_ssr_markers.fasta.gz
Prunus SSR markers mapped to Peach v2.0.a1 (GFF3 file)	Prunus_persica_v2.0.a1_ssr_markers.gff3.gz

Links

View Assembly in JBrowse

Blast