Prunus cerasus cv. 'Montmorency' Whole Genome v1.0.a1 Assembly & Annotation

Analysis Name	Prunus cerasus cv. 'Montmorency' Whole Genome v1.0.a1 Assembly & Annotation
Method	Assembler: Canu, Polishing: Pilon, Scaffolding: Juicer, 3D-DNA, & JuiceBox Assembly Tools
Source	PacBio long-reads & Illumina short-reads
Date performed	2023-01-10

Publication:

Goeckeritz, C. Z., Rhoades, K. E., Childs, K. L., Iezzoni, A. F., VanBuren, R., & Hollender, C. A. (2023). Genome of tetraploid sour cherry (Prunus cerasus L.) 'Montmorency' identifies three distinct ancestral Prunus genomes. Horticulture Research, 2023, uhad097. DOI: https://doi.org/10.1093/hr/uhad097

Overview

Three collections of young leaf tissue were made in the spring of 2019 for PacBio sequencing, Illumina sequencing of a gDNA library, and Illumina sequencing of a Hi-C library. Canu was used to assemble approximately 100X PacBio long reads into contigs. Illumina 150 bp paired-end gDNA reads (56X coverage) were then aligned to the assembly using bowtie2, and these alignments were used to polish the initial assembly with Pilon. After polishing, the assembly was scaffolded using Illumina 150 bp paired-end Hi-C reads and the Juicer + 3D-DNA pipeline. Misassemblies in the scaffolded chromosome-scale sequences were corrected in JuiceBox Assembly Tools.

For annotation, Illumina 150 bp paired-end RNA sequencing was conducted on a variety of tissues from cultivar 'Montmorency' at different developmental time points to aid gene discovery. In addition to the RNA-seq, Nanopore cDNA sequencing was conducted on a diverse pool of tissues in an attempt to obtain full-length transcripts for annotation. Manually curated protein databases were downloaded from UniProt and arabidopsis.org and aligned to the assembly with Exonerate to make homology-based gene predictions. The RNA-seq and cDNA sequencing data were aligned to the assembly using STAR and minimap2, respectively, and a transcriptome was assembled from each data type using StringTie. The two transcriptomes and the protein alignments were given to MAKER for evidence-based gene prediction.

After MAKER's first run, the resulting .gff3 was used to train the gene finders Augustus and SNAP. MAKER was then run a second time using these evidence-trained gene finders, and the resulting gene predictions were filtered so that each prediction contained at least one known Pfam domain. Next, deFusion was run to identify erroneously fused gene candidates; more than 2500 gene predictions on the 16 main chromosomes of the assembly were defused as a result. At this point, predictions were filtered again, both to retain those containing Pfam domains and to exclude those with homology to transposable elements. Lastly, Apollo was used to manually annotate the intron-exon structure of 67 genes, including all identified DAM (Dormancy Associated MADS-box) genes.
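
The final filtering criteria described above (at least one Pfam domain, no homology to transposable elements) amount to a simple set filter. The sketch below is illustrative only: the gene IDs, the domain assignments, and the function name are hypothetical and are not taken from the authors' pipeline.

```python
def filter_predictions(gene_ids, pfam_hits, te_like):
    """Keep predictions with at least one Pfam domain and no TE homology.

    gene_ids  -- gene prediction IDs, in their original order
    pfam_hits -- dict mapping a gene ID to its list of Pfam accessions
    te_like   -- set of gene IDs with homology to transposable elements
    """
    return [g for g in gene_ids if pfam_hits.get(g) and g not in te_like]


# Hypothetical IDs and domain assignments, for illustration only.
genes = ["gene1", "gene2", "gene3", "gene4"]
pfam = {"gene1": ["PF00319"], "gene3": ["PF00078"], "gene4": []}
te = {"gene3"}

kept = filter_predictions(genes, pfam, te)
print(kept)  # ['gene1']
```

Here gene2 and gene4 are dropped for lacking a domain hit, and gene3 is dropped because it is flagged as transposable-element-like even though it has a domain.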

Summary of the ‘Montmorency’ assembly metrics

Estimated haploid genome size (k-mer analysis; k = 25)					621 Mb
Estimated heterozygosity (total)					4.90%
Heterozygosity class	aaab	aabb	aabc	abcd
Estimated heterozygosity by class	2.430%	2.060%	0.001%	0.451%
Full assembly size					1066 Mb
Scaffolded assembly size					771.8 Mb
NG50					11.56 Mb
Number of contigs					3592
Linkage Groups					24
BUSCO (Viridiplantae db10)	complete	singletons	duplicates	missing	fragmented
Scaffolded assembly (24 LGs)	98.6%	5.4%	93.2%	0.9%	0.5%
subgenome A	91.5%	89.6%	1.9%	6.4%	2.1%
subgenome A'	94.8%	92.9%	1.9%	4.0%	1.2%
subgenome B	90.8%	88.2%	2.6%	7.3%	1.9%
Estimated % Repeats (Full Assembly)	LTR	TIR	Helitron	Total
Estimated % Repeats (Full Assembly)	35.6%	11.6%	1.3%	48.5%
LTR Assembly Index	Full assembly (incl. unanchored)				14.74
LTR Assembly Index	Scaffolded assembly				17.09
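
As a reading aid for the table above: NG50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of an estimated genome size (N50 uses the assembly size instead). The sketch below uses toy lengths, not the 'Montmorency' contigs, and which genome size estimate was used for the reported NG50 is not stated here. The last lines check two relationships that can be read directly off the table.

```python
def ng50(lengths, genome_size):
    """Return NG50 for a list of sequence lengths and a genome size estimate.

    Returns None if the sequences cover less than half of the estimate.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


# Toy lengths (hypothetical) against a hypothetical genome size of 100.
toy_lengths = [40, 25, 15, 10, 5, 5]
toy_ng50 = ng50(toy_lengths, 100)
print(toy_ng50)  # 25, because 40 + 25 = 65 is the first running sum >= 50

# Scaffolded share of the full assembly, from the table (771.8 Mb of 1066 Mb).
scaffolded_pct = round(100 * 771.8 / 1066, 1)
print(scaffolded_pct)  # 72.4

# BUSCO "complete" is the sum of "singletons" and "duplicates" (24 LGs row).
busco_complete = round(5.4 + 93.2, 1)
print(busco_complete)  # 98.6
```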

Homology

Homology of the P. cerasus cv. Montmorency Genome v1.0.a1 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value cutoff of less than 1e-9 was used for NCBI nr (Release 2021-09), and a cutoff of less than 1e-6 was used for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2021-09), and UniProtKB/TrEMBL (Release 2021-09) databases. The best hit reports are available for download in Excel format.

Protein Homologs

P. cerasus cv. Montmorency v1.0.a1 proteins with NCBI nr homologs (EXCEL file)	Pcerasus_Montmorency_v1.0.a1.a1_vs_nr.xlsx.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with NCBI nr homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1.a1_vs_nr_hit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins without NCBI nr homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1.a1_vs_nr_noHit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with Arabidopsis (Araport11) homologs (EXCEL file)	Pcerasus_Montmorency_v1.0.a1.a1_vs_arabidopsis.xlsx.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with Arabidopsis (Araport11) homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1_vs_arabidopsis_hit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins without Arabidopsis (Araport11) homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1_vs_arabidopsis_noHit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with SwissProt homologs (EXCEL file)	Pcerasus_Montmorency_v1.0.a1_vs_swissprot.xlsx.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with SwissProt homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1_vs_swissprot_hit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins without SwissProt homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1_vs_swissprot_noHit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with TrEMBL homologs (EXCEL file)	Pcerasus_Montmorency_v1.0.a1_vs_trembl.xlsx.gz
P. cerasus cv. Montmorency v1.0.a1 proteins with TrEMBL homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1_vs_trembl_hit.fasta.gz
P. cerasus cv. Montmorency v1.0.a1 proteins without TrEMBL homologs (FASTA file)	Pcerasus_Montmorency_v1.0.a1_vs_trembl_noHit.fasta.gz
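
To make the expectation value cutoff and "best hit" selection concrete, here is a minimal sketch of how such a report could be derived from tabular BLAST output. It assumes the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the exact procedure used to build the downloadable files is not described here, and the protein and subject IDs are hypothetical.

```python
import csv
import io


def best_hits(handle, max_evalue):
    """Return {query: (subject, evalue, bitscore)} for the top-scoring hit
    of each query, keeping only hits with an e-value below max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= max_evalue:
            continue  # the cutoff is "less than", so equal values are dropped
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits, for illustration only.
example = io.StringIO(
    "protA\tsubj1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350\n"
    "protA\tsubj2\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t400\n"
    "protB\tsubj3\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-7\t45\n"
)
hits = best_hits(example, max_evalue=1e-9)
print(sorted(hits))      # ['protA']
print(hits["protA"][0])  # subj2
```

With the looser 1e-6 cutoff used for the Araport11, SwissProt, and TrEMBL searches, the protB hit in this toy input would be retained as well.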

Assembly

The Prunus cerasus Montmorency Genome v1.0 assembly file is available in FASTA format.

Downloads

Chromosomes (FASTA file)

pcerasus_Mont_v1.0.fasta.gz
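
A quick way to inspect a downloaded assembly file is to tabulate sequence lengths. The following is a minimal standard-library sketch; the helper names are hypothetical, and the toy records stand in for the real sequences, whose names are not listed on this page.

```python
import gzip


def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # name is the first header token
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def fasta_lengths_gz(path):
    """Same as fasta_lengths, reading a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return fasta_lengths(handle)


# Toy records, for illustration only.
toy = [">seq1 description", "ACGT", "AC", ">seq2", "GGG"]
print(fasta_lengths(toy))  # {'seq1': 6, 'seq2': 3}
```

Once the file has been downloaded, `fasta_lengths_gz("pcerasus_Mont_v1.0.fasta.gz")` would return the lengths of its sequences.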

Gene Predictions

The Prunus cerasus Montmorency v1.0.a1 genome gene prediction files are available in FASTA and GFF3 formats.

Downloads

Protein sequences (FASTA file)	pcerasus_Mont_v1.0.a1.proteins.fasta.gz
CDS (FASTA file)	pcerasus_Mont_v1.0.a1.cds.fasta.gz
Transcript sequences (FASTA file)	pcerasus_Mont_v1.0.a1.transcripts.fasta.gz
Gene sequences (FASTA file)	pcerasus_Mont_v1.0.a1.genes.fasta.gz
Genes (GFF3 file)	pcerasus_Mont_v1.0.a1.genes.gff3.gz
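
GFF3 files are tab-delimited with nine columns, the third of which holds the feature type. The sketch below counts features by type as a first sanity check on a downloaded annotation; the helper names and the toy records are hypothetical, and the feature types present in the actual file may differ.

```python
import gzip
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by type (column 3), skipping comments."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # optional embedded sequence section; no features follow
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


def count_feature_types_gz(path):
    """Same as count_feature_types, reading a gzip-compressed GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return count_feature_types(handle)


# Toy records, for illustration only.
toy = [
    "##gff-version 3\n",
    "chrX\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1\n",
    "chrX\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n",
    "chrX\tsrc\texon\t1\t50\t.\t+\t.\tParent=m1\n",
    "chrX\tsrc\texon\t60\t100\t.\t+\t.\tParent=m1\n",
]
print(dict(count_feature_types(toy)))  # {'gene': 1, 'mRNA': 1, 'exon': 2}
```

For the file listed above, the call would be `count_feature_types_gz("pcerasus_Mont_v1.0.a1.genes.gff3.gz")`.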

Functional Analysis

Functional annotations for the Prunus cerasus cv. Montmorency Genome v1.0.a1 are available for download below. The Prunus cerasus cv. Montmorency Genome v1.0.a1 proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

GO assignments from InterProScan	Pcerasus_Montmorency_v1.0.a1_genes2GO.xlsx.gz
IPR assignments from InterProScan	Pcerasus_Montmorency_v1.0.a1_genes2IPR.xlsx.gz
Proteins mapped to KEGG Orthologs	Pcerasus_Montmorency_v1.0.a1_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Pathways	Pcerasus_Montmorency_v1.0.a1_KEGG-pathways.xlsx.gz

Transcript Alignments

Transcript alignments were performed by the GDR Team of the Main Bioinformatics Lab at WSU. The alignment tool BLAT was used to map transcripts to the Prunus cerasus Montmorency genome assembly. Alignments with an alignment length of 97% and 97% identity were preserved. The available files are in GFF3 format.

Fragaria x ananassa GDR RefTrans v1	Pcerasus_Montmorency_v1.0_f.x.ananassa_GDR_reftransV1
Prunus avium GDR RefTrans v1	Pcerasus_Montmorency_v1.0_p.avium_GDR_reftransV1
Prunus persica GDR RefTrans v1	Pcerasus_Montmorency_v1.0_p.persica_GDR_reftransV1
Rosa GDR RefTrans v1	Pcerasus_Montmorency_v1.0_rosa_GDR_reftransV1
Rubus GDR RefTrans v2	Pcerasus_Montmorency_v1.0_rubus_GDR_reftransV2
Malus x domestica GDR RefTrans v1	Pcerasus_Montmorency_v1.0_m.x.domestica_GDR_reftransV1
Pyrus GDR RefTrans v1	Pcerasus_Montmorency_v1.0_pyrus_GDR_reftransV1
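
The 97% alignment length and 97% identity criteria can be sketched as a simple filter. The source does not define how the two quantities were computed, so the definitions below are assumptions: coverage is the aligned span of the transcript divided by its length, identity is matching bases divided by aligned bases, and both thresholds are treated as minimums. The argument names follow the corresponding PSL fields, and the numbers are hypothetical.

```python
def passes_thresholds(matches, mismatches, rep_matches, q_start, q_end, q_size,
                      min_coverage=0.97, min_identity=0.97):
    """Return True if an alignment meets both thresholds.

    Assumed definitions (not confirmed by the source):
      coverage = (q_end - q_start) / q_size
      identity = (matches + rep_matches) / (matches + mismatches + rep_matches)
    """
    aligned = matches + mismatches + rep_matches
    if aligned == 0 or q_size == 0:
        return False
    coverage = (q_end - q_start) / q_size
    identity = (matches + rep_matches) / aligned
    return coverage >= min_coverage and identity >= min_identity


# Hypothetical alignments of a 1000-base transcript.
print(passes_thresholds(980, 10, 0, 0, 990, 1000))  # True: 99% covered, ~99% identical
print(passes_thresholds(900, 0, 0, 0, 900, 1000))   # False: only 90% covered
print(passes_thresholds(950, 40, 0, 0, 990, 1000))  # False: ~96% identical
```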

Links

BLAST

View Assembly in JBrowse

View Synteny