Prunus sibirica F106 Whole Genome v1.0 Assembly & Annotation

Overview
Analysis Name: Prunus sibirica F106 Whole Genome v1.0 Assembly & Annotation
Method: PacBio, BioNano and Hi-C
Source: (F106) PacBio, BioNano and Hi-C
Date performed: 2021-03-11

 

Publication: submitted for publication

Authors list: Lin Wang#, Mengzhen Huang#, Zhuo Chen#, Jiao Zhang#, Xuewen Wang#, Gaopu Zhu, Han Zhao, Huimin Liu, Qiang Gao, Hongwei Lu, Wanyu Xu, Ningning Gou, Huilong Du, Xiuxiu Li, Chen Chen, Haikun Bai, Xuchun Zhu, Chu Wang, Yujing Zhang, Tiezhu Li, Wenquan Bao, Qiuping Zhang, Xiaoli Zhang, Depeng Wang, Jiang Hu, Jingjing Li, Jeffrey L. Bennetzen*, Chengzhi Liang*, Ta-na Wuyun*

#These authors contributed equally: Lin Wang, Mengzhen Huang, Zhuo Chen, Hongwei Lu, Xuewen Wang.

*Corresponding authors: Jeffrey L. Bennetzen, Chengzhi Liang, Ta-na Wuyun.

Abstract:

Apricot is an economically and ecologically important species that has been cultivated for thousands of years. However, the genetic foundations of its origin, domestication, and agronomic traits remain unclear. Here, we provide chromosome-level genome sequence assemblies of the wild species Prunus sibirica (‘F106’) and two cultivated apricots (‘Sungold’ and ‘Longwangmao’). With scaffold N50 lengths of 10.5-14.1 Mb, the genome assemblies are 218-225 Mb in size and contain 32,669-32,987 predicted protein-encoding genes. We also obtained resequencing data for 307 accessions of different geographic origins. Our population genomic analyses indicate that kernel consumption apricot (KCA) germplasm has both extensive heterozygosity and admixture from fresh fruit P. armeniaca (FFA), indicating very recent hybridization between P. sibirica and FFA. Using genome-wide association studies, we identified multiple genomic regions and candidate genes significantly associated with 12 important fruit and kernel traits, such as kernel taste. This study provides important resources for the genetic improvement of apricot germplasm.

Genome facts and statistics:

Three genomes, one for the wild apricot species (Prunus sibirica, accession ‘F106’) and two for phenotypically distinct subtypes of cultivated apricot (Prunus armeniaca, FFA accession ‘Sungold’ and KCA accession ‘Longwangmao’), were sequenced and assembled using a combination of long-read sequencing (PacBio), optical mapping (BioNano), and high-throughput chromatin conformation capture (Hi-C) technologies. The final genome assemblies covered 217 Mb with a scaffold N50 length of 11.55 Mb for ‘Sungold’, 219 Mb with a scaffold N50 length of 10.50 Mb for ‘F106’, and 225 Mb with a scaffold N50 length of 14.10 Mb for ‘Longwangmao’, where N50 is the minimum scaffold length needed to cover 50% of the genome. A total of 32,669, 32,959, and 32,987 protein-encoding genes, with an average coding sequence length of ~1 kb and an average of six exons per gene, were annotated in the ‘Sungold’, ‘F106’, and ‘Longwangmao’ genomes, respectively.
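The N50 definition above can be illustrated with a short calculation. The following is a minimal sketch, not the pipeline used for these assemblies; the function name `n50` and the toy lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are sorted from longest to shortest and summed until the
    running total reaches half of the total assembly length; the length
    of the scaffold that crosses that point is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical toy lengths, not taken from the F106 assembly.
toy_lengths = [8, 6, 4, 2]
print(n50(toy_lengths))
```

With the toy input, the total is 20, and the running sum first reaches 10 at the second-longest scaffold, so the sketch reports that scaffold's length.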

Homology

Homology of the Prunus sibirica genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr database (Release 2018-05), and a cutoff of less than 1e-6 was used for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
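The effect of the two E-value cutoffs can be sketched as a small filter. This example assumes BLAST tabular output with the 12 standard columns (query ID, subject ID, ..., E-value in column 11, bit score in column 12) and assumes that the "best hit" is the passing hit with the highest bit score; the page does not state the output format or the tie-breaking rule, and the function name `best_hits` and the example rows are hypothetical.

```python
def best_hits(lines, max_evalue):
    """Keep, per query, the highest-scoring hit with E-value below max_evalue.

    Assumes 12-column tab-separated BLAST output (an assumption, see above).
    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue  # the cutoff is "less than", so equal values are dropped
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example rows; IDs and scores are made up for illustration.
example_rows = [
    "protA\tsubj1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "protA\tsubj2\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t380",
    "protB\tsubj3\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-7\t45",
]
nr_style = best_hits(example_rows, max_evalue=1e-9)         # stricter cutoff
swissprot_style = best_hits(example_rows, max_evalue=1e-6)  # looser cutoff
```

In this toy data, `protB` has a hit only under the looser 1e-6 cutoff, which mirrors how a protein can appear in a "hit" file for one database and a "noHit" file for another.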

 

Protein Homologs

Prunus sibirica v1.0 proteins with NCBI nr homologs (EXCEL file) Prunus_sibirica_F106_v1.0_vs_nr.xlsx.gz
Prunus sibirica v1.0 proteins with NCBI nr (FASTA file) Prunus_sibirica_F106_v1.0_vs_nr_hit.fasta.gz
Prunus sibirica v1.0 proteins without NCBI nr (FASTA file) Prunus_sibirica_F106_v1.0_vs_nr_noHit.fasta.gz
Prunus sibirica v1.0 proteins with Arabidopsis (Araport11) homologs (EXCEL file) Prunus_sibirica_F106_v1.0_vs_arabidopsis.xlsx.gz
Prunus sibirica v1.0 proteins with Arabidopsis (Araport11) (FASTA file) Prunus_sibirica_F106_v1.0_vs_arabidopsis_hit.fasta.gz
Prunus sibirica v1.0 proteins without Arabidopsis (Araport11) (FASTA file) Prunus_sibirica_F106_v1.0_vs_arabidopsis_noHit.fasta.gz
Prunus sibirica v1.0 proteins with SwissProt homologs (EXCEL file) Prunus_sibirica_F106_v1.0_vs_swissprot.xlsx.gz
Prunus sibirica v1.0 proteins with SwissProt (FASTA file) Prunus_sibirica_F106_v1.0_vs_swissprot_hit.fasta.gz
Prunus sibirica v1.0 proteins without SwissProt (FASTA file) Prunus_sibirica_F106_v1.0_vs_swissprot_noHit.fasta.gz
Prunus sibirica v1.0 proteins with TrEMBL homologs (EXCEL file) Prunus_sibirica_F106_v1.0_vs_trembl.xlsx.gz
Prunus sibirica v1.0 proteins with TrEMBL (FASTA file) Prunus_sibirica_F106_v1.0_vs_trembl_hit.fasta.gz
Prunus sibirica v1.0 proteins without TrEMBL (FASTA file) Prunus_sibirica_F106_v1.0_vs_trembl_noHit.fasta.gz

 

Assembly

The Prunus sibirica F106 Genome v1.0 assembly file is available in FASTA format.

Downloads

Chromosomes (FASTA file) Prunus_sibirica_F106_v1.0.fasta.gz

 

Gene Predictions

The Prunus sibirica F106 v1.0 genome gene prediction files are available in FASTA and GFF3 formats.

Downloads

Protein sequences (FASTA file) Prunus_sibirica_F106_v1.0.proteins.fasta.gz
CDS (FASTA file) Prunus_sibirica_F106_v1.0.cds.fasta.gz
Gene sequences (FASTA file) Prunus_sibirica_F106_v1.0.genes.fasta.gz
Genes (GFF3 file) Prunus_sibirica_F106_v1.0.genes.gff3.gz
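As a quick check on a downloaded annotation, the gzipped GFF3 file can be read directly without unpacking it. The sketch below tallies feature types from column 3 of each record; the function name `count_feature_types` is hypothetical, and the feature type names actually present in the F106 file (for example `gene` or `mRNA`) are an assumption until the file is inspected.

```python
import gzip
from collections import Counter


def count_feature_types(gff3_path):
    """Count GFF3 records by feature type (column 3) in a gzipped file."""
    counts = Counter()
    with gzip.open(gff3_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) == 9:  # a GFF3 record has exactly nine columns
                counts[columns[2]] += 1
    return counts


# Example usage, assuming the file has been downloaded to the working directory:
# counts = count_feature_types("Prunus_sibirica_F106_v1.0.genes.gff3.gz")
# print(counts.most_common())
```

If the file uses `gene` as its top-level feature type, the count for that type can be compared against the number of annotated genes reported for ‘F106’.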

 

Functional Analysis

Functional annotation files for the Prunus sibirica Genome v1.0 are available for download below. The Prunus sibirica Genome v1.0 proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).

Downloads

GO assignments from InterProScan Prunus_sibirica_F106_v1.0_genes2GO.xlsx.gz
IPR assignments from InterProScan Prunus_sibirica_F106_v1.0_genes2IPR.xlsx.gz
Proteins mapped to KEGG Orthologs Prunus_sibirica_F106_v1.0_KEGG-orthologis.xlsx.gz
Proteins mapped to KEGG Pathways Prunus_sibirica_F106_v1.0_KEGG-pathways.xlsx.gz

 

Transcript Alignments
Transcript alignments were performed by the GDR Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the Prunus sibirica genome assembly. Alignments with an alignment length of at least 97% and at least 97% identity were retained. The available files are in GFF3 format.

 

Fragaria x ananassa GDR RefTrans v1 Prunus sibirica F106_v1.0_f.x.ananassa_GDR_reftransV1
Prunus avium GDR RefTrans v1 Prunus sibirica F106_v1.0_p.avium_GDR_reftransV1
Prunus persica GDR RefTrans v1 Prunus sibirica F106_v1.0_p.persica_GDR_reftransV1
Pyrus GDR RefTrans v1 Prunus sibirica F106_v1.0_pyrus_GDR_reftransV1
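The 97% thresholds described in this section can be sketched as a filter over BLAT's PSL output. This is an illustration only: the page does not say how alignment length and identity were computed, so the definitions here (aligned query span divided by query size, and matching bases divided by aligned bases) are assumptions, and the names `psl_passes` and `make_psl` are hypothetical.

```python
def psl_passes(psl_line, min_coverage=0.97, min_identity=0.97):
    """Return True if a PSL record meets both thresholds.

    PSL columns used: 0 matches, 1 misMatches, 2 repMatches,
    10 qSize, 11 qStart, 12 qEnd.
    """
    f = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_size, q_start, q_end = int(f[10]), int(f[11]), int(f[12])
    aligned_bases = matches + mismatches + rep_matches
    if q_size == 0 or aligned_bases == 0:
        return False
    coverage = (q_end - q_start) / q_size                # assumed definition
    identity = (matches + rep_matches) / aligned_bases   # assumed definition
    return coverage >= min_coverage and identity >= min_identity


def make_psl(matches, mismatches, q_size, q_start, q_end):
    """Build a minimal, hypothetical 21-column PSL row for testing."""
    span = q_end - q_start
    fields = [matches, mismatches, 0, 0, 0, 0, 0, 0, "+", "transcript1",
              q_size, q_start, q_end, "chr1", 1000000, 5000, 5000 + span,
              1, f"{span},", f"{q_start},", "5000,"]
    return "\t".join(str(x) for x in fields)


good = make_psl(980, 10, 1000, 0, 990)        # 99% of query, ~99% identity
short = make_psl(900, 0, 1000, 0, 900)        # only 90% of the query aligned
divergent = make_psl(940, 60, 1000, 0, 1000)  # full length, 94% identity
print(psl_passes(good), psl_passes(short), psl_passes(divergent))
```

Only the first toy record satisfies both conditions; the other two each fail one threshold, which is the behavior the filtering rule implies.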