Prunus sibirica F106 Whole Genome v1.0 Assembly & Annotation
Publication: submitted for publication
Authors list: Lin Wang#, Mengzhen Huang#, Zhuo Chen#, Jiao Zhang#, Xuewen Wang#, Gaopu Zhu, Han Zhao, Huimin Liu, Qiang Gao, Hongwei Lu, Wanyu Xu, Ningning Gou, Huilong Du, Xiuxiu Li, Chen Chen, Haikun Bai, Xuchun Zhu, Chu Wang, Yujing Zhang, Tiezhu Li, Wenquan Bao, Qiuping Zhang, Xiaoli Zhang, Depeng Wang, Jiang Hu, Jingjing Li, Jeffrey L. Bennetzen*, Chengzhi Liang*, Ta-na Wuyun*
#These authors contributed equally: Lin Wang, Mengzhen Huang, Zhuo Chen, Hongwei Lu, Xuewen Wang.
*Corresponding authors: Jeffrey L. Bennetzen, Chengzhi Liang, Ta-na Wuyun.
Apricot is an economically and ecologically important species that has been cultivated for thousands of years. However, the genetic foundations of its origin, domestication, and agronomic traits remain unclear. Here, we provide chromosome-level genome sequence assemblies of the wild species Prunus sibirica (‘F106’) and two cultivated apricots (‘Sungold’ and ‘Longwangmao’). With scaffold N50 lengths of 10.5-14.1 Mb, the genome assemblies are 218-225 Mb in size and contain 32,669-32,987 predicted protein-encoding genes. We also obtained resequencing data for 307 accessions of different geographic origins. Our population genomic analyses indicate that kernel-consumption apricot (KCA) germplasm has both extensive heterozygosity and admixture from fresh-fruit P. armeniaca (FFA), which indicates very recent hybridization between P. sibirica and FFA. Using genome-wide association studies, we identified multiple genomic regions and candidate genes significantly associated with 12 important fruit and kernel traits, such as kernel taste. This study provides important resources for the genetic improvement of apricot germplasm.
Genome facts and statistics:
Three genomes, one for the wild apricot species (Prunus sibirica, accession ‘F106’) and two for phenotypically distinct subtypes of cultivated apricot (Prunus armeniaca, FFA accession ‘Sungold’ and KCA accession ‘Longwangmao’), were sequenced and assembled using a combination of long-read sequencing (PacBio), optical mapping (BioNano), and high-throughput chromatin conformation capture (Hi-C) technologies. The final genome assemblies covered 217 Mb with a scaffold N50 length of 11.55 Mb for ‘Sungold’, 219 Mb with a scaffold N50 length of 10.50 Mb for ‘F106’, and 225 Mb with a scaffold N50 length of 14.10 Mb for ‘Longwangmao’, where N50 is the minimum scaffold length needed to cover 50% of the genome. A total of 32,669, 32,959, and 32,987 protein-encoding genes, with an average coding sequence length of ~1 kb and an average of six exons per gene, were annotated in the ‘Sungold’, ‘F106’, and ‘Longwangmao’ genomes, respectively.
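The N50 definition above can be illustrated with a short Python sketch. It sorts scaffold lengths from longest to shortest and returns the length of the scaffold at which the running total first reaches half of the assembly size. The function name n50 and the toy lengths are hypothetical and are not taken from the F106 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are added from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the
    summed assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical toy lengths (arbitrary units), not real assembly data.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

For a real assembly, the lengths would come from the scaffold FASTA file, for example by summing the sequence lines under each header.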
Homology of the Prunus sibirica genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr database (Release 2018-05) and less than 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
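A best-hit report of this kind could be derived roughly as in the sketch below. It assumes the blastp results are in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12), which the text above does not specify, and that the best hit is the one with the highest bit score. The function name best_hits and the example rows are hypothetical.

```python
import csv
import io


def best_hits(handle, max_evalue):
    """Keep, per query, the highest-scoring hit whose E-value is below max_evalue.

    Assumes 12-column tabular BLAST output (an assumption, not stated in
    the source): column 11 is the E-value and column 12 is the bit score.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= max_evalue:
            continue  # the cutoff is "less than", so equal values are dropped
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example rows, not real F106 results.
example = (
    "geneA\tsubj1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "geneA\tsubj2\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t410\n"
    "geneB\tsubj3\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-7\t45\n"
)
for query, hit in best_hits(io.StringIO(example), max_evalue=1e-6).items():
    print(query, hit)
```

Calling the same function with max_evalue=1e-9 would apply the stricter cutoff described for NCBI nr.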
The Prunus sibirica F106 v1.0 genome gene prediction files are available in FASTA and GFF3 formats.
Functional annotations for the Prunus sibirica Genome v1.0 are available for download below. The Prunus sibirica Genome v1.0 proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).
Transcript alignments were performed by the GDR Team of the Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the Prunus sibirica genome assembly. Alignments with an alignment length of 97% and 97% identity were preserved. The available files are in GFF3 format.
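The filtering step can be sketched in Python as follows. This is a minimal illustration, not the GDR pipeline: it assumes BLAT's default PSL output, treats both 97% values as minimum thresholds, takes alignment length as the aligned fraction of the transcript, and computes identity as matching bases over aligned bases (a simplification that ignores gaps). The function name passes_filter and the example records are hypothetical.

```python
def passes_filter(psl_line, min_coverage=0.97, min_identity=0.97):
    """Return True if a PSL alignment meets both thresholds.

    PSL columns used (0-based): 0 matches, 1 misMatches, 2 repMatches,
    10 qSize, 11 qStart, 12 qEnd. The coverage and identity formulas are
    assumptions made for this sketch.
    """
    fields = psl_line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = (int(fields[i]) for i in (0, 1, 2))
    q_size, q_start, q_end = (int(fields[i]) for i in (10, 11, 12))
    aligned = matches + mismatches + rep_matches
    if aligned == 0 or q_size == 0:
        return False
    identity = (matches + rep_matches) / aligned
    coverage = (q_end - q_start) / q_size
    return coverage >= min_coverage and identity >= min_identity


# Hypothetical PSL records (21 tab-separated columns), not real alignments.
good = "\t".join([
    "980", "10", "0", "0", "0", "0", "0", "0", "+", "tx1", "1000", "5", "995",
    "chr1", "50000", "100", "1090", "1", "990,", "5,", "100,",
])
poor = "\t".join([
    "900", "60", "0", "0", "0", "0", "0", "0", "+", "tx2", "1000", "0", "960",
    "chr1", "50000", "200", "1160", "1", "960,", "0,", "200,",
])
print(passes_filter(good), passes_filter(poor))
```

Alignments that pass such a filter would then be converted to GFF3, the format of the files offered for download.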