Fragaria x ananassa Camarosa Genome Assembly v1.0 & Annotation v1.0.a1
Edger PP, Poorten TJ, VanBuren R, Hardigan MA, Colle M, McKain MR, Smith RD, Teresi SJ, Nelson ADL, Wai CM, Alger EI, Bird KA, Yocca AE, Pumplin N, Ou S, Ben-Zvi G, Brodt A, Baruch K, Swale T, Shiue L, Acharya CB, Cole GS, Mower JP, Childs KL, Jiang N, Lyons E, Freeling M, Puzey JR, Knapp SJ. Origin and evolution of the octoploid strawberry genome. Nature genetics. 2019 Feb 25.
About the Assembly
A near-complete chromosome-scale assembly for cultivated octoploid strawberry (Fragaria × ananassa).
Sequencing, Assembly, and Annotation
The genome of the cultivar 'Camarosa' was sequenced using a combination of short- and long-read approaches, including Illumina (San Diego, CA), 10X Genomics (Pleasanton, CA), and Pacific Biosciences (PacBio; Menlo Park, CA), totaling 615-fold coverage of the genome. Illumina (455-fold coverage) and 10X Genomics (117-fold coverage) data were assembled and scaffolded using the software package DenovoMAGIC3 (NRGene, Nes Ziona, Israel). The genome was further scaffolded to chromosome scale using Hi-C data (401-fold coverage) in combination with the HiRise pipeline (Dovetail, Santa Cruz, CA), and gap-filled with 43-fold coverage of error-corrected PacBio reads using PBJelly. The final assembly totals 805,488,706 bp distributed across 28 chromosome-level pseudomolecules, representing ~99% of the genome size estimated by flow cytometry. A genetic map for F. x ananassa was used to correct misassemblies, and comparisons to F. vesca were used to identify homoeologous chromosomes.
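The assembly totals quoted above (sequence count and total length) can be recomputed from any FASTA file. The following is a minimal sketch that assumes an uncompressed, plain-text FASTA stream; the function names and the toy sequences are hypothetical and are not part of the released files.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_id: length} for a plain-text FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            name = header[0] if header else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarize(lengths, estimated_genome_size=None):
    """Summarize sequence count and total bases; optionally compare to an estimate."""
    total = sum(lengths.values())
    summary = {"sequences": len(lengths), "total_bp": total}
    if estimated_genome_size:
        summary["fraction_of_estimate"] = total / estimated_genome_size
    return summary


# Toy input (hypothetical sequences, not real assembly data).
toy = io.StringIO(">chr1 demo\nACGTAC\nGT\n>chr2\nNNNNACGT\n")
toy_lengths = fasta_lengths(toy)
toy_summary = summarize(toy_lengths)
print(toy_summary)
```

For the real assembly, the same functions could be applied to an opened FASTA file; the result would be expected to report 28 pseudomolecules if the file contains only the chromosome-level sequences.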
A total of 108,087 protein-coding genes were annotated, along with 30,703 long non-coding RNA (lncRNA) genes, which are divided into 15,621 long intergenic ncRNAs (lincRNAs), 9,265 antisense overlapping transcripts (AOT-lncRNAs), and 5,817 sense overlapping transcripts (SOT-lncRNAs). Gene annotation and genome assembly quality were evaluated using the Benchmarking Universal Single-Copy Orthologs (BUSCO) v2 method. Most (99.17%) of the 1,440 core genes in the embryophyta dataset were identified in the annotation, supporting a high-quality genome assembly. The repetitive components of the nuclear genome were annotated using a custom repeat library approach, including DNA transposons, long-terminal-repeat retrotransposons (LTR-RTs; e.g., Copia and Gypsy), and non-LTR retrotransposons. Transposable element (TE)-related sequences make up ~36% of the total genome assembly, with LTR-RTs being the most abundant (~28%). The plastid and mitochondrial genomes were also assembled, annotated, and verified for completeness.
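The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the implied BUSCO gene count and the LTR-RT share of TE sequence are derived values, not figures reported by the authors, and the variable names are illustrative.

```python
# lncRNA classes as reported in the annotation summary.
lncrna_classes = {
    "long_intergenic": 15_621,
    "antisense_overlapping": 9_265,
    "sense_overlapping": 5_817,
}
lncrna_total = sum(lncrna_classes.values())

# BUSCO: 99.17% of the 1,440 embryophyta core genes were identified.
busco_core_genes = 1_440
busco_found_pct = 99.17
busco_found = round(busco_core_genes * busco_found_pct / 100)  # derived, not reported

# Approximate genome fractions (the source gives these as ~36% and ~28%).
te_fraction = 0.36
ltr_rt_fraction = 0.28
ltr_share_of_te = ltr_rt_fraction / te_fraction  # derived, approximate

print(lncrna_total, busco_found, round(ltr_share_of_te, 2))
```

The three lncRNA classes sum to the reported total of 30,703, and the BUSCO percentage corresponds to roughly 1,428 of the 1,440 core genes.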
All assembly and annotation files are available for download by selecting the desired data type in the left-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can browse all available files on the FTP repository.
The Fragaria x ananassa Camarosa v1.0 genome assembly file is available in FASTA format.
The Fragaria x ananassa Camarosa genome v1.0.a1 gene prediction files:
Functional annotations for the Fragaria x ananassa Camarosa genome v1.0.a1 are available for download below. The Fragaria x ananassa Camarosa genome v1.0 proteins were analyzed using InterProScan to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).
The following functional annotations for the Fragaria x ananassa Camarosa genome v1.0.a1 were provided by the original institute:
Homology of the Fragaria x ananassa Camarosa genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for NCBI nr (Release 2018-05), and a cutoff of less than 1e-6 was used for the Arabidopsis proteins (TAIR10), UniProtKB/SwissProt (Release 2018-04), and UniProtKB/TrEMBL (Release 2018-04) databases. The best hit reports are available for download in Excel format.
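To illustrate how database-specific E-value cutoffs and best-hit selection of this kind can be applied, here is a minimal sketch. It assumes BLAST tabular output (the standard 12-column `-outfmt 6` layout, with E-value in column 11 and bit score in column 12) and picks the highest bit score per query; the source does not state which output format or tie-breaking rule was actually used. The database keys, function name, and toy rows are hypothetical.

```python
import csv
import io

# E-value cutoffs from the text; the dictionary keys are illustrative labels.
CUTOFFS = {"nr": 1e-9, "tair10": 1e-6, "swissprot": 1e-6, "trembl": 1e-6}


def best_hits(handle, database):
    """Return {query: (subject, evalue, bitscore)} for hits below the cutoff."""
    cutoff = CUTOFFS[database]
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue >= cutoff:
            continue  # keep only hits with E-value strictly below the cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy tabular BLAST rows (hypothetical identifiers and scores).
toy_rows = "\n".join([
    "geneA\tsubj1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "geneA\tsubj2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t250",
    "geneB\tsubj3\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-7\t45",
])

nr_hits = best_hits(io.StringIO(toy_rows), "nr")
tair_hits = best_hits(io.StringIO(toy_rows), "tair10")
print(sorted(nr_hits), sorted(tair_hits))
```

In the toy data, the geneB hit (E-value 1e-7) passes the 1e-6 cutoff but not the stricter 1e-9 cutoff used for NCBI nr, which shows why the same protein can have a reported best hit in one database and none in another.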