Fragaria iinumae Genome v1.0 Assembly & Annotation
Edger, P.P., McKain, M.R., Yocca, A.E. et al. Reply to: Revisiting the origin of octoploid strawberry. Nat Genet (2019) doi:10.1038/s41588-019-0544-2
Genome assembly and quality evaluation
Genome assembly was performed on PacBio long reads using FALCON v0.3.0 (GitHub, 2018. Mar 18). Total genome coverage (~172X) before assembly was estimated as the total number of bases in the PacBio reads divided by the genome size (265.56 Mb) of F. iinumae. Error correction and pre-assembly were carried out with the FALCON pipeline after evaluating the outcomes of different FALCON parameters during the pre-assembly process. The draft genome, with a contig N50 of >10 Mb, was polished with Arrow using all SMRT reads and with Pilon v1.22 using the Illumina reads (~107X coverage) with the default settings. A GC depth analysis was conducted to assess potential contamination during sequencing and the coverage of the assembly. The completeness of the genome assembly was also evaluated using BUSCO (Benchmarking Universal Single-Copy Orthologs) software. A high-density linkage map of F. iinumae had previously been constructed from 4173 markers, 3280 from the Array and 893 from genotyping by sequencing. Here we anchored the contigs to this genetic map to obtain a chromosome-scale genome of F. iinumae.
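The two summary statistics quoted above, read coverage and contig N50, can be illustrated with a minimal Python sketch. The function names are hypothetical, and the total read base count in the usage note is back-calculated from the reported ~172X and 265.56 Mb for illustration only; it is not a figure from the assembly itself.

```python
from typing import Iterable


def coverage(total_read_bases: int, genome_size: int) -> float:
    """Estimated coverage: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty input; kept for clarity
```

For example, coverage(45_676_320_000, 265_560_000) gives approximately 172, matching the reported estimate, and n50 can be applied to the contig lengths read from the assembly FASTA index.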
For repeat detection, four software packages, i.e., RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html), RepeatScout (http://www.repeatmasker.org), Piler (http://www.drive5.com/piler/), and LTR-Finder (http://tlife.fudan.edu.cn/ltr_finder), were used with the default settings to build a de novo repeat library on the basis of our assembly. To identify known transposable elements (TEs) in the genome, RepeatMasker (http://www.repeatmasker.org) was used to screen the assembled genome against the Repbase v22.11 and Mips-REdat libraries. We constructed a de novo long terminal repeat retrotransposon (LTR-RT) library by scanning the assembled F. iinumae genome using LTRharvest (-motif tgca -motifmis 1) and LTR_Finder (LTR length 100-5000 nt, length between two LTRs: 1000-20000 nt).
Homology-based, de novo-based, and RNA-sequencing (RNA-seq)-based gene prediction methods were used in combination to identify the protein-coding genes in the F. iinumae genome assembly. For the homology-based prediction, protein sequences of Arabidopsis thaliana, Oryza sativa, Solanum lycopersicum, Fragaria vesca, and Malus domestica were used as the references. For the de novo-based prediction, Augustus v2.4, GlimmerHMM v3.0.4, SNAP v2006, GeneID v1.4, and Genscan were used with default parameters. All software was trained using the 1000 full-length genes from the homology-based predictions and Arabidopsis gene model before gene prediction (Supplementary Table 2 & 4). For the RNA-seq-based prediction, TransDecoder v2.0 (http://transdecoder.github.io), GeneMarkS-T v5.1, and PASA v2.0.2 were used. Finally, the results from the three methods were integrated using EVM v1.1.1.
All the genes were annotated by aligning to the Nucleotide collection (NR), Swiss-Prot, and Kyoto Encyclopedia of Genes and Genomes (KEGG, release 84.0) databases. Then, the InterProScan package was used to annotate the predicted genes using the InterPro (5.21–60.0) database.
Homology of the Fragaria iinumae genome v1.0 proteins was determined by pairwise sequence comparison using the blastp algorithm against various protein databases. An expectation value (E-value) cutoff of less than 1e-9 was used for the NCBI nr (Release 2018-05) database, and a cutoff of 1e-6 for the Arabidopsis proteins (Araport11), UniProtKB/SwissProt (Release 2019-01), and UniProtKB/TrEMBL (Release 2019-01) databases. The best hit reports are available for download in Excel format.
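As an illustration of how a best-hit report can be derived from blastp results, the sketch below applies an E-value cutoff and keeps the highest-scoring hit per query. It assumes the standard 12-column BLAST tabular output (-outfmt 6), which is not stated on this page; the function name and the sample identifiers are hypothetical, and ranking by bit score is an assumption about what "best hit" means here.

```python
from typing import Dict, Tuple

# Column positions in standard BLAST tabular output (-outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
QUERY, SUBJECT, EVALUE, BITSCORE = 0, 1, 10, 11


def best_hits(tabular_text: str, max_evalue: float) -> Dict[str, Tuple[str, float, float]]:
    """Return {query: (subject, evalue, bitscore)} for hits with evalue < max_evalue."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[EVALUE])
        bitscore = float(fields[BITSCORE])
        if evalue >= max_evalue:
            continue  # the cutoff is "less than", so equal values are dropped
        query = fields[QUERY]
        if query not in best or bitscore > best[query][2]:
            best[query] = (fields[SUBJECT], evalue, bitscore)
    return best


# Hypothetical sample rows; identifiers are placeholders, not real accessions.
SAMPLE = "\n".join([
    "prot1\tsubjA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-80\t300",
    "prot1\tsubjB\t99.0\t200\t2\t0\t1\t200\t1\t200\t1e-100\t380",
    "prot2\tsubjC\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-7\t45",
])
```

With these rows, best_hits(SAMPLE, 1e-9) keeps only the subjB hit for prot1, while the looser 1e-6 cutoff also retains the prot2 hit.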
All assembly and annotation files are available for download by selecting the desired data type in the right-hand "Resources" sidebar. Each data type page provides a description of the available files and links to download them. Alternatively, you can use the FTP repository for bulk download.
Functional annotations for the Fragaria iinumae genome v1.0 are available for download below. The Fragaria iinumae genome v1.0 proteins were analyzed using InterProScan in order to assign InterPro domains and Gene Ontology (GO) terms. Pathway analysis was performed using the KEGG Automatic Annotation Server (KAAS).
Transcript alignments were performed by the GDR Team of Main Bioinformatics Lab at WSU. The alignment tool 'BLAT' was used to map transcripts to the Fragaria iinumae genome assembly. Alignments with an alignment length of 97% and 97% identity were preserved. The available files are in GFF3 format.
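The 97% filter described above can be sketched in Python against BLAT's native PSL output. This is only one plausible reading of the criteria: it assumes that "alignment length" means the aligned fraction of the transcript, that identity is matches over aligned bases, and that both thresholds are inclusive minimums. The function name is hypothetical, and the exact formulas used by the GDR Team are not given on this page.

```python
def passes_filter(psl_line: str, min_coverage: float = 0.97, min_identity: float = 0.97) -> bool:
    """Check one PSL record (21 tab-separated fields) against both thresholds."""
    fields = psl_line.rstrip("\n").split("\t")
    matches = int(fields[0])       # bases that match, excluding repeats
    mismatches = int(fields[1])    # bases that do not match
    rep_matches = int(fields[2])   # matching bases that fall in repeats
    q_size = int(fields[10])       # transcript (query) length
    q_start = int(fields[11])      # alignment start in the query
    q_end = int(fields[12])        # alignment end in the query
    aligned = matches + mismatches + rep_matches
    if aligned == 0 or q_size == 0:
        return False
    # Assumed definitions: query span over query length, matches over aligned bases.
    coverage = (q_end - q_start) / q_size
    identity = (matches + rep_matches) / aligned
    return coverage >= min_coverage and identity >= min_identity
```

Records that pass would then be converted to GFF3; that conversion step is outside this sketch.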