Malus x domestica Whole Genome v2.0 Assembly
The Malus x domestica genome v2.0 pseudo haplotype assemblies are a set of four assemblies (a primary and alternates 1, 2 and 3) derived from the contigs of the original v1.0 assembly (NCBI Accession: PRJNA28845, Velasco et al. 2010). The pseudo haplotypes divide the overlapping contigs of the original assembly into four consensus sequences, intended to represent the four different haplotypes present in the apple genome. These assemblies are not true haplotypes, hence the name "pseudo haplotypes". A set of four haplotypes was previously constructed to accompany the v1.0 assembly, but only the primary haplotype, called v1.0p, is available on GDR. The new v2.0 assembly was obtained from the contig sequences of the v1.0 assembly by removing 34,882 problematic contigs. A primary assembly, representing approximately 80% of the assembled and anchored genome, and three alternate assemblies were produced following NCBI’s specifications. As in the first set of pseudo haplotypes, gaps of 200,000 N's were used to space scaffolds in the assembly. The pseudo haplotype assemblies were constructed at the Fondazione Edmund Mach (http://www.fmach.it/eng/CRI).
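The effect of the 200,000 N spacers on coordinates can be illustrated with a minimal Python sketch. The function name `join_scaffolds`, the scaffold names, and the 0-based half-open coordinate convention are assumptions made for illustration; only the gap length comes from the description above.

```python
GAP_LENGTH = 200_000  # spacer length stated in the assembly description


def join_scaffolds(scaffolds, gap_length=GAP_LENGTH):
    """Concatenate (name, sequence) pairs, separated by runs of N.

    Returns the joined sequence and a list of (name, start, end)
    placements in 0-based, half-open coordinates (an assumed convention).
    """
    parts = []
    placements = []
    offset = 0
    for i, (name, seq) in enumerate(scaffolds):
        if i > 0:
            # Insert a spacer before every scaffold except the first.
            parts.append("N" * gap_length)
            offset += gap_length
        placements.append((name, offset, offset + len(seq)))
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), placements


# Toy example with a short gap so the result is easy to read.
joined, placements = join_scaffolds([("s1", "ACGT"), ("s2", "GG")], gap_length=5)
```

With the default gap length, the second scaffold placed on a pseudomolecule would start 200,000 bases after the end of the first.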
The Primary Assembly
The primary pseudo haplotype assembly was constructed from the original assembly contigs in the following way. First, small contigs were removed from the dataset, and contigs with large gaps were split. A weighted network graph was then constructed with contigs as nodes and distances between contigs as edge weights. Dijkstra's algorithm was used to find the shortest path through the nodes along the genome, using only edges between non-overlapping contigs (positive edge weight). The primary assembly represents the minimal number of gaps between non-overlapping contigs used to reconstitute the pseudomolecules (chromosomes) of the genome.
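The graph search described above can be sketched in Python as follows. This is a simplified illustration, not the pipeline used at the Fondazione Edmund Mach: it assumes each contig has known start and end coordinates on a chromosome, uses the gap length between two non-overlapping contigs as the edge weight, and adds hypothetical `SOURCE` and `SINK` nodes at the two ends of the chromosome. Contig names and coordinates are invented.

```python
import heapq


def min_gap_path(contigs):
    """Find the chain of non-overlapping contigs with the smallest total gap.

    contigs: dict mapping contig name -> (start, end), half-open coordinates.
    Returns (total_gap, [contig names in order]).
    """
    lo = min(start for start, _ in contigs.values())
    hi = max(end for _, end in contigs.values())
    nodes = dict(contigs)
    nodes["SOURCE"] = (lo, lo)  # hypothetical zero-length start node
    nodes["SINK"] = (hi, hi)    # hypothetical zero-length end node

    def neighbors(u):
        # An edge u -> v exists only if v starts at or after the end of u,
        # so overlapping contigs are never adjacent on a path.
        u_end = nodes[u][1]
        for v, (v_start, _) in nodes.items():
            if v != u and v != "SOURCE" and v_start >= u_end:
                yield v, v_start - u_end

    # Standard Dijkstra with a binary heap.
    dist = {"SOURCE": 0}
    prev = {}
    heap = [(0, "SOURCE")]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == "SINK":
            break
        for v, w in neighbors(u):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    # Walk back from SINK to SOURCE to recover the chosen contigs.
    path = []
    node = prev["SINK"]
    while node != "SOURCE":
        path.append(node)
        node = prev[node]
    path.reverse()
    return dist["SINK"], path


toy_contigs = {
    "c1": (0, 100),
    "c2": (80, 200),
    "c3": (110, 250),
    "c4": (260, 400),
    "c5": (205, 390),
}
total_gap, primary_path = min_gap_path(toy_contigs)
```

In this toy example the path c1, c3, c4 leaves only 20 bases uncovered, whereas any path through c2 or c5 leaves larger gaps.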
The Alternate Assemblies
The three alternate assemblies were also constructed using a weighted network approach, but instead employed the Activity Selection Problem algorithm to greedily select nodes for inclusion in a haplotype, with the goal of maximizing the number of non-overlapping contigs in each assembly.
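The classic greedy solution to the Activity Selection Problem sorts intervals by end coordinate and keeps each interval that starts at or after the end of the last one kept. The sketch below applies it to contig intervals. Running the selection repeatedly on the contigs left over from earlier rounds is an assumption about how three alternates might be filled, not a documented detail of the original method; the function names and toy coordinates are also hypothetical.

```python
def select_non_overlapping(intervals):
    """Greedy activity selection on (name, start, end) tuples.

    Sorting by end coordinate and taking every compatible interval yields
    a maximum-size set of mutually non-overlapping intervals.
    Coordinates are treated as half-open, so touching intervals are allowed.
    """
    chosen = []
    last_end = float("-inf")
    for name, start, end in sorted(intervals, key=lambda c: c[2]):
        if start >= last_end:
            chosen.append(name)
            last_end = end
    return chosen


def build_alternates(intervals, n_alternates=3):
    """Assumed scheme: fill each alternate from the contigs not yet used."""
    remaining = list(intervals)
    alternates = []
    for _ in range(n_alternates):
        picked = select_non_overlapping(remaining)
        alternates.append(picked)
        picked_set = set(picked)
        remaining = [c for c in remaining if c[0] not in picked_set]
    return alternates


toy_intervals = [
    ("a", 0, 10), ("b", 5, 15), ("c", 10, 20),
    ("d", 12, 25), ("e", 20, 30), ("f", 8, 12),
]
alternates = build_alternates(toy_intervals)
```

Here the first round selects a, c and e, the second selects f and d, and the third is left with b.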
Assembly files for the Malus x domestica genome v2.0 pseudo haplotypes are divided among the four pseudo haplotypes. FASTA files of the pseudomolecules of the primary assembly and of the scaffolds of the alternate assemblies are available below. GFF3 files of the contig alignments within each assembly are also provided.
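For readers who want to inspect the downloads without extra dependencies, the following Python sketch reads sequence lengths from FASTA text and features from GFF3 text. It relies only on the standard layouts of the two formats (header lines starting with `>` in FASTA, nine tab-separated columns in GFF3). The sequence names, feature type, and attribute values in the toy input are invented and do not reflect the actual file contents.

```python
def read_fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the sequence name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def read_gff3_features(lines):
    """Return a list of feature dicts from an iterable of GFF3 lines."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        features.append({
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),  # GFF3 coordinates are 1-based, inclusive
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attrs,
        })
    return features


# Toy input; real files would be opened with open(path) instead.
fasta_text = ">chr1 example\nACGTNN\nACG\n>chr2\nGG\n"
gff_text = "##gff-version 3\nchr1\texample\tcontig\t1\t6\t.\t+\t.\tID=ctg1;Name=ctg1\n"
lengths = read_fasta_lengths(fasta_text.splitlines())
features = read_gff3_features(gff_text.splitlines())
```

Both functions accept any iterable of lines, so an open file handle can be passed directly when working with large files.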