Malus x domestica Whole Genome v1.0p Assembly & Annotation
The Malus x domestica genome v1.0 pseudo haplotype assemblies are a set of four assemblies (primary and alternates 1, 2 and 3) derived from the contigs of the original v1.0 assembly. These pseudo haplotypes are intended to divide overlapping contigs in the original assembly into four consensus sequences representing the four haplotypes present in the apple genome. However, these assemblies are not true haplotypes, hence the name "pseudo haplotypes". Currently, only the primary pseudo haplotype assembly is available on GDR. Gaps of 200,000 N's were used to space scaffolds in the assembly. These pseudo haplotype assemblies were constructed at the Istituto Agrario di San Michele all'Adige in collaboration with NCBI.
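As a rough illustration of the gap convention described above, the following Python sketch joins scaffold sequences with a fixed run of N's and reports where each scaffold starts in the joined sequence. The function names and the toy sequences are hypothetical and are not taken from the actual assembly pipeline; only the gap length of 200,000 comes from the description.

```python
GAP_LENGTH = 200_000  # gap size stated in the assembly description


def join_scaffolds(scaffolds, gap_length=GAP_LENGTH):
    """Concatenate scaffold sequences, separated by runs of N's."""
    return ("N" * gap_length).join(scaffolds)


def scaffold_offsets(scaffolds, gap_length=GAP_LENGTH):
    """Return the 0-based start of each scaffold in the joined sequence."""
    offsets = []
    position = 0
    for sequence in scaffolds:
        offsets.append(position)
        position += len(sequence) + gap_length
    return offsets


# Hypothetical toy scaffolds, with a short gap so the result is readable.
toy = ["ACGT", "GG", "TTA"]
joined = join_scaffolds(toy, gap_length=3)
starts = scaffold_offsets(toy, gap_length=3)
```

With the real gap length, any coordinate conversion between scaffold and pseudomolecule positions would need to account for 200,000 bases of N's between consecutive scaffolds.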
The Primary Assembly
The primary pseudo haplotype assembly was constructed from the original assembly contigs in the following way. First, small contigs were removed from the dataset, and contigs with large gaps were split. A weighted network graph was then constructed with contigs as nodes and distances between contigs as edge weights. Dijkstra's algorithm was used to find the shortest path between all nodes along the genome without overlaps (positive edge weight). The primary assembly represents the minimal number of gaps between non-overlapping contigs used to reconstitute the pseudomolecules (chromosomes) of the genome.
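The graph search described above can be sketched in Python as follows. This is a minimal illustration of Dijkstra's algorithm on a contig graph, not the code used to build the assembly: the graph layout (a dictionary of neighbor lists), the function name, and the toy contigs are all assumptions. Edges with a non-positive weight are treated as overlaps and skipped, matching the "positive edge weight" condition.

```python
import heapq


def shortest_contig_path(edges, start, end):
    """Dijkstra's algorithm over a contig graph.

    edges maps a contig name to a list of (neighbor, distance) pairs.
    Only positive distances (no overlap) are followed.
    Returns (total_distance, path), or (None, []) if end is unreachable.
    """
    best = {start: 0}
    previous = {}
    heap = [(0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == end:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, gap in edges.get(node, []):
            if gap <= 0:
                continue  # overlapping contigs are excluded
            candidate = dist + gap
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if end not in best:
        return None, []
    # Walk back from the end contig to recover the path.
    path = [end]
    while path[-1] != start:
        path.append(previous[path[-1]])
    path.reverse()
    return best[end], path


# Hypothetical toy graph: distances in bp, negative means overlap.
toy_edges = {
    "c1": [("c2", 50), ("c3", 400)],
    "c2": [("c3", -20), ("c4", 100)],
    "c3": [("c4", 30)],
    "c4": [],
}
total_gap, path = shortest_contig_path(toy_edges, "c1", "c4")
```

In this toy graph the route c1, c2, c4 is chosen because its summed distance (150) is smaller than the route through c3 (430), and the overlapping edge from c2 to c3 is never followed.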
The Alternate Assemblies
The three alternate assemblies were also constructed using a weighted network approach, but instead employed the Activity Selection Problem algorithm to greedily select nodes for inclusion in a haplotype, with the goal of maximizing the number of non-overlapping contigs in each of the assemblies.
All assembly and annotation files are available for download by selecting the desired data type in the right-hand sidebar. Each data type page provides a description of the available files and links to download them.
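For readers unfamiliar with the Activity Selection Problem, the classic greedy solution is sketched below in Python. It assumes each contig can be reduced to a (name, start, end) interval on a common coordinate axis with half-open coordinates; that representation, the function name, and the toy data are illustrative assumptions, and the sketch does not show how the three alternates were separated from one another.

```python
def select_non_overlapping(contigs):
    """Greedy activity selection over (name, start, end) intervals.

    Sorting by end coordinate and always taking the next interval that
    starts at or after the last chosen end yields a maximum-size set of
    mutually non-overlapping intervals.
    """
    chosen = []
    last_end = None
    for name, start, end in sorted(contigs, key=lambda c: c[2]):
        if last_end is None or start >= last_end:
            chosen.append(name)
            last_end = end
    return chosen


# Hypothetical contig placements (half-open coordinates, in bp).
toy_contigs = [
    ("a", 0, 100),
    ("b", 50, 150),
    ("c", 100, 200),
    ("d", 180, 300),
    ("e", 200, 260),
]
selected = select_non_overlapping(toy_contigs)
```

Here contigs a, c and e are kept, while b and d are dropped because each overlaps a contig that was already selected.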
Assembly files for the Malus x domestica genome v1.0 pseudo haplotypes are available in both GFF and FASTA format below.
For more information on the files available below, please see the description provided on the pseudo haplotype assembly details page.
The gene predictions for the pseudo haplotype assemblies are the same consensus set from the original assembly, but have been mapped to the pseudomolecules of the haplotypes.
5' and 3' UTRs are currently not available for the gene models.
Repeats were predicted for the original v1.0 combined assembly using read depth information from the genome assembly contigs. The resulting FASTA file of repeats was then used as a repeat library for RepeatMasker, which was used to predict repeats on the v1.0 primary pseudo haplotype assembly.