Malus x domestica Whole Genome v1.0p Assembly & Annotation

Analysis Name: Malus x domestica Whole Genome v1.0p Assembly & Annotation
Method: Four contig tiling patterns used to generate four pseudo-haplotypes
Source: Sanger reads
Date performed: 2012-08-20

The Malus x domestica genome v1.0 pseudo haplotype assemblies are a set of four assemblies (primary and alternates 1, 2 and 3) derived from the contigs of the original v1.0 assembly. They are intended to divide the overlapping contigs of the original assembly into four consensus sequences representing the four haplotypes present in the apple genome. However, these assemblies are not true haplotypes, hence the name "pseudo haplotypes". Currently, only the primary pseudo haplotype assembly is available on GDR. Gaps of 200,000 N's were used to space scaffolds in the assembly. These pseudo haplotype assemblies were constructed at the Istituto Agrario di San Michele all'Adige in collaboration with NCBI.
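Because scaffolds are spaced by runs of N's, the scaffold blocks inside a pseudomolecule can in principle be recovered by scanning for those runs. The following is a minimal sketch, not a tool distributed with the assembly; the function name split_on_spacers and the treatment of any N run at least as long as the spacer as a boundary are assumptions made for illustration.

```python
import re

GAP_LENGTH = 200_000  # spacer length stated in the description above


def split_on_spacers(sequence, gap_length=GAP_LENGTH):
    """Return 0-based, half-open (start, end) coordinates of the blocks
    separated by runs of at least `gap_length` N characters.

    Shorter N runs (for example gaps inside a scaffold) are left in place.
    """
    blocks = []
    pos = 0
    for run in re.finditer(r"[Nn]+", sequence):
        if run.end() - run.start() < gap_length:
            continue  # too short to be a scaffold spacer (assumption)
        if run.start() > pos:
            blocks.append((pos, run.start()))
        pos = run.end()
    if pos < len(sequence):
        blocks.append((pos, len(sequence)))
    return blocks
```

For a real pseudomolecule the default gap_length applies; a smaller value is convenient only for testing on toy sequences.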

The Primary Assembly

The primary pseudo haplotype assembly was constructed from the original assembly contigs in the following way. First, small contigs were removed from the dataset, and contigs with large gaps were split. A weighted network graph was then constructed with contigs as nodes and distances between contigs as edge weights. Dijkstra's algorithm was used to find the shortest path between all nodes along the genome without overlaps (positive edge weight). The primary assembly represents the minimal number of gaps between non-overlapping contigs used to reconstitute the pseudomolecules (chromosomes) of the genome.
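The graph approach described above can be illustrated with a small sketch. This is one possible reading of the description, not the pipeline that was actually run: contigs are assumed to have known coordinates on a chromosome, an edge joins contig A to contig B only when B starts at or after the end of A, and the edge weight is the size of the gap between them. The function name min_gap_tiling and the virtual source and sink nodes are hypothetical.

```python
import heapq


def min_gap_tiling(contigs, chrom_length):
    """contigs: dict of name -> (start, end), 0-based half-open coordinates.

    Returns (total_gap, [contig names]) for the chain of non-overlapping
    contigs that leaves the fewest uncovered bases along the chromosome.
    """
    source, sink = "__source__", "__sink__"
    spans = dict(contigs)
    spans[source] = (0, 0)
    spans[sink] = (chrom_length, chrom_length)

    def neighbours(node):
        end = spans[node][1]
        for other, (start, _) in spans.items():
            # Only non-overlapping successors are connected (weight >= 0).
            if other not in (node, source) and start >= end:
                yield other, start - end

    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == sink:
            break
        for other, weight in neighbours(node):
            candidate = d + weight
            if candidate < dist.get(other, float("inf")):
                dist[other] = candidate
                prev[other] = node
                heapq.heappush(heap, (candidate, other))

    # Walk back from the sink to recover the chosen contigs in order.
    path = []
    node = prev[sink]
    while node != source:
        path.append(node)
        node = prev[node]
    path.reverse()
    return dist[sink], path
```

With gap size as the weight, the shortest path favours chains that cover as much of the chromosome as possible; a weight of 1 per edge would instead minimise the number of joins.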

The Alternate Assemblies

The three alternate assemblies were also constructed using a weighted network approach, but instead employed the Activity Selection Problem algorithm to greedily select nodes for inclusion in a haplotype, with the goal of maximizing the number of non-overlapping contigs in each assembly.
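The classic greedy solution to the Activity Selection Problem sorts intervals by end coordinate and keeps each one that starts at or after the end of the last one kept. The sketch below applies it to contig intervals. How contigs were actually divided among the three alternates is not stated here, so the peel_layers helper, which reruns the selection on whatever contigs remain, is an assumption for illustration only, as are both function names.

```python
def select_non_overlapping(contigs):
    """contigs: iterable of (name, start, end), 0-based half-open.

    Greedy activity selection: take contigs in order of increasing end
    coordinate, keeping each one that does not overlap the last kept.
    This yields a maximum-size set of mutually non-overlapping contigs.
    """
    chosen = []
    last_end = float("-inf")
    for name, start, end in sorted(contigs, key=lambda c: (c[2], c[1])):
        if start >= last_end:
            chosen.append((name, start, end))
            last_end = end
    return chosen


def peel_layers(contigs, n_layers=3):
    """Hypothetical helper: build n_layers sets by repeated selection,
    removing the contigs chosen in each round before the next one."""
    remaining = list(contigs)
    layers = []
    for _ in range(n_layers):
        picked = {name for name, _, _ in select_non_overlapping(remaining)}
        layers.append([c[0] for c in remaining if c[0] in picked])
        remaining = [c for c in remaining if c[0] not in picked]
    return layers
```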


All assembly and annotation files are available for download by selecting the desired data type in the right-hand sidebar. Each data type page provides a description of the available files and links to download them.


Assembly files for the Malus x domestica genome v1.0 pseudo haplotypes are available in both GFF and FASTA format below.

For more information on the files available below, please see the description provided on the pseudo haplotype assembly details page.


Primary haplotype pseudomolecules (FASTA file) Malus_x_domestica.v1.0-primary.pseudo.fa.gz
Primary haplotype pseudomolecules (GFF file) Malus_x_domestica.v1.0-primary.pseudo.gff3.gz
Primary haplotype scaffold alignments (GFF file) Malus_x_domestica.v1.0-primary.scaffolds.gff3.gz
Primary haplotype scaffold sequences (FASTA file) Malus_x_domestica.v1.0-primary.scaffolds.fa.gz
Primary haplotype contig alignments (GFF file) Malus_x_domestica.v1.0-primary.contigs.gff3.gz
Repeat Masked pseudomolecules (FASTA file) Malus_x_domestica.v1.0-primary_masked.fa.gz


Gene Predictions

The gene predictions for the pseudo haplotype assemblies are the same consensus set from the original assembly, but have been mapped to the pseudomolecules of the haplotypes.

5' and 3' UTRs are currently not available for the gene models.


Consensus gene model CDSs (FASTA) Malus_x_domestica.v1.0-primary.CDS.fa.gz
Consensus gene model proteins (FASTA) Malus_x_domestica.v1.0-primary.protein.fa.gz
Consensus gene model mRNA (FASTA) Malus_x_domestica.v1.0-primary.mRNA.fa.gz
Consensus gene models (transcripts) (GFF3) Malus_x_domestica.v1.0-primary.transcripts.gff3.gz
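As a quick check after downloading one of the GFF3 files, the feature types it contains can be tallied. This is a minimal sketch that relies only on the standard nine-column, tab-separated GFF3 layout; the function name count_feature_types is hypothetical, and which feature types actually appear in these files is not stated here.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF3 features by the value in column 3 (the feature type).

    `lines` is any iterable of text lines, for example an open file handle.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts
```

To run it on a compressed download, open the file in text mode with gzip.open(path, "rt") from the standard library and pass the handle to the function.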



Repeats were predicted for the original v1.0 combined assembly using read depth information from the genome assembly contigs. The resulting FASTA file of repeats was then used as a repeat library for RepeatMasker to predict repeats on the v1.0 primary haplotype assembly.


Predicted repeats aligned to chromosomes (GFF file) Malus_x_domestica.v1.0-primary.repeats.gff3.gz
Predicted repeats aligned to chromosomes (FASTA file) Malus_x_domestica.v1.0-primary.repeats.fa.gz