Genome Naming Guideline

New Recommendations for Genome and Gene Nomenclature 

(Recommended by the AgBioData Consortium)

 

Genome Naming protocol:

  • Long version: [species] [sample identifier] [consortium]v[assembly version].[assembly subversion].a[annotation-version]
  • Short version: <ToLID>.<sample identifier>.<consortium>.<assembly version>.<assembly subversion>.a<annotation version>.<optional> (use this as the assembly prefix for gene naming below)

Examples:

  • Malus x domestica Honeycrisp v1.1 (long version)
  • drMalDome.HC.v1.1 (short version)

Where: 

  • species = the genus and species of the organism. Use the species identifiers from the Tree of Life project (ToLID) for the short version.
  • sample identifier = cultivar or specific line used for sequencing
  • consortium = consortium or organization (if necessary)
  • assembly version = the version of the assembly with a major and minor number.  The major version is incremented with major changes or releases of the assembly and the minor number is incremented when minor changes are made to the assembly.
  • annotation-version = a single numeric value that is incremented each time a new annotation is released; it restarts at 1 each time the assembly version is incremented.
  • optional = an optional section for data curator use; suggested uses include previous naming requirements not included in the standard or a secondary sample identifier.
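
The short-version pattern above can be sketched in code. This is a minimal illustration, not part of the recommendation: the function name and parameters are hypothetical, and the "v" before the assembly version is an assumption taken from the drMalDome.HC.v1.1 example rather than from the pattern itself. Components that are not supplied (consortium, annotation version, optional section) are simply left out.

```python
from typing import Optional


def short_assembly_name(
    tolid: str,
    sample: str,
    major: int,
    minor: int,
    annotation: Optional[int] = None,
    consortium: Optional[str] = None,
    optional: Optional[str] = None,
) -> str:
    """Join the short-version name components with dots (hypothetical helper)."""
    parts = [tolid, sample]
    if consortium:
        parts.append(consortium)
    # Assumption: the "v" prefix follows the drMalDome.HC.v1.1 example.
    parts.append(f"v{major}.{minor}")
    if annotation is not None:
        parts.append(f"a{annotation}")
    if optional:
        parts.append(optional)
    return ".".join(parts)


print(short_assembly_name("drMalDome", "HC", 1, 1))                # drMalDome.HC.v1.1
print(short_assembly_name("drMalDome", "HC", 1, 1, annotation=1))  # drMalDome.HC.v1.1.a1
```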

 

Gene Naming protocol:

  • Assembly prefix: the short version of the assembly name generated above. Used as a prefix, it unambiguously identifies which version of an assembly and annotation a gene model comes from.

  • Cellular location: [*|m|p]: * means no character and is used for nuclear genes, m is used for mitochondrial genes, and p is used for plastid genes.

  • Chromosome number: [2 digit chromosome #]: optional if sequences are contigs or scaffolds.

  • Entity – a set of defined entities can exist here (g for gene, p for protein, pan for pangene, and t for transcript) 

  • ID number: a unique numeric identifier generated for each gene model within the genome. Six digits should be sufficient for numbering all gene models within an assembly. The numbers can be assigned randomly or sequentially; sequential numbering is helpful for quickly identifying adjacent gene models.

  • Optional:

    • Sub-genome and chromosome: for species with polyploid genomes, some communities may find it helpful to include sub-genome and/or chromosome-level information. However, polyploid plant genomes can be quite dynamic, with subgenome chromosomal exchanges that may vary between individuals or populations. Including a subgenome designation in gene model names may therefore cause confusion and should be used with caution.

    • Transcript isoforms: for multi-exon genes, existing nomenclature for labeling transcript isoforms varies. In most cases, isoforms are labeled sequentially with a dot ‘.’ followed by the numeric order of the transcript, but this is not universal. In maize, the dot is replaced by “_” followed by the letter “T” (for transcript) and the order of the transcript with padding of two digits. We propose using dot notation: for example, '.1' is the first isoform, '.2' is the second, and so on.

  • Combining these elements will produce gene model identifiers with names in the style of <assembly_prefix><entity><ID number>. These elements would not require separators, as they can be parsed as <assembly><entity><6 digit number>.
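
As a rough sketch of how these elements combine, the hypothetical helper below assembles a gene model identifier. Several choices are assumptions drawn from the worked examples in this guideline rather than from the element list: a dot separates the assembly prefix from the remaining elements, nuclear chromosomes are written as "chr" plus a two-digit number and an optional sub-genome letter, and organelle chromosomes are written as the location letter plus a two-digit number.

```python
from typing import Optional

ENTITIES = {"g", "p", "t", "pan"}  # gene, protein, transcript, pangene
LOCATIONS = {"", "m", "p"}         # nuclear (no character), mitochondrial, plastid


def gene_model_id(
    prefix: str,
    number: int,
    entity: str = "g",
    location: str = "",
    chromosome: Optional[int] = None,
    subgenome: str = "",
) -> str:
    """Build an identifier from the naming elements (hypothetical helper)."""
    if entity not in ENTITIES:
        raise ValueError(f"unknown entity code: {entity!r}")
    if location not in LOCATIONS:
        raise ValueError(f"unknown cellular location code: {location!r}")
    if not 0 <= number <= 999_999:
        raise ValueError("ID number must fit in six digits")

    tail = f"{entity}{number:06d}"  # entity code plus zero-padded six-digit ID
    if chromosome is None:
        # No chromosome segment: the location code sits directly before the entity.
        return f"{prefix}.{location}{tail}"
    if location == "":
        # Assumption: "chr" label and sub-genome suffix as in the nuclear example.
        segment = f"chr{chromosome:02d}{subgenome}"
    else:
        segment = f"{location}{chromosome:02d}"
    return f"{prefix}.{segment}.{tail}"


print(gene_model_id("drMalDome.HC.v1.1.a1", 10, location="m"))
```

Sequential ID numbers keep neighbouring gene models next to each other when identifiers are sorted, which is the benefit the ID number item describes.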

Examples: (These examples do not include a consortium, but including one is recommended for future nomenclature.)

Gene examples:

  • With chromosome numbers:
    • Nuclear gene: drMalDome.HC.v1.1.a1.chr01A.g000010
    • Mitochondrial gene: drMalDome.HC.v1.1.a1.m01.g000010
    • Plastid gene: drMalDome.HC.v1.1.a1.p01.g000010
  • Without chromosome numbers:
    • Nuclear gene: drMalDome.HC.v1.1.a1.g000010
    • Mitochondrial gene: drMalDome.HC.v1.1.a1.mg000010
    • Plastid gene: drMalDome.HC.v1.1.a1.pg000010

mRNA/transcript examples:

  • With chromosome numbers:
    • Nuclear gene: drMalDome.HC.v1.1.a1.chr01A.g000010.t1
  • Without chromosome numbers:
    • Nuclear gene: drMalDome.HC.v1.1.a1.g000010.t1

 

If you have any questions, please contact us.

For Prunus genomes, please follow the chromosome terminology used for Prunus persica, which has been accepted by the community. Discrepancies have been noted in some genomes, such as Prunus mume.

Chromosome   Peach Lovell                  Prunus mume            Size (Mbp) in
             (Dirlewanger et al. 2004;     (Zhang et al. 2012)    Lovell v2.a1
             Verde et al. 2013)
1            1                             2↓                     47.85
2            2                             5                      30.41
3            3                             4                      27.37
4            4                             3                      25.84
5            5                             7                      18.50
6            6                             1                      30.77
7            7                             8                      22.39
8            8                             6↓                     22.57

Note that Prunus mume chromosomes 2 and 6 (marked ↓; corresponding to Lovell chromosomes 1 and 8, respectively) are reversed.
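
For curators who need to translate Prunus mume chromosome numbers into the Lovell terminology, the correspondence in the table can be written as a small lookup. This is only a sketch: the dictionary and function names are hypothetical, and the values are copied from the table, with the reversed flag set for the two chromosomes marked ↓.

```python
# Prunus mume chromosome -> (Lovell chromosome, reversed relative to Lovell?)
MUME_TO_LOVELL = {
    2: (1, True),
    5: (2, False),
    4: (3, False),
    3: (4, False),
    7: (5, False),
    1: (6, False),
    8: (7, False),
    6: (8, True),
}


def mume_to_lovell(mume_chromosome: int) -> str:
    """Describe the Lovell equivalent of a Prunus mume chromosome."""
    lovell, is_reversed = MUME_TO_LOVELL[mume_chromosome]
    note = " (reversed)" if is_reversed else ""
    return f"Lovell chromosome {lovell}{note}"


print(mume_to_lovell(2))  # Lovell chromosome 1 (reversed)
print(mume_to_lovell(5))  # Lovell chromosome 2
```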