The VGNC: expanding standardized vertebrate gene nomenclature

Overview of VGNC data

As of February 2023, there are 111,210 approved genes in VGNC. These are mostly protein coding genes (109,923) but we have approved a small number of pseudogenes (1286) and 1 non-coding RNA. The non-coding RNA gene was originally approved as a protein coding gene but its locus type has since been updated to non-coding RNA. 109,955 genes have been approved in the seven core VGNC species. A summary of the number of genes approved per core species is shown in Table 1. A further breakdown of the numbers of automatically vs. manually approved protein coding genes in the core VGNC species is shown in Fig. 2. VGNC also currently approves cytochrome P450 genes in a number of additional vertebrate species; the total number of genes approved in the 24 non-core species is 1254 (see Additional file 1: Table S1 for a breakdown per species).

Fig. 2

Numbers of automatically (green) and manually (yellow) approved protein coding genes in core VGNC species, as of February 2023. The estimated total numbers of protein coding genes in each genome, as annotated by Ensembl (red) and NCBI (blue), are indicated with lines. Genome assemblies: Chimpanzee, Pan_tro_3.0 (GCA_000001515.5, NCBI) and Clint_PTRv2 (GCA_002880755.3, Ensembl); Cow, ARS-UCD1.2 (GCA_002263795.2); Horse, EquCab3.0 (GCA_002863925.1); Dog, ROS_Cfam_1.0/Dog10K_Boxer_Tasha (GCA_014441545.1); Macaque, Mmul_10 (GCA_003339765.3); Cat, Felis_catus_9.0 (GCA_000181335.4); Pig, Sscrofa11.1 (GCA_000003025.6)

Approved genes are made public and searchable on our website https://vertebrate.genenames.org, which is updated on a daily basis. The full list of approved VGNC genes can be browsed and filtered by species and/or coding status (Fig. 3). Information about each individual gene is displayed on “Symbol Report” pages, which include basic information about the gene, links to the corresponding NCBI and Ensembl gene annotations as well as links to specialist gene databases for that species if present, links to protein resources for the gene product, and links to named orthologs of the gene (Fig. 3).

Fig. 3

Screenshots showing the VGNC homepage, search result page, and example gene symbol report. An example workflow is highlighted in red: Clicking on “Gene symbol reports” in the “Gene data” menu will run a search for all approved VGNC gene entries, shown in the second screenshot. Search results can be further filtered using the options on the left of the search results, and clicking on an individual result will take the user to the symbol report for that gene, shown in the third screenshot

Coverage of human genes with approved orthologs in VGNC

We assessed the proportion of protein coding genes that have been named by the VGNC project, to identify what curation steps are required to name the remainder. As of January 2022, there were 19,220 HGNC-approved protein-coding genes for human, of which 17,853 (93%) had at least 1 named VGNC ortholog (Fig. 4). Of the 1367 without any ortholog approved in VGNC (Fig. 4, Additional file 3: Table S3), the majority fell into the following categories: genes in large families or with copy number variation that required more detailed manual analysis before nomenclature assignment in vertebrate species (Fig. 4, yellow/dotted segments); genes for which the human nomenclature was unsuitable for transfer to other species (Fig. 4, blue/horizontal striped segments); and genes that likely have simple 1:1 orthology relationships across species but did not pass our automated orthology prediction threshold (3 out of 4 orthology assertions in Panther, NCBI Gene, Ensembl Compara, and OMA) (Fig. 4, green/vertical striped segments).
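The automated threshold mentioned above (3 out of 4 orthology assertions) can be illustrated with a minimal sketch. This is not the VGNC pipeline itself: the function name, the data layout, and the gene identifiers are hypothetical, and the rule that a resource only counts as support when it asserts exactly one ortholog is an assumption made here to represent a 1:1 relationship.

```python
from typing import Dict, Set

# The four orthology resources named in the text.
RESOURCES = ("Panther", "NCBI Gene", "Ensembl Compara", "OMA")
MIN_SUPPORT = 3  # threshold stated in the text: 3 out of 4 assertions


def passes_threshold(assertions: Dict[str, Set[str]], candidate: str) -> bool:
    """Return True if at least MIN_SUPPORT resources assert `candidate`
    as the one and only ortholog of the human gene.

    `assertions` maps a resource name to the set of genes it predicts as
    orthologs in the target species (hypothetical layout).
    """
    support = sum(
        1 for resource in RESOURCES
        if assertions.get(resource, set()) == {candidate}
    )
    return support >= MIN_SUPPORT


# Hypothetical example: three resources agree, one is ambiguous.
example = {
    "Panther": {"geneA"},
    "NCBI Gene": {"geneA"},
    "Ensembl Compara": {"geneA", "geneB"},  # ambiguous, not counted as support
    "OMA": {"geneA"},
}
print(passes_threshold(example, "geneA"))
```

Under these assumptions, a gene with only two agreeing resources would fall below the threshold and would be left for manual review.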

Fig. 4

Categorization of human genes with no named orthologs in VGNC. (Left) Pie chart showing proportion of named human protein-coding genes with and without approved orthologs in VGNC, as of January 2022 (n = 19,220). (Right) Pie chart categorizing reasons that no VGNC orthologs were approved for some human genes, as of January 2022 (n = 1367). Complex gene family members = 759 genes; Lineage-specific duplication = 234 genes; Nomenclature issues = 99 genes; Human readthrough annotations = 76 genes (*all 76 genes have since been approved in VGNC); Lacking HCOP support = 54 genes; Closely related human genes = 34 genes; Annotation issues in VGNC species = 19 genes; Not in current human annotation set = 14 genes; Other = 78 genes. See article text for further explanation of each category and Additional file 3: Table S3 for full list of genes, which also indicates genes that have subsequently been approved in VGNC via manual curation
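As a quick consistency check on the figures in this caption, the category counts can be tallied and compared with the two stated totals. The sketch below uses only the numbers given in the caption; the variable names are illustrative.

```python
# Totals and category counts as stated in the Fig. 4 caption.
total_human_protein_coding = 19_220
categories = {
    "Complex gene family members": 759,
    "Lineage-specific duplication": 234,
    "Nomenclature issues": 99,
    "Human readthrough annotations": 76,
    "Lacking HCOP support": 54,
    "Closely related human genes": 34,
    "Annotation issues in VGNC species": 19,
    "Not in current human annotation set": 14,
    "Other": 78,
}

# Genes without any approved VGNC ortholog: sum of the right-hand pie chart.
without_ortholog = sum(categories.values())
# Genes with at least one approved ortholog: the remainder of the left-hand chart.
with_ortholog = total_human_protein_coding - without_ortholog
percent_with = round(100 * with_ortholog / total_human_protein_coding)

print(f"without ortholog: {without_ortholog}")
print(f"with ortholog: {with_ortholog} ({percent_with}%)")
for name, count in categories.items():
    print(f"{name}: {100 * count / without_ortholog:.1f}% of unnamed genes")
```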

In many cases, the absence of a VGNC ortholog of a human gene is due to gene number variation causing uncertain orthology relationships, which is common in gene families that have frequent gene gains and losses throughout vertebrates. Large gene families often require manual curation including phylogenetic analysis of many genes across multiple species in order to assign nomenclature that accurately reflects evolutionary relationships. We found that 759 human protein coding genes did not yet have orthologs approved in VGNC due to their membership in a complex gene family (Fig. 4, “Complex gene family members”). Examples include genes encoding zinc finger containing proteins, keratins, and interferons. Other examples that require manual input from a curator include genes that have undergone lineage specific duplication in humans or primates. We found that 234 human protein coding genes had not yet had a VGNC ortholog approved due to lineage specific duplications (Fig. 4, “Lineage specific duplication”). In all of these cases, manual curation is required to decide what nomenclature is appropriate to reflect the evolutionary relationships.

The VGNC pilot project in which chimpanzee genes were manually approved provided an opportunity to review human gene nomenclature for suitability of use outside of humans. While it is preferable to use the same nomenclature for orthologous genes in different species to enable their quick identification, there are some human genes with nomenclature unsuitable for transfer to other species. This was recognized at the “Gene Nomenclature Across Species” meeting [9] where a key recommendation was that “humanizing” nomenclature in other species should be avoided. Genes with human-centric nomenclature have been reviewed and the gene names updated while the gene symbol has been retained, where possible, often with the agreement of the communities working on them. Examples include human disease-specific gene names such as “malignant fibrous histiocytoma amplified sequence 1” (MFHAS1, HGNC:16982), which was renamed to “multifunctional ROCO family signaling regulator 1” (while retaining the same gene symbol) to make it suitable for use across species, and names that included reference to other species, such as “dispatched homolog 1 (Drosophila)” (DISP1, HGNC:19711) which was renamed to “dispatched RND transporter family member 1”. There are still 99 genes with names referencing human disease that have not yet been renamed, and the orthologs of these genes have therefore not yet been approved in VGNC species (Fig. 4, “Nomenclature issues”).

We found that 54 human protein coding genes appear to have 1:1 orthology across VGNC species but did not pass our orthology prediction threshold for inclusion in the VGNC database, for reasons we could not identify (Fig. 4, “Lacking HCOP support”). These orthology relationships will require further investigation to confirm 1:1 orthology before approval in VGNC. In a further 76 cases, readthrough annotations between adjacent genes on the human reference genome caused orthology prediction tools to fail to find 1:1 orthologs, since the non-human gene was predicted to have two human orthologs: the “true” ortholog and a readthrough annotation containing some or all of the same coding region (Fig. 4, “Human readthrough annotations”). All of these cases have since been manually reviewed and approved in at least one VGNC species, as when the readthrough annotations are disregarded, the genes are clearly 1:1 orthologs.
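The effect of disregarding readthrough annotations can be sketched as a simple filter. This is an illustration under stated assumptions rather than the curation procedure itself: the function name and the gene symbols ("GENEA", "GENEA-GENEB", "GENEC") are hypothetical placeholders, and the set of readthrough loci is assumed to be known in advance.

```python
from typing import List, Optional, Set


def one_to_one_ortholog(predicted_human: List[str],
                        readthrough: Set[str]) -> Optional[str]:
    """Drop readthrough annotations from the predicted human orthologs of a
    non-human gene and return the single remaining gene, or None if the
    relationship is still not 1:1."""
    remaining = [gene for gene in predicted_human if gene not in readthrough]
    return remaining[0] if len(remaining) == 1 else None


# Hypothetical case: the non-human gene is predicted to have two human
# orthologs, the "true" ortholog and a readthrough locus that shares
# some or all of its coding region.
predicted = ["GENEA", "GENEA-GENEB"]
known_readthrough = {"GENEA-GENEB"}
print(one_to_one_ortholog(predicted, known_readthrough))
```

With the readthrough locus removed, a single candidate remains, which mirrors why these cases could be approved after manual review.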

Nineteen human genes appear to have orthologs in VGNC species but problems with the gene annotations in Ensembl and NCBI meant that they have not been automatically approved via the VGNC pipeline (Fig. 4, “Annotation issues in VGNC species”) and will not be approved until there is at least one suitable gene annotation to link to the VGNC entry. Thirty-four human genes appear to be single copy in all VGNC species but have closely related paralogs and so the orthology prediction resources could not distinguish between orthologs and paralogs across species (Fig. 4, “Closely related human genes”); these will require careful review across species before nomenclature assignment. Fourteen human genes that have been named by HGNC are not annotated on the current reference genome and so orthology prediction resources do not include these genes in their datasets (Fig. 4, “Not in current human annotation set”).

Further, less common, reasons for a gene having no VGNC ortholog approved were combined into a final category of “Other” (Fig. 4). This includes genes for which there is no consensus on the locus type in humans between Ensembl, NCBI, and HGNC, i.e., it is unknown whether the gene is protein-coding or not. The “Other” category also includes a small number of genes in complex immune-related families where 1:1 orthologs do not generally exist across these species, such as the killer immunoglobulin-like receptors and major histocompatibility complex genes. Although other species have members of these gene families, there is no 1:1 orthology and so unique gene symbols will be approved in each species [19].

Naming pseudogenes in VGNC

The VGNC has not yet prioritized the systematic naming of pseudogenes across multiple species; however, there are some specific examples of pseudogenes receiving approved nomenclature: large gene families such as the olfactory receptors, cytochrome P450s, and histones have a significant proportion of pseudogenes, and any pseudogenes within these families with NCBI/Ensembl annotations, that have also been manually curated by our expert collaborators or VGNC curators, have been named. The VGNC database also includes some pseudogenes that were initially approved as protein coding but their gene models have since been updated to pseudogenes.

An area of particular interest for the VGNC has been approving orthologs of genes that are pseudogenized in humans but coding in other species (so-called “unitary” pseudogenes). These are genes that would otherwise not receive approved nomenclature via automated means, because vertebrate gene naming is often based on the human ortholog and orthology prediction algorithms do not typically include pseudogenes in their data sets, and therefore vertebrate orthologs need to be manually identified. There are currently 274 HGNC-approved human pseudogenes that have protein coding orthologs in other species and have been named as such. The majority of these pseudogenes were initially named relative to mouse protein coding orthologs. To date, we have approved nomenclature for multiple orthologs of 104 human unitary pseudogenes (Additional file 4: Table S4) and will continue to prioritize these genes in our manual curation. For example, the chymosin (CYM) gene, which encodes a protease also known as “rennin,” is pseudogenized in primates [20]. Our manual curation allowed the coding orthologs to receive approved nomenclature in all non-primate core VGNC species (Fig. 5A). Another example is cytidine monophospho-N-acetylneuraminic acid hydroxylase (CMAH), which in most mammals encodes an enzyme that hydroxylates N-acetylneuraminic acid to N-glycolylneuraminic acid. CMAH is pseudogenized in humans [21], which has been postulated to have evolutionarily contributed to humans’ higher endurance ability [22] and predisposition to atherosclerosis [23]. CMAH is also actively studied in the context of xenotransplantation—since it is not active in human, xenotransplant tissue from species with an intact CMAH gene may trigger antibody-mediated rejection when implanted in humans [24]. We have manually curated the nomenclature for the coding CMAH gene in all 7 VGNC core species (Fig. 5B).

Fig. 5

Examples of manually curated VGNC orthologs of human pseudogenes. Simplified synteny diagrams (gene models not to scale) illustrate the synteny comparisons that are made by VGNC curators when curating these orthologs across species. Gene models colored green indicate protein-coding genes, purple gene models indicate pseudogenes. (A) Chymosin (CYM) is pseudogenized in primates, but coding in other VGNC species. (B) Cytidine monophospho-N-acetylneuraminic acid hydroxylase (CMAH) is pseudogenized in human, and coding in all VGNC species

Gene groups in VGNC

The VGNC recently introduced a feature called “Gene Groups” which has been a part of the HGNC database for over 20 years [16]. Human genes are grouped based on shared characteristics such as homology, structure, common functions and/or phenotypes, and protein complex membership. We have introduced a subset of these Gene Groups to VGNC where we have completed considerable VGNC curation of large gene families, i.e., the olfactory receptors, keratins, histones and cytochrome P450s [16]. Our curation of the keratins was largely based on a publication that characterized this gene group in dog and horse [25]. The histones have been named in collaboration with histone experts as reported in our publication of a standardized nomenclature for mammalian histone genes [26]. Cytochrome P450 nomenclature was assigned in collaboration with experts; as well as the 7 core VGNC species, nomenclature has also been assigned to CYP genes in a further 24 species. Gene Group reports allow visualization and navigation of hierarchical groups in complex gene families (Fig. 6).

Fig. 6

Example of the VGNC gene group hierarchy navigator for Olfactory receptor family 14. Gene group hierarchies are displayed on gene group pages and show the parent and child groups of the gene group of interest. The current gene group is highlighted in orange. In this example, we can see that Olfactory receptor family 14 has the parent group “Olfactory receptors” and 6 child groups representing its subfamilies. The groups shown in the hierarchy diagram are clickable and so can be used to navigate through a hierarchical gene group. The user can also choose to enable “rearrange mode” and click and drag to reposition the groups in the hierarchy
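A hierarchy of this kind can be modeled as a tree in which each group records its parent and its children. The sketch below is an illustrative data structure only, not the VGNC implementation: the class and method names are hypothetical, and the six subfamily names are placeholders, since the caption does not list them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneGroup:
    name: str
    # Parent is excluded from repr/eq to avoid walking parent <-> child cycles.
    parent: Optional["GeneGroup"] = field(default=None, repr=False, compare=False)
    children: List["GeneGroup"] = field(default_factory=list)

    def add_child(self, name: str) -> "GeneGroup":
        """Create a child group, link it to this group, and return it."""
        child = GeneGroup(name, parent=self)
        self.children.append(child)
        return child

    def path_to_root(self) -> List[str]:
        """Names from this group up to the top-level group."""
        path = []
        node: Optional["GeneGroup"] = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


# Parent and current group as named in the caption.
root = GeneGroup("Olfactory receptors")
family_14 = root.add_child("Olfactory receptor family 14")
# Six child groups; the names here are placeholders.
for i in range(1, 7):
    family_14.add_child(f"family 14 subfamily placeholder {i}")

print(family_14.path_to_root())
print(len(family_14.children))
```

Clicking through the navigator corresponds to following the `parent` and `children` links in such a structure.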

VGNC data dissemination

VGNC-approved nomenclature is automatically imported and displayed by the NCBI [13], Ensembl [12], and UniProt [27] databases. This ensures that once a gene has been approved in VGNC it has a consistent gene symbol and name across these resources and is visible to the community even if they are not VGNC website users. An additional benefit of this nomenclature dissemination is that as more accurate nomenclature is included for key vertebrate species in NCBI and Ensembl, the more likely it is that the appropriate nomenclature will be assigned in these databases’ automated nomenclature pipelines even for non-VGNC curated species, as these pipelines are based on orthology between species.

Discussion and future plans

The VGNC has approved nomenclature for over 100,000 genes in a variety of vertebrate species, and the approved nomenclature is disseminated widely via major biological databases. The benefits of approved nomenclature extend beyond just the species included in the VGNC, as both NCBI and Ensembl use homology to automatically project nomenclature to other species in their databases. The VGNC project has also led to large scale revision of nomenclature for complex gene families. For example, completely independent naming systems were in place for the olfactory receptor genes in different species [17], which made it impossible to determine orthology and paralogy relationships based on nomenclature. Our ongoing efforts to harmonize olfactory receptor gene nomenclature across species will make homology relationships obvious at a glance.

VGNC manual curation has resulted in improvements to gene nomenclature that would not have been possible using current automated techniques. Several species-specific duplications of genes or regions have been identified and assigned novel nomenclature: for example, there has been a tandem duplication in the human lineage leading to duplication of the matrix metallopeptidase 23 (MMP23) gene and subsequent pseudogenization of one of the copies. The intact human gene has the symbol MMP23B and the pseudogene has the symbol MMP23AP. Since orthology prediction algorithms typically only include coding sequences, the single copy orthologs in other species were being automatically assigned identical nomenclature to human MMP23B in other resources. VGNC manual curation allowed identification of this issue and thus the non-human orthologs have now been correctly approved as MMP23. Similarly, VGNC has now manually approved nomenclature for at least 104 orthologs of human pseudogenes that would not have received approved nomenclature by other means (Additional file 4: Table S4).

Manual curation has also led to corrections in data beyond gene nomenclature. It has been possible to identify issues with automatically predicted gene models in NCBI and Ensembl annotation sets such as merging of neighboring genes or fragmented gene models. Correspondence with RefSeq curators at NCBI has allowed for periodic review and correction of gene models in their database which then allows VGNC nomenclature to be approved and linked to a corrected NCBI Gene ID. For example, as part of the VGNC olfactory receptor (OR) curation project [17], VGNC curators made note of where NCBI RefSeq OR gene annotations did not match the predicted gene models curated by our expert OR advisors; we subsequently provided our curated OR gene data to RefSeq and this was used to generate updated gene models for olfactory receptors in RefSeq for dog, horse, and cattle.

Challenges faced in the VGNC project include the use of different genome assembly versions between NCBI and Ensembl, making it more difficult for curators to compare gene models and synteny in the two different versions since coordinates and annotations differ across assemblies. For example, NCBI and Ensembl have been annotating different versions of the chimpanzee genome since 2018. We currently do not approve gene nomenclature if we cannot link to a suitable gene model in either NCBI or Ensembl, and at present there is no provision within Ensembl to make routine manual corrections to gene models in species outside of human and mouse.

A large majority of the genes approved in VGNC are those for which orthology was easily determined using the approach described, and which could therefore be automatically approved or quickly manually approved. Manual curation is time consuming and hence our approach so far has been to concentrate our efforts to maximize the number of genes approved. More recently, we have focused on complex gene families, taking a multidisciplinary approach to assign nomenclature across multiple species. This often occurs in collaboration with other nomenclature authorities such as the MGNC, RGNC, CGNC, Xenbase, and ZNC as previously described. These efforts will be coupled with expansion of the Gene Groups feature in VGNC, including the addition of both more Gene Groups and their members.

The VGNC’s remit to date has been to assign nomenclature to coding genes, but in future we intend to explore the naming of non-coding genes, including pseudogenes and non-coding RNA genes. This will likely be limited to non-coding genes that are either highly conserved across species or have been characterized in the literature. Nomenclature approval for non-coding RNA genes will begin with microRNA genes. Human microRNA genes are currently assigned gene symbols as a result of a long-standing collaboration between the HGNC and miRBase [28]. MicroRNA identifiers are provided by miRBase and follow the format mir-# (e.g., mir-17) for the stem loop and miR-# (e.g., miR-17) for the mature miRNA, while HGNC approves the format MIR# for the encoding gene (e.g., MIR17). Equivalent gene symbols are currently approved for mouse and rat microRNA genes by MGNC and RGNC; the mouse and rat orthologs of human MIR17 have the gene symbol Mir17. In future, we will look into incorporating microRNA orthology predictions to approve symbols for microRNA genes that are orthologous to human microRNA genes for our seven core species. We will also explore approving symbols for long non-coding RNA (lncRNA) genes that have HGNC-curated mouse orthologs and have approved unique symbols derived from publications. This would be a small number of lncRNA genes and we would not expect that adequately annotated orthologs would be present in all key species to allow approval of VGNC symbols.
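The relationship between the stem loop identifier format and the gene symbol formats described above can be sketched as a small conversion function. The function name and the `style` parameter are hypothetical, and the sketch deliberately handles only the simple mir-# pattern given in the text; real identifiers may carry additional prefixes or suffixes that are not covered here.

```python
import re


def gene_symbol(stem_loop_id: str, style: str = "human") -> str:
    """Convert a stem loop identifier of the form mir-# into a gene symbol.

    style="human"  gives the MIR# format (e.g. MIR17).
    style="rodent" gives the Mir# format used for mouse and rat (e.g. Mir17).
    """
    # Case-sensitive match: the mature form miR-# is intentionally rejected.
    match = re.fullmatch(r"mir-(\d+)", stem_loop_id)
    if match is None:
        raise ValueError(f"not a simple stem loop identifier: {stem_loop_id!r}")
    number = match.group(1)
    if style == "human":
        return f"MIR{number}"
    if style == "rodent":
        return f"Mir{number}"
    raise ValueError(f"unknown style: {style!r}")


print(gene_symbol("mir-17"))
print(gene_symbol("mir-17", style="rodent"))
```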

Other future improvements we have planned for the VGNC include the development of tools to improve curation efficiency, for example, based on synteny across multiple species. We also plan to implement additional quality control tools to allow curators to quickly identify data changes that affect approved gene nomenclature, for example, when gene annotation identifiers are changed in NCBI and Ensembl.
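One way such a quality control check could work is to compare two snapshots of the annotation links held for each approved gene and report any entry whose linked identifiers have changed or disappeared. The sketch below is an assumption about a possible design, not a description of a planned VGNC tool; the data layout, function name, and all identifiers are hypothetical placeholders.

```python
from typing import Dict, Optional, Tuple

# (NCBI Gene ID, Ensembl gene ID); either may be missing.
Links = Tuple[Optional[str], Optional[str]]
# Hypothetical snapshot layout: VGNC entry ID -> linked annotation IDs.
Snapshot = Dict[str, Links]


def changed_links(old: Snapshot,
                  new: Snapshot) -> Dict[str, Tuple[Links, Optional[Links]]]:
    """Return entries whose linked identifiers differ between snapshots.

    The value is (old links, new links); new links is None when the entry
    is absent from the newer snapshot.
    """
    changes = {}
    for vgnc_id, old_links in old.items():
        new_links = new.get(vgnc_id)
        if new_links != old_links:
            changes[vgnc_id] = (old_links, new_links)
    return changes


# Placeholder identifiers for illustration only.
previous = {
    "entry-1": ("ncbi-111", "ens-AAA"),
    "entry-2": ("ncbi-222", "ens-BBB"),
}
current = {
    "entry-1": ("ncbi-111", "ens-AAA"),
    "entry-2": ("ncbi-222", "ens-CCC"),  # Ensembl identifier has changed
}
print(changed_links(previous, current))
```

Entries flagged in this way could then be queued for curator review.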
