Curating genomic disease-gene relationships with Gene2Phenotype (G2P)

LGMDE threads

The LGMDE thread is described here in detail, including definitions, terminology, data sources, and utilisation in the curation process.

Locus

The locus is typically a gene or can be a genomic interval (defined as chromosome:genomic coordinates for a given reference genome assembly). For genes, the HUGO Gene Nomenclature Committee (HGNC) [10] symbol is used for ease of reference. The symbol is mapped to the relevant stable numerical HGNC ID.
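
As an illustration, a locus could be held as a small record that is either a gene (symbol plus stable HGNC ID) or a genomic interval. This is a minimal sketch only: the class names, field names and the `chromosome:start-end` string layout are assumptions made for the example, not the G2P schema, and the HGNC ID is supplied by the caller rather than looked up.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class GeneLocus:
    symbol: str   # HGNC symbol, used for ease of reference
    hgnc_id: int  # stable numerical HGNC ID


@dataclass(frozen=True)
class IntervalLocus:
    assembly: str    # reference genome assembly the coordinates refer to
    chromosome: str
    start: int
    end: int


# A locus is either a gene or a genomic interval.
Locus = Union[GeneLocus, IntervalLocus]


def parse_interval(text: str, assembly: str) -> IntervalLocus:
    """Parse 'chromosome:start-end' (hypothetical layout) into an IntervalLocus."""
    chromosome, _, coords = text.partition(":")
    start_text, _, end_text = coords.partition("-")
    if not chromosome or not start_text or not end_text:
        raise ValueError(f"expected 'chromosome:start-end', got {text!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError("interval start must not exceed end")
    return IntervalLocus(assembly, chromosome, start, end)
```

Keeping the assembly on the interval record reflects the point above that coordinates are only meaningful for a given reference genome assembly.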

Genotype

Allelic requirement

Standardised Human Phenotype Ontology (HPO) [11] allelic requirement terms are used. These have corresponding Mendelian inheritance terms; for example, monoallelic_autosomal corresponds to Autosomal dominant (HP:0000006). In general, genes which have been associated with multiple allelic requirements for a given disease require separate G2P entries. This would apply, for example, to monoallelic_autosomal and biallelic_autosomal disorders. X-linked conditions which are usually penetrant in males and recessive in females may be recorded as monoallelic_X_hemizygous. X-linked disease where heterozygous females and hemizygous males have similar phenotypes, for example in relation to SHOX and SMC1A variants, is recorded as monoallelic_X_heterozygous. However, it is recognised that this distinction may be difficult in practice, and separate entries can be used after discussion by the curation group.
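
The rule that each allelic requirement needs its own entry can be sketched as a grouping step. This is illustrative only: the function name and the tuple layout of the input are hypothetical. The monoallelic_autosomal mapping is taken from the text, while the biallelic_autosomal mapping to Autosomal recessive (HP:0000007) is added here from HPO. The X-linked terms are left out of the lookup because their Mendelian equivalents are not stated above.

```python
from collections import defaultdict

# Allelic requirement -> (Mendelian inheritance term, HPO ID).
MENDELIAN_TERM = {
    "monoallelic_autosomal": ("Autosomal dominant", "HP:0000006"),
    "biallelic_autosomal": ("Autosomal recessive", "HP:0000007"),
}


def split_entries(reports):
    """Group publications into one entry per (gene, disease, allelic requirement).

    `reports` is an iterable of (gene, disease, allelic_requirement, publication)
    tuples; this input layout is an assumption made for the example.
    """
    entries = defaultdict(list)
    for gene, disease, requirement, publication in reports:
        entries[(gene, disease, requirement)].append(publication)
    return dict(entries)
```

Because the allelic requirement is part of the key, a gene reported with both monoallelic and biallelic forms of a disease yields two entries rather than one merged record.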

Cross-cutting modifier

Optional cross-cutting modifiers give extra information for a gene-disease relationship. The terms used largely correspond to children of “Inheritance qualifier” (HP:0034335), for example, “Typically de novo” (HP:0025352). A subset of these HPO terms is included in the curation template. These have been chosen to focus curation on collection of data most relevant to diagnostic filtering.

Two additional cross-cutting modifiers not representing inheritance information are defined. The first is for alerting the user to potential secondary findings (including American College of Medical Genetics Secondary Findings [12] and/or late onset conditions). The second modifier is for a “Restricted Variant Set”. This may include, for example, where diseases are associated with a single recurrent variant, or variants only found in a particular protein domain. Further information regarding the specific variant set is recorded elsewhere in the curation data.

Types of variants reported

A comprehensive list of Sequence Ontology (SO) [13, 14] terms is used to record the types of variants reported in association with a disease entity, for example “frameshift_variant” (SO:0001589) [13]. Additional information is recorded for each variant type, including whether it is reported as de novo and/or inherited and/or of unknown inheritance, whether it is predicted to escape or trigger nonsense-mediated decay (NMD), and the gene domain/genomic region affected. This is important for filtering genomic data, as well as in defining the mechanism of disease.
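
One way to picture the per-variant-type annotations is as a small validated record. This is a sketch under stated assumptions: the class name, field names and the allowed string values are hypothetical choices for the example and do not come from the G2P template.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical controlled values for the example.
INHERITANCE_VALUES = {"de_novo", "inherited", "unknown"}
NMD_VALUES = {"escape", "trigger"}


@dataclass
class VariantTypeRecord:
    so_id: str                                      # e.g. "SO:0001589"
    so_name: str                                    # e.g. "frameshift_variant"
    inheritance: set = field(default_factory=set)   # any combination is allowed
    nmd: Optional[str] = None                       # predicted NMD outcome, if assessed
    region: Optional[str] = None                    # gene domain / genomic region

    def __post_init__(self):
        unexpected = set(self.inheritance) - INHERITANCE_VALUES
        if unexpected:
            raise ValueError(f"unknown inheritance values: {sorted(unexpected)}")
        if self.nmd is not None and self.nmd not in NMD_VALUES:
            raise ValueError(f"unknown NMD value: {self.nmd!r}")
```

Inheritance is modelled as a set because a variant type may be reported as de novo in some individuals and inherited in others.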

Protein view

A snapshot of the DECIPHER [15] protein view for the relevant gene is included (Fig. 2). DECIPHER is a web platform developed to enable the annotation and sharing of anonymised phenotype-linked variants. It integrates essential genomics resources and provides visualisations and interactive tools to facilitate variant interpretation.

Fig. 2

DECIPHER protein view for the gene REST [15]. Red (likely loss of function) filled variants in the ClinVar variants track, on the left side of the image, are annotated as pathogenic (squares)/likely pathogenic (triangles). These are associated with predisposition to Wilms tumour [16]. The Predicted NMD Escape track is not filled in this region, indicating these variants are likely to result in NMD. Red/orange areas show areas of Regional Missense Constraint in the corresponding track. Missense variants in the DNA binding domain—shown by the green oval—in this region have also been reported in association with Wilms tumour [16]. The filled red triangles/squares on the right of the image in the ClinVar track are associated with gingival fibromatosis [17]. These are in the final exon and hence predicted not to result in NMD as shown in the Predicted NMD Escape track. This is therefore likely to be a different disease mechanism than for Wilms tumour. There are relatively few gnomAD loss of function variants and none of these correspond to the likely pathogenic/pathogenic ClinVar variants. The DECIPHER protein view allows for all these sources of information to be visualised at once. Further data is available through interaction with the web-based interface at www.deciphergenomics.org, which is regularly updated. Data is pulled from multiple bioinformatic resources. NMD, Nonsense Mediated Decay

The DECIPHER protein view for a given gene is an information-dense graphical representation including, for example, exon boundaries, predicted NMD escape regions, conservation levels, missense constraint and protein domains. Annotation includes gnomAD [18] variants from global populations, as well as DECIPHER and ClinVar [19] variants from diagnostic testing (including reported pathogenicity).

The use of the DECIPHER protein view allows curators to easily put individual-level gene-disease information from publications into a wider context, encompassing molecular/gene-level data, disease-associated/diagnostic variation and population-level normal variation (Fig. 2).

The function of the gene as defined by UniProt [20] is also shown in DECIPHER [15, 20]. This is used in the curation process to determine if the proposed gene-disease relationship is likely to be biologically plausible.

DECIPHER [15] is a constantly updated resource. The live website interface is often used in the curation process to interrogate the above data further. However, a snapshot of the protein view used at the time of curation is recorded for reference, to enable comparison with the most recent data at the time of re-curation.

Variant consequence per allele

For a given allelic requirement, SO terms for the variant consequence are used [13, 14]. This may be recorded as inferred or evidence based. Inferred includes, for example, when a variant is predicted to result in nonsense-mediated decay. Evidence based is usually used for biological functional studies such as demonstration of absent protein expression. Computational evidence may also be taken into account, such as modelling of protein structure. Terms for both altered protein (for protein-coding genes) and altered RNA level (for non-protein-coding genes) are included. For example, a missense variant may be recorded as altered_gene_product_structure (SO:0002318).

As SO terms are part of an ontology, higher level terms may be inferred for a recorded variant consequence. For example, absent_gene_product (SO:0002317) is a child of decreased_gene_product_level (SO:0002316), which itself is a child of altered_gene_product_level (SO:0002314).
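
The inference of higher-level terms can be sketched as a walk up child-to-parent links. The sketch below encodes only the three terms named above and assumes each term has a single parent, which is a simplification (ontology terms can have more than one); the function name is hypothetical.

```python
# Child -> parent links, limited to the three terms named in the text.
PARENT = {
    "SO:0002317": "SO:0002316",  # absent_gene_product -> decreased_gene_product_level
    "SO:0002316": "SO:0002314",  # decreased_gene_product_level -> altered_gene_product_level
}


def inferred_terms(term_id: str) -> list[str]:
    """Return the higher-level terms implied by term_id, nearest parent first."""
    ancestors = []
    while term_id in PARENT:
        term_id = PARENT[term_id]
        ancestors.append(term_id)
    return ancestors


print(inferred_terms("SO:0002317"))  # ['SO:0002316', 'SO:0002314']
```

A record annotated with absent_gene_product can therefore also be retrieved by a query for the broader altered_gene_product_level term.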

Mechanism

Enabling a precise definition of the mechanism of disease to be captured is a crucial function of G2P. This follows the definitions laid out by Backwell and Marsh [21]. The mechanism is initially recorded in broad categories depending on the protein level consequence of variants reported. These include ‘Loss of function’, ‘Dominant negative’ and ‘Gain of function’. There are also categories for ‘Undetermined’ and ‘Undetermined non-loss-of-function’, reflecting the fact that there are often cases where a mechanism is clearly not a loss of function, but where it is difficult to distinguish between dominant-negative and gain-of-function effects from available evidence. It is recorded whether these are inferred or from functional evidence. The level of functional evidence is not currently directly quantified, e.g. by number of publications. The ‘evidence’ assertion is made by consensus agreement within the curation group, taking into account the quality and quantity of functional data. For example, are the analyses reported biologically relevant to the gene, disease and putative mechanism? Are the results replicated by separate groups?

Synopsis of mechanism

As per Backwell and Marsh [21], the complex effects of disease-associated variants are not fully captured using broad terms such as ‘loss of function’ alone. Therefore, G2P records a more detailed synopsis of the inferred or evidence-based molecular mechanism. For example, ‘Destabilising loss of function’ or ‘Interaction-disrupting loss of function’. There is often insufficient evidence for newly described disorders to record this information. This may be completed when functional studies become available for a particular condition/variant(s).

The process of determining the likely mechanism for a given disorder is complex. Multiple lines of evidence are reviewed, where available. These may include, for example, observations of variant clustering and in vitro/in vivo functional assays. Tools predicting likely mechanism may be used for guiding assessments, although these should not be relied upon alone [22]. MaveDB is interrogated for relevant multiplexed assays of variant effect (MAVEs) [9]. Free text fields are currently used to record analysis of this evidence during the curation process. For example, it is critical to evaluate if a MAVE assay is relevant to the disorder being curated, especially whether the multiplexed assay truly reflects the mechanism of disease in vivo. Hence, we have prioritised curation of mechanism in G2P. Curation discussion may include information such as the functional domains assayed, how well it replicates the mechanism of disease for the stated gene-disease pair, and which tissue/cell-line is relevant.

Disease entity

Clinical phenotype

The reported clinical phenotype is recorded per publication. This includes the number of families/individuals reported, including information on consanguinity and/or ethnicity if relevant. G2P records phenotype data in the form of HPO terms [11], which are standardised and machine readable. However, it is also useful for curation purposes to record descriptive free text regarding the phenotype. For example, the proportion of individuals reported with a given phenotypic feature (i.e. variable expressivity), whether the phenotype is clinically distinctive and/or consistent, and if there is evidence for incomplete penetrance. The clinical phenotype is of crucial importance in determining the confidence level for a given gene-disease association. For example, whilst there is a phenotypic spectrum for any genetic condition, it is usually the case that core features would be shared across virtually all individuals for a newly described disorder. Reported incomplete penetrance is reviewed critically; there may be an alternate genetic or other explanation for the described phenotype. On the other hand, if all reported individuals share a rare phenotypic feature, this puts more weight towards a true gene-disease association. At present in G2P, this is a subjective judgement relying on clinical experience, given the difficulties in quantifying and comparing phenotypes across persons and diseases using standardised terminology [23].

Disease name

Genetic disease naming is a complex topic, in part reflecting the evolution in knowledge from clinical descriptions to molecular diagnosis. Conditions which have been well-defined clinically in the past may be known by an eponymous name such as Noonan syndrome [24]. However, this naming system does not reflect the molecular basis of disease. This is especially true for conditions such as Noonan, where a similar phenotype is now known to result from variants in multiple genes. The dyadic naming system suggested by Biesecker et al. [25] aims to address these issues by including a gene symbol with a phenotypic descriptor—for example, PTPN11-related Noonan syndrome. Ideally, a precise clinically relevant phenotypic descriptor is used. For example, AMOTL1-related orofacial clefting, cardiac anomalies, and tall stature.

G2P records disease names following this dyadic approach. However, the process of naming a condition is not straightforward. We recognise that an international collaborative approach is needed to address this topic, as other curation resources may define diseases differently, or use an alternate naming convention [7]. If a disorder has been named—in a compatible format—by another group, G2P aims to use this name, to enhance standardisation across resources. In some cases, the gene symbol may be added to an existing disease name to maintain the dyadic structure. Mapping to other resources is added, where available, for example to OMIM (Online Mendelian Inheritance in Man) Morbid IDs and Mondo IDs [26, 27]. Disease synonyms from these and other sources are also included.
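
The dyadic format, and the step of adding a gene symbol to an existing disease name, can be sketched as simple string handling. The function names are hypothetical, and the `-related` connector is inferred from the examples given above rather than from a stated G2P rule.

```python
def dyadic_name(gene_symbol: str, descriptor: str) -> str:
    """Build '<gene symbol>-related <phenotypic descriptor>'."""
    gene_symbol = gene_symbol.strip()
    descriptor = descriptor.strip()
    if not gene_symbol or not descriptor:
        raise ValueError("both a gene symbol and a descriptor are required")
    return f"{gene_symbol}-related {descriptor}"


def ensure_dyadic(existing_name: str, gene_symbol: str) -> str:
    """Prefix an existing disease name with the gene symbol unless already present."""
    prefix = f"{gene_symbol.strip()}-related "
    name = existing_name.strip()
    return name if name.startswith(prefix) else dyadic_name(gene_symbol, name)


print(dyadic_name("PTPN11", "Noonan syndrome"))  # PTPN11-related Noonan syndrome
```

`ensure_dyadic` leaves a name unchanged if it already carries the gene prefix, so applying it twice does not duplicate the symbol.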

G2P is updated at least monthly with information from the latest research publications. Many newly defined conditions do not have a recognised disease name. In this case, the curation group agrees on a dyadic name reflecting the most pertinent phenotypic features and plans to submit these for Mondo accessioning to enable reuse.

Agreed confidence category

A confidence attribute is assigned to each G2P entry to indicate the likelihood that the gene-disease association is true. G2P now uses the gene-disease validity terms developed by GenCC [7]. Gene-disease associations curated as ‘Definitive’, ‘Strong’ and ‘Moderate’ are used by several groups in clinical reporting, for example in the DDD study [3]. Assertions with the confidence term ‘Limited’ are excluded from clinical reporting; however, variants found in this group may be useful in the research setting and they may be promoted to a higher confidence level as further evidence becomes available. ‘Disputed’ and ‘Refuted’ are also used in G2P to indicate previously reported gene-disease links that should now be excluded from research or clinical use.
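
How these categories might drive diagnostic filtering can be sketched as follows. This is a hedged illustration: the dictionary layout with `gene` and `confidence` keys and the function name are assumptions for the example. The filter is written as an exclusion list built from the three categories named above as unsuitable for clinical reporting. A production filter would more likely use an explicit allow-list so that unrecognised categories are not passed through.

```python
# Categories described above as excluded from clinical reporting.
EXCLUDED_FROM_CLINICAL_REPORTING = frozenset({"limited", "disputed", "refuted"})


def clinically_reportable(entries):
    """Keep entries whose confidence category is not excluded from clinical reporting.

    Each entry is a dict with 'gene' and 'confidence' keys (hypothetical layout).
    """
    return [
        entry
        for entry in entries
        if entry["confidence"].strip().lower() not in EXCLUDED_FROM_CLINICAL_REPORTING
    ]
```

Entries dropped by this filter are not discarded from the resource; as noted above, ‘Limited’ associations may still be useful in research and can be promoted later.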

Panel

G2P is grouped into broad panels, which each focus on a disease grouping or defined category of clinical presentation of relevance to the clinical diagnosis of Mendelian disease. The curation structure outlined here is presently used by the DDG2P curation group, although it is anticipated other G2P panels will adopt it in future.

Evidence

Links to the original peer-reviewed publications analysed during the curation process are recorded, generally in the form of PubMed ID [8] and title for ease of reference. Manuscripts from non-peer-reviewed sources such as medRxiv are not included, except in exceptional circumstances.

Implementation

The template and process described here are used in regular DDG2P curation meetings. Disorders are assessed by individual curators and the template is populated. This is then brought to the curation meeting for discussion by the group, which always includes at least one clinical geneticist. Input from outside experts on particular conditions may also be sought, for example where the molecular mechanism is particularly complex.
