Proteins are encoded by DNA and RNA nucleotides in cis phase (haplotypic phase). Therefore, when two or more non-synonymous nucleotide variations are present in a sequence, it is not possible to assign protein sequences unless the phase is established.
The word haplotype was introduced into the scientific lexicon by world-renowned geneticist and HLA pioneer Ruggero Ceppellini in 1967 at the Third Histocompatibility Workshop in Turin, Italy (Ceppellini et al., 1967), to describe alleles in the HLA system at the neighbouring HLA-A and B loci, which segregated together in family studies. Ceppellini stated ‘a new term can be introduced without increasing confusion, it is suggested to substitute pheno-group with haplotype’ (Petersdorf, 2017).
While the word haplotype still has an unambiguous meaning to those working primarily with the Major Histocompatibility Complex (MHC), the word has taken on a new meaning over the succeeding years, when describing polymorphisms of non-MHC genes. The original definition of haplotypes has been changed over the succeeding 60 years to the extent that two SNPs in one allele are now referred to as a haplotype (Davidson, 2000). In order to reduce any confusion that may arise over the use of the word ‘haplotype’, in most instances, I will refer to two linked polymorphisms as in genetic cis phase. Where two polymorphisms have been shown to be on homologous chromosomes, they are said to be in trans phase.
The phase of any set of genetic polymorphisms, whether two or more SNPs or two alleles, is critical to understand in order to fully appreciate the effect that genetic polymorphism can have on function. The fundamental reason for this is that proteins are coded for by gene sequences in cis phase. A genotype merely gives the sum of polymorphisms that exist in two inherited gene sequences, but tells us nothing about the functional protein sequences.
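The ambiguity of an unphased genotype can be illustrated with a minimal Python sketch. The function name and the toy alleles are hypothetical and chosen only for illustration; the sketch simply enumerates every pair of haplotypes that is consistent with the same genotype.

```python
from itertools import product


def possible_haplotype_pairs(genotype):
    """Enumerate the distinct haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-tuples, one per polymorphic site, holding the two
    alleles observed at that site. The order within a tuple carries no
    phase information.
    """
    pairs = set()
    # Fix the orientation of the first site, then try both orientations
    # of every remaining site.
    first, rest = genotype[0], genotype[1:]
    for flips in product([False, True], repeat=len(rest)):
        hap1, hap2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            if flip:
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return pairs


# Two heterozygous sites (A/G and C/T): the genotype alone cannot
# distinguish the two phase configurations.
print(sorted(possible_haplotype_pairs([("A", "G"), ("C", "T")])))
# [('AC', 'GT'), ('AT', 'GC')]
```

With k heterozygous sites there are 2^(k-1) possible configurations, each of which may correspond to a different pair of encoded protein sequences.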
I will attempt in this review to describe methods for assessing phase, their advantages and drawbacks, and then provide three examples in clinical medicine where I believe phase is critical in ascertaining true effects of therapy, but where phasing is currently not routinely performed.
2 ASSESSING GENETIC PHASE-CURRENT APPROACHES
2.1 Family studies
Family studies to determine gene segregation within a nuclear family context are still the ‘gold standard’, having been used since the early 1970s to demonstrate linkage of HLA-A and B alleles (Ceppellini & Van Rood, 1974). Despite the unambiguous results obtained, family studies are generally not performed for non-MHC genes, particularly in a research setting, mainly because of the logistics and cost involved.
2.2 Linkage disequilibrium (LD)
When two genes are closely linked and their alleles occur together more often than is expected from their individual allele frequencies, the alleles are said to be in positive LD. LD can therefore be used to predict genetic phase in the absence of family studies.
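As a minimal sketch of the arithmetic behind this definition, the standard LD coefficients for two biallelic loci can be computed from the frequency of one haplotype and the two allele frequencies. The function names and numerical values below are illustrative assumptions, and the phase estimate additionally assumes random mating.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for alleles A and B at two biallelic loci.

    p_ab: frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # positive when A and B co-occur more than expected
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def cis_probability(p_ab, p_a, p_b):
    """Probability that a double heterozygote (A/a, B/b) carries A and B in cis.

    Assumes random mating, so the two haplotypes of an individual are drawn
    independently from the population haplotype frequencies.
    """
    f_AB = p_ab
    f_Ab = p_a - p_ab
    f_aB = p_b - p_ab
    f_ab = 1 - p_a - p_b + p_ab
    cis = f_AB * f_ab    # configuration AB / ab
    trans = f_Ab * f_aB  # configuration Ab / aB
    total = cis + trans
    return cis / total if total > 0 else float("nan")


# Toy values: haplotype AB at 0.30, allele A at 0.40, allele B at 0.50.
d, d_prime, r2 = ld_measures(0.30, 0.40, 0.50)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
print(round(cis_probability(0.30, 0.40, 0.50), 3))
```

In this toy case the cis configuration is the more likely one, but the estimate remains a probability rather than an observation of phase.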
Such a mathematical approach has been used successfully in the HLA system to calculate the probability of haplotypes within the US Bone Marrow Registry (Mori et al., 1997). At the time of the analysis, there were in excess of 1.35 million HLA typed donors registered. By examining the A, B, DR phenotypes of 406,503 donors, they observed 302,867 distinguishable HLA-A, -B, -DR phenotypes. Such large numbers permitted the assignment of haplotypes even at relatively low frequency. In a separate analysis of bone marrow donors, Schmidt et al. (2009) performed a similar estimation of haplotype frequency, using HLA-A, B, C and DR donors typed at the allele level. Using this information, the authors were able to predict the probability of a given recipient finding a matched donor in a registry of a given size.
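The link between haplotype frequencies and the chance of finding a matched donor can be sketched as follows. This is a deliberately simplified illustration and not the method used in the studies cited above: it assumes Hardy-Weinberg proportions, a patient and donors drawn from the same population, and a hypothetical set of haplotype frequencies that sum to 1.

```python
def match_probability(hap_freqs, registry_size):
    """Probability that a random patient finds at least one identical donor.

    hap_freqs: dict mapping a haplotype label to its population frequency
               (hypothetical values, assumed to sum to 1)
    registry_size: number of unrelated donors in the registry
    """
    freqs = list(hap_freqs.values())
    total = 0.0
    for i, fi in enumerate(freqs):
        for j in range(i, len(freqs)):
            fj = freqs[j]
            # Genotype frequency under Hardy-Weinberg proportions.
            g = fi * fi if i == j else 2 * fi * fj
            # A patient with this genotype (probability g) fails to find a
            # match only if none of the donors carries the same genotype.
            total += g * (1 - (1 - g) ** registry_size)
    return total


# Toy population with three haplotypes of unequal frequency.
toy = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
for n in (1, 10, 100):
    print(n, round(match_probability(toy, n), 3))
```

Real HLA haplotype distributions contain very many rare haplotypes, which is why the probability of a match rises slowly with registry size for some patients.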
The sequencing of the human genome in 2001 (Lander et al., 2001; Venter et al., 2001) and the mapping of single nucleotide polymorphisms (SNPs) created the need to define haplotypes across the entire genome. As with the HLA system, a mathematical approach to haplotyping based on measuring linkage disequilibrium was used, and several methods have been developed. A full review of mathematical models of genomic haplotyping is provided by Bonizzoni et al. (2003).
The genome consists of haplotype blocks; therefore, a single nucleotide polymorphism can act as a marker for a disease gene as a result of LD, if it is located in the same haplotype block. One of the issues surrounding the use of tag SNPs is the variation observed in LD between racial and ethnic groups (Di Paola et al., 2005; Du et al., 2007; Seyerle et al., 2014).
Despite this observation and despite other inherent risks (Barton et al., 2001; Hodge et al., 1999; Reich et al., 2001), many papers appear in the literature relying on linkage disequilibrium to assign putative SNP haplotypes, or searching for SNP haplotype disease markers (Hao et al., 2020; Hohjoh et al., 1999; Mori et al., 2001; Yang et al., 2020).
LD is the approach most commonly used when analysing two or more single nucleotide polymorphisms throughout the genome. In many cases, LD provides a high level of accuracy in the absence of definitiveness (Hodge et al., 1999). Another issue to consider when applying LD to assign disease haplotypes is that, at least as observed in the HLA system, the pattern of LD observed in disease can be different from that observed in controls (Hanifi Moghaddam et al., 1998; Tait et al., 1995) when selective pressure is applied to two neighbouring genetic regions. This presents a problem in relying on LD for definitive haplotyping in disease studies.
2.3 Long range sequencing
First, next generation sequencing (NGS) and then long-range sequencing (LRS) have provided a platform in combination with other unique technical and software approaches for defining haplotypes (Chintalaphani et al., 2021; Hosomichi et al., 2015; Mantere et al., 2019; Midha et al., 2019; Peters et al., 2012; Weitzel et al., 2007), which was not possible with conventional sequencing.
Although longer reads have been reported, the most commonly reported read length is of the order of 10–20 kb. In the future, it is to be expected that the reliable reproducible range will increase. A technique called Hi-C has been added to the sequencing armamentarium which makes it possible to sequence entire chromosomes in haploid phase (Lieberman-Aiden et al., 2009). Used in conjunction with Falcon-phase, up to 90% accuracy in phasing human haplotypes was achieved (Kronenberg et al., 2021). LRS and Hi-C require a great deal of sophisticated software to achieve accurate phase.
2.4 Single chromosome sequencing
Only accuracy in excess of 98% is satisfactory for phasing in clinical medicine. Conceptually, single chromosome sequencing would replace all the above techniques and provide unambiguous haploid information. Unfortunately, to date this has not been achieved in a reproducible manner. In 2011 Christina Fan, Stephen Quake and co-workers from the Bioengineering Department, Stanford University, USA (Fan et al., 2011) described a microfluidic device which was able to trap single metaphase cells, lyse them, protease digest the cytoplasm and collect single chromosomes which could be amplified by multiple strand displacement amplification (MSDA) and then sequenced. This was an exciting development which does not seem to have been pursued, at least to the market stage, by individual research groups or commercial companies. While single cell sequencing has been pursued, the determination of genetic phase or isolation of single chromosomes has not been the focus of further work for this group (Gawad et al., 2016).
3 EXAMPLES IN CLINICAL MEDICINE WHERE KNOWLEDGE OF GENETIC PHASING IS IMPORTANT
3.1 Haematopoietic stem cell transplantation (bone marrow transplantation)
The term bone marrow transplantation refers to the time when this procedure was restricted to the use of syngeneic or allogeneic adult bone marrow to treat patients with leukaemias and other blood disorders whose bone marrow is depleted either as a result of a disease (aplastic anaemia) or treatment for conditions such as leukaemias. The use of cord blood or stem cell mobilized peripheral blood has broadened the definition so that haematopoietic stem cell transplantation (HSCT) is now the preferred term.
The first procedure to find an HLA compatible donor for a patient is to search within the nuclear family. Considering each parent has two HLA haplotypes, the chance of a sibling inheriting the two HLA haplotypes of the patient is 1:4. Given the average size of families in western countries is comparatively small, compared with previous generations, only approximately 30% of patients have an HLA-matched sibling.
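The relationship between the 1:4 chance per sibling and the overall chance of having a matched sibling can be worked through in a few lines of Python. This is a sketch under simplifying assumptions that are not stated in the text: each sibling inherits haplotypes independently, there is no recombination within the HLA region, and the four parental haplotypes are all distinct. The function names are illustrative.

```python
import math


def p_matched_sibling(n_siblings):
    """Probability that at least one of n siblings is HLA identical to the patient.

    Each sibling has a 1 in 4 chance of inheriting the same two parental
    haplotypes as the patient, so the chance that none match is (3/4)^n.
    """
    return 1 - 0.75 ** n_siblings


def siblings_for_probability(p):
    """Average number of siblings that would give a match probability of p."""
    return math.log(1 - p) / math.log(0.75)


for n in (1, 2, 3, 4):
    print(n, round(p_matched_sibling(n), 3))

# Number of siblings corresponding to a 30% chance of a matched sibling.
print(round(siblings_for_probability(0.30), 2))
```

Under these assumptions, a figure of approximately 30% corresponds to patients having a little more than one sibling on average.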
HLA cis phase (in the HLA field, this is commonly known as haplotypes) is established for the patient at the time of family studies. However, the World Marrow Donor Association (WMDA) has recently (July 2021) reported over 39 million donors, both adult and cord, registered on their database (World Marrow Donor Association https://wmda.info/), the overwhelming majority of whom do not have HLA haplotypes established. This means when the database is searched, the primary aim is to find donors who are matched at multiple HLA loci (usually A, B, C, DRB1, DQA1, DQB1 and sometimes DPB1). No information is provided on haplotype matching.
What is the evidence that patients who are transplanted with a haplotype-matched unrelated donor have superior clinical outcome compared with patients who receive an HLA allele-matched donor?
The Perth group led by Roger Dawkins and Frank Christiansen were at the forefront of defining and understanding the nature of HLA haplotypes. They demonstrated that HLA haplotypes are generated by combining segments of ancestral haplotypes (Christiansen et al., 1991; Degli-Esposti et al., 1992). They also demonstrated that recombination was not random across the HLA region of the MHC, and that haplotypes could be defined as blocks of DNA which are recombined to form new haplotypes. The regions were referred to as alpha, beta, gamma, delta and epsilon blocks. The alpha block includes HLA-A; the beta block HLA-B and HLA-C; the gamma block the class 3 region, which includes complement genes; the delta block the class 2 DR and DQ genes; and the epsilon block the DP genes. Using microsatellite (Msat) markers specific for each segment, the group were able to show that HLA-matched pairs were not necessarily haplotype matched.
The main criticism of the work, which can be made retrospectively, is that matching was performed largely by serology and not sequencing. However, in 2006 a follow-up paper using PCR primers for the beta block showed that in the presence of sequence identity at both HLA-B and C genes, there was evidence of haplotype disparity (Kitcharoen et al., 2006).
Further evidence was provided in three seminal papers by Effie Petersdorf's group from the Fred Hutchinson Centre in Seattle. The first (Malkki et al., 2005) described the Msats specific for HLA haplotypes which had the obvious potential of providing haplotype matching in HSCT. The second (Guo et al., 2006) described a method for separating haplotypes using genomic DNA from individuals of known HLA genotype. Probes specific for the two HLA-B alleles were adhered to a glass surface and hybridized with the genomic DNA. After washing off excess DNA, the two captured DNA fragments were placed in separate tubes and PCR primers were used to amplify the HLA-A and DRB1 genes.
Using this method of physical separation, Petersdorf et al. (2007) were able to study the clinical effect of HLA haplotype matching in 246 unrelated donor HSCT. Of the 246 recipients who were all HLA-A, B, C, DRB1 and DQB1 matched with their donors, 191 (78%) were haplotype matched with their donors while 55 (22%) were mismatched.
There was a strong association of HLA haplotype mismatching and increased levels of acute GVHD Grades 3 and 4. The effect was seen in the first few days, maximized at 20 days and then plateaued so that at 100 days, the difference in incidence of severe acute GVHD was of the order of 35% (approximately 25% in the matched group vs. approximately 60% in the mismatched group). In addition, a significant difference was seen in survival rates between the two groups.
These results demonstrated clearly the value of HLA haplotype matching in A, B, C, DR, DQ donor/recipient allele-matched pairs in HSCT. It is relevant to point out that HLA-C and DPB1 were not considered in this study; however, it is unlikely that the differences seen could be explained by incompatibilities at the DPB1 locus. Surprisingly, there has been no follow-up by other groups to use this method as a way of determining haplotype matching.
3.2 Immunotherapy check point inhibitors (CPI)
One of the more exciting developments over the last few years has been the application of CPIs to inhibit the growth of cancer cells (Zhang et al., 2021). One of the more successful examples is the clinical use in metastatic melanoma of the monoclonal antibody Ipilimumab which targets the human CTLA-4 (CD152) receptor (Langer et al., 2007).
When antigen is presented via MHC molecules to the responding T cells, two co-stimulatory molecules expressed on the T cell can either amplify (CD28) or inhibit (CTLA-4, also called CD152) the T-cell response. CTLA-4 binds to B7-1 (CD80) and B7-2 (CD86) as part of the inhibitory pathway. Blockade of CTLA-4 therefore has the effect of amplifying the anti-tumour response, as first described by James Allison (Leach et al., 1996), for which he and Tasuku Honjo received the Nobel Prize in 2018. CTLA-4 blockade has subsequently been shown to be particularly effective in melanoma (Attia et al., 2005; Hodi et al., 2010), which led to FDA approval for clinical use in 2011, making it the first approval of a CPI for use in human cancer treatment.
With respect to the question of genetic phase, it is noteworthy that CTLA-4 is polymorphic (Hodi et al., 2010). Several polymorphic sites have been identified within the non-coding and leader peptide coding region of the gene which appear to influence both expression and cancer risk (Wagner et al., 2021). These findings suggest that these disease associations are primarily due to differences in gene control and expression, but do not exclude the possibility that as yet unidentified coding variants may interfere with binding of CTLA-4 to its ligand.
Elegant crystallography studies have revealed the molecular interaction between Ipilimumab and the CTLA-4 receptor (Ramagopal et al., 2017). Critical residues of CTLA-4 involved in binding, as shown by creating mutants, are S20, R35, R40, Q76, D88, K95, E97, Y104, L106 and I108, none of which have been shown to date to be polymorphic. However, in other antibody/antigen systems, non-conservative changes in amino acid sequence which are not part of the binding site can be responsible for altering the structure of the antibody binding site which in turn compromises antibody binding. The full impact of the described polymorphisms will not be fully appreciated until accurate SNP phasing is established and a range of ethnic groups are studied. The importance of SNP phasing in CTLA-4 has been demonstrated in a study of liver cancer in Han Chinese. Different SNP haplotypes were shown to be both protective and susceptibility markers (Yang et al., 2019).
3.3 Micro-RNA molecules
Non-coding DNA was originally thought to be ‘junk’ DNA. However, since only approximately 2% of human DNA codes for proteins (Elgar & Vavouri, 2008), it is difficult to imagine that nature invests in this degree of redundancy, in the absence of a useful biological purpose.
The discovery of both long non-coding RNA and microRNA as essential components of the gene regulation network has been one of the most significant and exciting discoveries of the past few years, which has implications for a host of human diseases. In this review, I will concentrate on the micro-RNA molecules, their known polymorphisms, the implications on gene expression and the importance of genetic phase on understanding the precise effects these molecules have on biological processes.
Micro-RNA molecules (miRNA) are single-stranded RNA molecules, usually about 22 nucleotides long, which act at the post-transcription stage by binding to part of the 3′ regulatory region of the gene, thereby inhibiting translation of the gene. The first report of miRNA was described in the nematode Caenorhabditis elegans (Lau et al., 2001). Lau and colleagues described 55 miRNA molecules, the expression of which varied at different stages of development. In the same issue of the journal Science, Lagos-Quintana described miRNAs in both invertebrates and vertebrates (Lagos-Quintana, 2001). Zeng et al. (2002) demonstrated that miRNAs were produced in vivo in human cells and had the capacity to regulate the expression of genes. According to the latest version of miRBase, version 22 (miRNA Database, 2021), there are now 1881 loci which code for mature miRNA molecules. This therefore represents an enormously large and complex system for the regulation of gene expression.
From the viewpoint of genetic phase, there are convincing data that, despite the evolutionarily conserved nature of these molecules, polymorphisms in miRNA do occur (Carbonell et al., 2012).
There is now ample evidence in a range of different diseases including pre-eclampsia (Li et al., 2021), type 2 diabetes (Yan et al., 2021), lung fibrosis after lung transplantation (Wang et al., 2020), non-alcoholic fatty liver disease (López-Sánchez et al., 2021), liver cancer (Ghafouri-Fard et al., 2021) and multiple sclerosis (Shareef et al., 2021), that polymorphisms in the relevant miRNA can confer susceptibility and determine the effectiveness of suppression of gene expression. In addition, polymorphisms in the miRNA binding domain of target gene regions play a role in diseases such as colorectal cancer where SNPs in the solute carrier transporter gene predict clinical outcome in colorectal cancer (Bendova et al., 2020) and the KRT81 gene in breast cancer (Sha et al., 2020). There are numerous other examples in this ever-expanding field of research (Liu et al., 2021; Vohra et al., 2020).
This effect on gene expression has created a new paradigm for explaining, at a functional level, associations of disease with gene alleles. For example, the MHC contains a greater density of disease associations than any other part of the human genome (Clark et al., 2015). This group identified 89 miRNA transcripts (Clark et al., 2018) within the MHC, approximately 50% of which lie in linkage disequilibrium blocks that contain disease-related SNPs. These data hint that at least some HLA and disease associations are functionally related to levels of expression. An example is a SNP in the upstream region of HLA-C (−35C/T): the C allele is associated with a high level of HLA-C expression, and HIV+ individuals with this allele progress more slowly to AIDS (Thomas et al., 2009). The authors interpreted the results in terms of more effective viral peptide presentation to cytotoxic T cells. The C allele is a proxy for high HLA-C mRNA expression, which in turn makes for more effective control of HIV infection. About 3.5 million years ago, a sequence exchange event occurred between HLA-C and HLA-B giving rise to a variant which escaped miRNA regulation, the marker being the SNP at position −35. This sequence exchange resulted in 7 of the 14 HLA-C lineages identified today (O'hUigin et al., 2011). The important point to note from the viewpoint of phase is that the −35 SNP is in linkage disequilibrium with other polymorphisms which have been shown to directly affect the binding of miRNA 148a, such as the deletion at position 263 downstream of the stop codon of HLA-C.
The location of a target gene does not indicate where a miRNA which influences its expression is located. For example, in an elegant series of experiments studying follicular development of ovarian cells, Rooda et al. (2020) examined two previously identified miRNA (Velthut-Meikas et al., 2013) derived from the introns of two genes, namely FSHR (follicle-stimulating hormone receptor), located on chromosome 2, and CYP19A1 (cytochrome P450 family 19 subfamily A member 1), located on chromosome 15. These two miRNA molecules, namely miRNA 548bc and miRNA 7973, respectively, influence the expression of at least seven genes. miRNA 548bc influences expression of NEO1 (neogenin, a member of the immunoglobulin superfamily; chromosome 15), LIFR (leukaemia inhibitory factor receptor; chromosome 5), PTEN (phosphatase and tensin homologue; chromosome 10) and SP110 (nuclear body protein; chromosome 2). miRNA 7973 influences the expression of ADAM 15, a member of the ADAM metalloproteinase family located on chromosome 1; FMNL3, a formin-like gene found on chromosome 12; and PXDN, the peroxidasin gene found on chromosome 2. Further complexity in interpretation is added by the fact that these regulatory elements act in both a cis and trans configuration on their target genes (Elcheva & Spiegelman, 2020).
The discovery of the complexity of gene expression and the role that non-coding DNA plays in this process is arguably one of the most exciting and significant developments of this century. Control of gene expression via miRNA and other molecules will take years to fully understand. Elucidation of genetic phase in both polymorphic miRNA and their gene targets will play a vital role, and until genetic phase is definitively determined, the full story surrounding gene expression will not be fully understood.
ACKNOWLEDGEMENTS
The author wishes to acknowledge productive discussions over many years with the late Dr Malcolm Simons who was one of the first to recognise that “junk” DNA polymorphisms were “ordered” and haplotype specific, indicating a functional role. I also wish to recognize the insights and contributions from long term colleagues and friends Grant Mraz and Geoff Swanson and more recently from current work colleagues Anne Donaldson and Greg Allen.