Resolving unknown nucleotides in the IPD-IMGT/HLA database by extended and full-length sequencing of HLA class I and II alleles

Full-length sequencing of incomplete alleles

The sequences of 19 HLA class I (Table 1, part A) and 7 HLA class II (Table 1, part B) alleles were extended to full-length sequences. In most cases, only the sequences of exons 2 and 3 for HLA class I and exon 2 for HLA class II were present in the IPD-IMGT/HLA database; there was no information on the non-coding sequence available for any of these alleles. For HLA class I, we resolved 8638 unknown nucleotides for the coding sequence and 51,582 for the non-coding sequence; for HLA class II, we resolved 2139 unknown nucleotides for the coding and 39,167 for the non-coding region.

HLA class I

Coding nucleotides could be added to the database for 18 out of the 19 HLA class I alleles, because of previously unknown exon sequences (Table 1, part A). Only for B*44:13 could no coding nucleotides be added, because all exon sequences were already known. For each allele, the number of coding and non-coding nucleotides that could be added to the database is indicated in Table 1, part A. In 14 out of 19 alleles, no differences with the reference alleles were found in the non-coding regions. In 5 out of 19 alleles, there was a difference in the non-coding (5′ UTR/introns/3′ UTR) regions of the allele compared to the first allele of the same allele group (indicated in Table 1, part A by a hash symbol). These differences are listed in Table 2. More detailed evaluation of possible recombinations showed the following:

Table 2 Alleles with differences in the non-coding (5′ UTR/introns/3′ UTR) sequences compared to the reference allele

Comparing B*53:06 with B*53:01:01:01 revealed a total of 10 differences: 1 in exon 2, 1 in intron 2, 2 in exon 3, and 6 in the 3′ UTR (Table 3). These differences were not found in any of the other B*53 alleles, and therefore the sequence was compared to all other HLA-B alleles in the IPD-IMGT/HLA database. The sequence of B*53:06 turned out to be identical to B*51:01:01:04 up to genomic position 726 and from position 810 to the end of the 3′ UTR, whereas the part between 695 and 900 is identical to B*53:01:01:01 and many other B*53 alleles. Therefore, the B*53:06 allele may have arisen by a gene conversion event with a double cross over, resulting in a recombination of B*51:01:01:04 with a B*53 allele that exchanged the first part of exon 3, with breakpoints between 695–726 and 810–900 (Table 3). At its first identification, this allele was already serologically typed as a B53/B51-like variant (Anholts et al. 2001). The expert-assigned type was also B53/B51 (Holdsworth et al. 2009), the neural network assignment was B53 (Maiers et al. 2003), and the recently published systematic classification of serological specificities also assigned a B53 serotype, with the accompanying comment of being short cross-reactive (Osoegawa et al. 2022).

Table 3 Comparison of part of the sequences of B*53:01:01:01, B*53:06, and B*51:01:01:04 illustrating that B*53:06 may have arisen by a gene conversion event with double cross over, recombining B*51:01:01:04 and a B*53 allele
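The breakpoint reasoning used here can be illustrated with a minimal sketch. It assumes three sequences that are already aligned to the same length and uses short toy strings, not real HLA-B sequences. The function names `parental_calls` and `breakpoint_intervals` are hypothetical and are not taken from the study. Only columns where the two candidate parental sequences differ are informative, and each switch of parental origin between two consecutive informative columns bounds a possible breakpoint.

```python
from typing import List, Tuple


def parental_calls(query: str, parent_a: str, parent_b: str) -> List[Tuple[int, str]]:
    """Label each informative column (parents differ) by the parent the query matches.

    Positions are 0-based alignment indices. Columns where the query matches
    neither parent are labelled "?".
    """
    if not (len(query) == len(parent_a) == len(parent_b)):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    for i, (q, a, b) in enumerate(zip(query, parent_a, parent_b)):
        if a == b:
            continue  # both parents agree, so this column says nothing about origin
        if q == a:
            calls.append((i, "A"))
        elif q == b:
            calls.append((i, "B"))
        else:
            calls.append((i, "?"))
    return calls


def breakpoint_intervals(calls: List[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Return (left, right) informative positions between which the origin switches."""
    intervals = []
    for (pos1, label1), (pos2, label2) in zip(calls, calls[1:]):
        if label1 != label2 and "?" not in (label1, label2):
            intervals.append((pos1, pos2))
    return intervals


# Toy example: the query follows parent A, then parent B, then parent A again,
# which is the pattern expected for a double cross over.
toy_parent_a = "AAAAAAAAAAAA"
toy_parent_b = "CACACACACACA"
toy_query = "AAAACACAAAAA"

toy_calls = parental_calls(toy_query, toy_parent_a, toy_parent_b)
toy_breaks = breakpoint_intervals(toy_calls)
print(toy_breaks)  # two intervals, one per cross over
```

A breakpoint can only be narrowed down to the interval between two informative positions, which is why the text reports ranges such as 695–726 and 810–900 rather than single positions.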

Comparing the full-length sequence (including exons) of C*03:04:19 with C*03:04:01:01 revealed only 2 nucleotide differences, one in exon 3 at position 993 (G > A) and one in intron 3 at position 1030 (G > A). Since both A’s are present in most C*07 alleles, C*03:04:19 could be the result of a recombination between C*03:04 and a C*07 allele, in which a part of C*07 between positions 970 and 1080 was exchanged into the C*03:04 allele. Apart from the majority of the C*07 alleles, the presence of both A’s (993 and 1030) together was only detected in 2 other C alleles, C*12:181 and C*15:02:33:01. It is tempting to speculate that these alleles also arose in the same way.

For B*27:23, we did not observe any differences with B*27:01 in the non-coding sequences (Table 1, part A). However, the latter sequence is only known up to position 2707, 30 nucleotides after the stop codon. Therefore, the remaining nucleotides of the 3′ UTR of B*27:23 were also compared with B*27:04:01 and B*27:05:02:01, the two sequences that are known up to the end at position 3799. Compared with B*27:05:02:01, the B*27:23 allele shows no differences in the 3′ UTR, whereas there are 4 differences with B*27:04:01 (T2812C, T3474C, C3574T, T3611C). At the first discovery of this allele, it was already noticed that it arose by a gene conversion event between B*27:05:02 and a B*35 allele (Darke et al. 2002).

HLA class II

For all HLA class II alleles, both coding and non-coding nucleotides could be added to the database (Table 1, part B); in none of the cases was any non-coding sequence previously available. In all alleles, there was a difference in the non-coding (5′ UTR/introns/3′ UTR) regions of the allele compared to the first allele of the same allele group, and these differences are listed in Table 2. Details about some interesting observations are as follows:

Comparing DQA1*01:06 with DQA1*01:01:01:01 revealed a striking phenomenon. Whereas the 5′ and 3′ UTR sequences of DQA1*01:06 are identical to DQA1*01:01:01:01, there are many differences between these alleles in the intron sequences (Table 4). In fact, all intron sequences of DQA1*01:06 are identical to DQA1*01:02:01:04/14, but there are differences in the 5′ and 3′ UTR between these alleles. Concerning the exon sequences, DQA1*01:06 is identical to DQA1*01:02:01:04/14 except for one nucleotide in exon 2 at position 3974 (gDNA; position 199 in the cDNA, codon 44). This nucleotide was reported as a unique mutation for DQA1*01:06 at the first discovery of this allele (Luo et al. 1999), and it still is: no other DQA1 allele has a G at this position, and in all other alleles an A nucleotide is conserved. Altogether, this suggests that DQA1*01:06 is a unique allele that may have arisen by a double cross over event between DQA1*01:01:01:01 and one of the alleles DQA1*01:02:01:04 or DQA1*01:02:01:14, with an additional point mutation at position 3974 (Table 4).

Table 4 Comparison of part of the sequences of DQA1*01:01:01:01, DQA1*01:06, and DQA1*01:02:01:04/14 illustrating that DQA1*01:06 may have arisen by a gene conversion event with double cross over, recombining DQA1*01:01:01:01 and DQA1*01:02:01:04/14, and an additional mutation at position 3974

Besides a difference in exon 3, DQA1*04:04 has another difference with DQA1*04:01:01:01, namely an insertion of an A at position 3205, resulting in a homopolymer of 13 A nucleotides compared to 12 in DQA1*04:01:01:01. Of the 22 DQA1*04 alleles for which the intron 1 sequence is known, 5 have a homopolymer of 12 A, 13 have a homopolymer of 13 A, and 4 have a homopolymer of 14 A. All sequencing methods have difficulty sequencing homopolymers accurately. We used 3 different sequencing methods for this part of the DQA1*04:04 allele, and with all three methods the different analysis programs identified 13 A nucleotides at this position; nevertheless, the real number of A nucleotides in this homopolymer remains hard to determine.
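As a small illustration of the homopolymer comparison, the sketch below counts the length of the run of identical bases that contains a given position and checks whether several independent calls agree. It is not the analysis software used in the study. The toy sequence, the method labels, and the function names `homopolymer_length` and `consensus_length` are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def homopolymer_length(seq: str, pos: int) -> int:
    """Length of the run of identical bases that contains 0-based index pos."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    return end - start + 1


def consensus_length(calls: Dict[str, int]) -> Tuple[int, bool]:
    """Return the most frequently called run length and whether all calls agree."""
    counts = Counter(calls.values())
    length, support = counts.most_common(1)[0]
    return length, support == len(calls)


# Toy read with a run of 13 A nucleotides flanked by other bases.
toy_read = "GC" + "A" * 13 + "TG"
run_length = homopolymer_length(toy_read, 5)

# Hypothetical per-method calls; the labels are placeholders only.
method_calls = {"method_1": 13, "method_2": 13, "method_3": 13}
agreed_length, unanimous = consensus_length(method_calls)
```

Agreement between methods raises confidence in a call but does not remove the uncertainty described above, since different platforms can share the same systematic bias in homopolymer regions.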

Comparing the full-length sequence of the DQB1*03:114 allele with all other DQB1*03 alleles revealed huge differences in both coding and non-coding nucleotide sequences between the alleles belonging to the serological specificity DQ7 and those with DQ8/DQ9 specificity, whereas there are only a few differences between DQ8 and DQ9. The full-length sequence of the allele DQB1*03:114 could be identified as a DQ7 sequence. Since the serological specificity depends solely on the serological reaction with specific antibodies against the beta-1 domain of the HLA molecule, we also checked the exon 2 sequence of this allele. Osoegawa et al. (2022) determined that the critical residues for DQ7/8/9 are amino acids 45, 57, 74, 84, and 85; the latter 3 are identical for all 3 types, but important for distinguishing them from the other serological DQ types. In fact, the difference between DQ7, DQ8, and DQ9 only depends on amino acids 45 and 57, where ED or EA corresponds to DQ7, GA to DQ8, and GD to DQ9. The allele DQB1*03:114 has ED at these positions, clearly identifying this allele as a DQ7 serotype, which fits with the full-length sequence. To investigate whether these full-length sequence differences always fit with the serological type, we prepared a phylogenetic tree based on the full-length sequences (−150–6502), without exon 2, of the DQB1*03 alleles available in the IPD-IMGT/HLA database, depicted in Fig. 1. This phylogenetic tree shows three distinct groups, one DQ7-like, one DQ8-like, and one DQ9-like, with less distance between DQ8 and DQ9 because of fewer differences. Comparing the alleles in these groups with their final serological assignment based on the two amino acids in exon 2, as compiled by Osoegawa et al. (2022) (Supplemental Table 5 of their paper), revealed 2 alleles for which the serological type does not fit with the full-length sequence type, namely DQB1*03:10 and 03:12. Both are serologically typed as DQ9 according to Osoegawa et al. (2022), whereas their full-length sequence is DQ7-like. DQB1*03:10 was previously identified by the experts as DQ3, by the neural network as DQ7 (Maiers et al. 2003), and assigned by the WHO as DQ8 (Barker et al. 2023), whereas DQB1*03:12 was identified as DQ9 by both the experts and the neural network. Both alleles have sequences completely identical to DQB1*03:01:01 except for exon 2, with 1 (DQB1*03:10:01), 2 (DQB1*03:10:02), or 4 (DQB1*03:12) nucleotide differences, changing the crucial amino acid 45 from E to G. Whether these alleles evolved from a double recombination between DQB1*03:01:01 and DQB1*03:03 or whether they arose by point mutation(s) in exon 2 is not clear.
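The two-residue rule described above can be written down as a small lookup, shown here as a minimal sketch. It encodes only the combinations stated in the text (ED or EA for DQ7, GA for DQ8, GD for DQ9) and assumes that residues 74, 84, and 85 have already established the allele as belonging to the DQ7/8/9 group. The names `DQ3_SPLIT_BY_RESIDUES` and `dq3_split` are hypothetical.

```python
from typing import Optional

# Amino acids at positions 45 and 57 of the DQ beta chain, as stated in the text.
DQ3_SPLIT_BY_RESIDUES = {
    ("E", "D"): "DQ7",
    ("E", "A"): "DQ7",
    ("G", "A"): "DQ8",
    ("G", "D"): "DQ9",
}


def dq3_split(aa45: str, aa57: str) -> Optional[str]:
    """Return DQ7, DQ8, or DQ9 for a residue pair, or None if the pair is not covered."""
    return DQ3_SPLIT_BY_RESIDUES.get((aa45.upper(), aa57.upper()))


# DQB1*03:114 carries E at position 45 and D at position 57.
serotype_03_114 = dq3_split("E", "D")
```

Because position 45 alone separates DQ7 (E) from DQ8/DQ9 (G) in this rule, a single amino acid change from E to G in an otherwise DQ7-like sequence is enough to change the predicted serotype, as seen for DQB1*03:10 and DQB1*03:12.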

Fig. 1

Phylogenetic tree of HLA-DQB1*03. A multiple sequence alignment was performed on the full-length sequences from position −150 up to 6502 without exon 2 from all DQB1*03 alleles for which this sequence was available in the IPD-IMGT/HLA database (vs. 3.51), excluding null and Q alleles. Since Osoegawa et al. (2022) already identified DQB1*03:06 and 03:25 as DQ4 serotypes, we have excluded these alleles from this phylogenetic tree. Details of the alleles are listed in Supplemental Table 1A. The outlier, DQB1*03:72, was serotyped DQ9 by Osoegawa et al. (2022), but the full-length sequence was completely identical to DQB1*04:02:01:04, except exon 2. The scale bar indicates the length of the tree edges, corresponding to the differences between two allele sequences as calculated by the R SeqinR package (Charif and Lobry 2007)

Comparing the DQB1*06 full-length sequences of the three alleles in this study, DQB1*06:03:11, DQB1*06:73, and DQB1*06:286, with the first allele of this allele group, DQB1*06:01:01:01, we noticed that the DQB1*06:01 alleles showed huge differences with all other DQB1*06 alleles in both the coding and non-coding part, except 06:103 and 06:243, implying that these alleles belong to one lineage within DQB1*06. Between the other DQB1*06 groups, differences were also observed in both coding and non-coding regions, but to a lesser extent. To determine the lineages within the DQB1*06 allele group, we prepared a phylogenetic tree, shown in Fig. 2, using the full-length sequences (−30–6410) of all available DQB1*06 alleles in the IPD-IMGT/HLA database. As expected, there was one lineage that had a huge evolutionary distance to the other lineages and was composed of all 8 DQB1*06:01 alleles, DQB1*06:103, and DQB1*06:243. The other DQB1*06 alleles could be divided into 3 further lineages: DQB1*06:02-like, composed of DQB1*06:02 and similar alleles; DQB1*06:03-like; and DQB1*06:04-like, to which the more frequent DQB1*06:09 alleles also belonged. This tree also shows that there are fewer differences between DQB1*06:02-like and 06:03-like than between DQB1*06:02-like and 06:04-like or between 06:03-like and 06:04-like. This information might be helpful with regard to matching possibilities and antibody definition.

Fig. 2

Phylogenetic tree of HLA-DQB1*06. A multiple sequence alignment was performed on the full-length sequences from position −30 up to 6410 from all DQB1*06 alleles for which this sequence was available in the IPD-IMGT/HLA database (vs. 3.51). Details of the alleles are listed in Supplemental Table 1B. The scale bar indicates the length of the tree edges, corresponding to the differences between two allele sequences as calculated by the R SeqinR package (Charif and Lobry 2007)
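The trees are built from pairwise differences between aligned allele sequences. The captions state that the R SeqinR package was used but do not give the exact function or settings, so the sketch below is an assumption for illustration only. It computes a plain proportion of differing sites (p-distance) in Python, with an optional masked region that mimics leaving out exon 2 as done for Fig. 1. The allele names, toy sequences, and the function names `p_distance` and `distance_matrix` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def p_distance(seq_a: str, seq_b: str, mask: Optional[range] = None) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns inside `mask` (0-based alignment indices) and columns with a gap
    in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for i, (x, y) in enumerate(zip(seq_a, seq_b)):
        if mask is not None and i in mask:
            continue
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return differing / compared


def distance_matrix(
    alignment: Dict[str, str], mask: Optional[range] = None
) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every pair of names, keyed by the sorted name pair."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2], mask)
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy alignment with placeholder names; these are not real DQB1 sequences.
toy_alignment = {
    "allele_x": "ACGTACGTAC",
    "allele_y": "ACGTTCGTAC",
    "allele_z": "ACCTTCGAAC",
}
full_distances = distance_matrix(toy_alignment)
masked_distances = distance_matrix(toy_alignment, mask=range(4, 6))
```

A distance matrix of this kind is the usual input for distance-based tree building; masking a region changes which alleles cluster together, which is the reason for excluding exon 2 when testing whether the rest of the sequence agrees with the serological type.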

Addition of 5′ and 3′ UTR sequences of HLA class I

During full-length sequencing for diagnostic purposes, we noticed that some full-length sequenced alleles lacked parts of the 5′ and/or 3′ UTR sequence. Table 5 shows the 47 alleles for which we added 5′ and/or 3′ UTR sequences to the already existing allele sequence. In total, we added 5.5 kb of previously unknown nucleotides to the 5′ UTR sequence of 24 alleles and > 31.7 kb to the 3′ UTR sequence of 47 alleles. In none of the cases did comparison of these sequences with the reference allele reveal a difference in the 5′ UTR region. However, in 24 of the cases, differences were found in the 3′ UTR region (Table 5).

Table 5 HLA class I alleles extended with 5′ and 3′ UTR sequences with the EMBL-ENA accession numbers, IPD-IMGT/HLA submission numbers and the number of non-coding nucleotides for 5′ and 3′ UTR added to the database

In some cases, the sequence could not be completely compared with the first allele of the allele group, because of the missing part of the 3′ UTR sequence. In those cases, the next allele from the allele group with sufficient sequence information was chosen as reference sequence, indicated with a hash symbol in Table 5.

In most cases, the differences in the resolved 3′ UTR region were also observed in other alleles of the same allele group. Some interesting cases are described in more detail here.

Compared to B*07:02:01:01, B*07:436 showed 7 nucleotide differences in the part of the 3′ UTR between positions 3039 and 3698 that we added to the existing sequence. The 3′ UTR sequence of B*07:436 was rather different from that of any B*07 allele, and comparison with other B alleles showed that this sequence was identical to many of the B*37 alleles. The B*07:436 allele was previously known as B*07:02:06, because only 3 silent substitutions were found in exon 4 compared with B*07:02:01:01 (Anholts et al. 2009). Sequencing of the other exons revealed non-silent differences in exons 5 and 7, and therefore the allele was renamed in 2021 (Barker et al. 2023). In fact, comparison of the full-length sequences of B*07:02:01:01, B*07:436, and all B*37 alleles clearly shows that the 5′ end of B*07:436 is identical to B*07:02:01:01 up to position 1526, whereas the 3′ end is identical to many B*37 alleles from position 1411 onwards. This suggests that B*07:436 is a recombinant between B*07:02:01:01 and one of the B*37 alleles, with the breakpoint somewhere between positions 1411 and 1526. Since we had offspring of this individual, we were able to determine the haplotype on which B*07:436 was located, being A*01:01:01, B*07:436, C*06:02:01, DRB1*15:01:01, DRB5*01:01:01, DQA1*01:02:01, and DQB1*06:02:01. Concerning the B~C association, C*06:02 is very commonly found with B*37 and only rarely with B*07 in the European population (Gragert et al. 2013), which is concordant with a recombination event.

In line with our previous study (Voorter et al. 2018), we observed differences in the introns and 3′ UTR in the B*18 group. The differences are limited to certain positions and in fact point to distinct evolutionary lineages within B*18. To investigate whether there is indeed a separation into two lineages, we prepared a phylogenetic tree using the 5′ UTR, intron, and 3′ UTR sequences available in the IPD-IMGT/HLA database from position −20 up to position 3500, including all B*18 alleles for which these sequences are known. Figure 3 shows a clear separation between the B*18:01:01:01-like and B*18:01:01:02-like sequences, which differ at intron 3 position 1127 (T/C), intron 5 position 2180 (A/G), and 3′ UTR positions 3014 (T/C), 3358 (C/T), and 3472 (C/T). Positions 3609 (C/T) and 3668 (C/A), which also showed a clear association with the two lineages, could not be taken into account, because this part of the sequence is unknown for too many alleles.

Fig. 3

Phylogenetic tree of HLA-B*18. A multiple sequence alignment was performed on the 5′ UTR, intron, and 3′ UTR sequences from position −20 up to 3500 from all B*18 alleles for which these sequences were available in the IPD-IMGT/HLA database (vs. 3.51). Details of the alleles are listed in Supplemental Table 1C. Of the two alleles outside the two clusters, B*18:03:01:02 has 2 nucleotides identical to B*18:01:01:01 (1127T, 3358C) and 3 identical to B*18:01:01:02 (2180G, 3014C, 3472T), and B*18:01:01:18 has 3 nucleotides identical to B*18:01:01:01 (1127T, 2180A, 3472C) and 2 identical to B*18:01:01:02 (3014C, 3358T). The scale bar indicates the length of the tree edges, corresponding to the differences between two allele sequences as calculated by the R SeqinR package (Charif and Lobry 2007)
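The five lineage-associated positions can be used to score an allele against both B*18 lineages, as in the minimal sketch below. The marker profiles are derived from the positions and nucleotides given in the text and in this caption; the T at position 3014 for the B*18:01:01:01-like lineage is inferred from the order of the listed alternatives (T/C) and should be read as an assumption. The names `LINEAGE_MARKERS`, `marker_matches`, and `assign_lineage` are hypothetical.

```python
from typing import Dict

# Nucleotides at the five lineage-associated positions (genomic numbering from the text).
LINEAGE_MARKERS: Dict[str, Dict[int, str]] = {
    # 3014T is inferred from the "T/C" ordering in the text (assumption).
    "B*18:01:01:01-like": {1127: "T", 2180: "A", 3014: "T", 3358: "C", 3472: "C"},
    "B*18:01:01:02-like": {1127: "C", 2180: "G", 3014: "C", 3358: "T", 3472: "T"},
}


def marker_matches(observed: Dict[int, str]) -> Dict[str, int]:
    """Count, per lineage, how many marker positions the observed nucleotides match."""
    return {
        lineage: sum(observed.get(pos) == base for pos, base in markers.items())
        for lineage, markers in LINEAGE_MARKERS.items()
    }


def assign_lineage(observed: Dict[int, str]) -> str:
    """Return the lineage whose markers all match, or "mixed" if neither matches fully."""
    for lineage, count in marker_matches(observed).items():
        if count == len(LINEAGE_MARKERS[lineage]):
            return lineage
    return "mixed"


# The two alleles outside the clusters, with nucleotides as given in the caption.
b18_03_01_02 = {1127: "T", 2180: "G", 3014: "C", 3358: "C", 3472: "T"}
b18_01_01_18 = {1127: "T", 2180: "A", 3014: "C", 3358: "T", 3472: "C"}

scores_03_01_02 = marker_matches(b18_03_01_02)
scores_01_01_18 = marker_matches(b18_01_01_18)
```

Both outlier alleles score as "mixed" under this rule, which matches their placement outside the two clusters in the tree.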
