Huntington’s disease age at motor onset is modified by the tandem hexamer repeat in TCERG1

Subject details

We analysed genetic and phenotypic data of 506 patients with HD from the EHDN REGISTRY study (http://www.ehdn.org16; initially we had 507 individuals, but then we excluded one individual with unreliably called TCERG1 QTR due to low sequencing depth coverage), and 104 individuals from the PREDICT-HD study50. Ethical approval for Registry was obtained in each participating country. Investigation of deidentified PREDICT-HD subjects was approved by the Institutional Review Board of Partners HealthCare (now Mass General Brigham). Participants from both studies gave written informed consent. Experiments were conducted in accordance with the declaration of Helsinki and ethical approval was Cardiff University School of Medicine SMREC 19/55.

DNA of the 506 REGISTRY HD individuals was provided by BioRep Inc. (Milan, Italy) from low-passage lymphoblastoid cells. For most of our HD patients (496 individuals), we measured the length of uninterrupted HTT exon 1 CAG repeat using an Illumina MiSeq platform51. For the remaining 10 individuals, we used BioRep CAG lengths determined using Registry protocols (https://www.enroll-hd.org/enrollhd_documents/2016-10-R1/registry-protocol-3.0.pdf). For individuals from the PREDICT-HD study, DNA was obtained from blood DNA and we used the CAG length recorded in the study. SNV genotype data were available for 468 of the REGISTRY individuals, as part of the GeM GWAS7.

Age at onset was assessed as described in Ref. 11. For REGISTRY age at motor onset data, where onset was classified as motor or oculomotor by the rating clinician, the clinician’s estimate of onset was used for onset estimation. For all other onset types, we used the clinical characteristics questionnaire for motor symptoms. PREDICT-HD age at motor onset was as recorded in the study, determined using the age where the diagnostic confidence level = 4. The selection of the REGISTRY and PREDICT-HD samples are described in detail in Ref. 11. Briefly, the REGISTRY samples were selected for having extreme early or late onset compared to that predicted by their CAG length. The PREDICT-HD samples were selected based on extreme predicted early or late onset. These originally constituted 232 individuals, of whom we analysed on those 104 who had a known age at motor onset.

Calling tandem hexamer from whole-exome-sequencing (WES) data

For the Registry-HD cohort (N = 506), sequencing was performed at Cardiff University15. Whole-exome libraries were generated using TruSeq® rapid exome library kits (Illumina, 20020617) according to Illumina protocols (https://emea.support.illumina.com/downloads/truseq-rapid-exome-library-prep-reference-guide-1000000000751.html). Libraries were sequenced on a HiSeq 4000 using 75 bp paired-end reads. For the PREDICT-HD participants, an in-solution DNA probe based hybrid selection method was used to generate Illumina exome-sequencing libraries. A HiSeq 2500 was used to generate 76 bp paired end reads. De-multiplexed reads for both sets of exomes were aligned using BWA v0.7.5a52, generating variant-ready binary alignment (BAM) files which were used for STR/QTR calling. Individuals with more than one sequencing run were merged into a single BAM file. The human genome assembly hg19 was used for sequence alignment. The genotyping was performed using universal variant caller (UVC) software openly available at https://github.com/LobanovSV/UVC.git. This software allows to call SNPs, INDELs, STRs, QTRs and any their combination in several steps: (i) align single reads to the reference genome using unique matching algorithm; (ii) remove reads with bad alignment score; (iii) find all possible combinations of insertions and deletions appearing from single read alignment; (iv) re-align reads to all of those combinations and choose the best one (i.e. having the smallest mismatch error); (v) correct alignment of single reads using their pairs as well as other reads; (vi) split reads into two (if possible) groups of similar size but with different allele sequence; (vii) correct read alignment using alignments of other samples. To align single reads to the reference genome, we construct the match matrix (Supplementary Fig. 9) and select the path minimising the mismatch error \(} = } + \mathop \nolimits_i^N } _}}} \cdot \exp \left( \right)\) Here, m is sum of mismatch nucleotides, N is number of gaps, Δi is the gap height (the distance between two match pieces adjoining the i-th gap), and li is minimal length of two match pieces. The mismatch error of this form takes into account the highly mutative nature of STRs/QTRs and allows to unbiasedly align reads with any combination of SNPs, INDELs, STRs and/or QTRs. For instance, the naive straight red line in Supplementary Fig. 9 has the mismatch error M = 2 (m = 2, N = 0), whereas the correct blue path with 3 hexamer deletion has much smaller mismatch error M = 6.2 × 10−13 (m = 0, N = 1, ∆i = 18, li = 36). If the right adjoining piece is located above the left one (for example, the break of the blue line in Supplementary Fig. 9), the gap is attributed as a deletion. Conversely, the discontinuity is attributed as an insertion. Finally, we create an alignment track with rows containing sequences of mapped paired reads. To simplify genotyping, we expand the sequence of the reference genome by inserting asterisks to the loci at which the reads have insertions. Conversely, we substitute nucleotide deletions in the reads by asterisks. This manoeuvre permits insertions and deletions to be treated as substitutions. After that we consider loci where some sequence reads have nucleotides different from the reference. We utilise these loci to retrieve the allele sequences by separating the reads into two groups in such a way that all reads in a single group have the same nucleotides at these loci.

Sanger sequencing to confirm QTR sequences

To validate our tandem hexamer calls from WES data, we performed Sanger sequencing of four samples: two homozygous for the reference QTR allele (A1/A1 genotype), one heterozygous for a shorter QTR allele (A1/A2 genotype), and one heterozygous for a longer QTR allele (A1/A4 genotype). The QTR locus in TCERG1 was amplified by PCR using forward (5′-AACTGACACCTATGCTTG-3′) and reverse (5′-GTTGAAGTGGATACTGCA-3′) primers as described in the reference9. Amplicons were Sanger sequenced (LGC, Germany) in both directions using forward (5′-AACTGACACCTATGCTTGCAG-3′) and reverse (5′-GAAGTGGATACTGCAGGTGC-3′) primers, and sequences compared to their respective calls from short-read exome-sequencing data. Sequences from Sanger and exome-sequencing matched in each of the four cases.

Measuring TCERG1 QTR lengths using capillary electrophoresis

To confirm TCERG1 QTR lengths derived from exome-sequencing data, the QTR locus in TCERG1 was amplified by PCR using a fluorescently labelled forward (5′-FAM-AACTGACACCTATGCTTG-3′) and unlabelled reverse (5′-GTTGAAGTGGATACTGCA-3′) primer before sizing by capillary electrophoresis (ABI 3730 genetic analyzer) and Genescan against a LIZ600 ladder of size standards (Thermofisher). In total we tested QTR length calls for 101 individuals from the Registry-HD sample: the 73 who had at least one non-reference QTR length allele (A2–A8) and 28 who were called as homozygous for the reference (A1) allele. The reference allele A1 was predicted to produce a PCR fragment of 307 bp. In all samples, this allele was consistently sized at 299 bp by capillary electrophoresis. We attributed this to the repetitive nature of the sequence and the specific analyzer used. In all 101 individuals tested, allelic QTR lengths relative to the reference A1 allele QTR length exactly matched those called using exome-sequencing data.

Calculation of age at onset residuals

Expected ages of onset were calculated from patient CAG length data (measured as described above) using the Langbehn model53. Residual ages at motor onset were then calculated taking the difference between the expected onset from the recorded clinical age at motor onset, as performed elsewhere7.

Association of age at onset with STR/QTR repeats

Linear regression was performed of the age at onset residual on the repeat statistic (sum, diff, max, min, #3 rep). Since the sample was selected to have extreme values (positive and negative) of this residual, linear regression is likely to overestimate the effect of the repeat on onset in the general HD patient population. Therefore, regression with selection (see “Regression with selection” section below) was used to estimate the true effect size. A dichotomous phenotype was derived by selecting individuals with extreme late (positive residual greater than a pre-defined criterion) or early (negative residual less than a pre-defined criterion) onset. Association of the dichotomous phenotype with repeat statistic was tested via logistic regression.

To formally test which repeat statistics best predict age at onset, we proceeded as follows: For each pair of statistics A and B, a linear regression of residual age-at-onset on statistic A was performed as a baseline. Then statistic B was added to the regression and the significance of the improvement in fit assessed using ANOVA. Statistics were defined as “best fitting” if the addition of no other statistic gave a significant improvement in fit.

Regression with selectionInitial selection

We performed whole-exome sequencing of a small sub-group of the EHDN REGISTRY study. To increase statistical power, we selected individuals with the largest absolute value of the residual age at onset R.

The probability density function of the residual ages at onset in the selected sample p(R) is that of a normal distribution with mean 0 and standard deviation σ

$$p_\left( R \right) = \frac }} \cdot }^}}}$$

(1)

multiplied by a selection function

$$S\left( R \right) = \frac}}}\left( }}}} - \left| R \right|}}} \right)}}$$

(2)

and normalised to have unitary integral

$$p\left( R \right) = \frac\left( R \right) \cdot S(R)}}p_\left( R \right) \cdot S\left( R \right)}R}}.$$

(3)

Here, σ is the standard deviation of the initial HD population (EHDN REGISTRY study), Rthr is the selection threshold, and ∆ is the selection width, which was infinitely small, ∆→0.

The expected probability density of the initial HD population \(p_\left( R \right)\), selection function S(R), and expected probability density p(R) of the selected HD sub-group are shown in Supplementary Fig. 10.

Correction of the age at onset residuals

To improve the accuracy of the correction of age at onset for CAG length, we additionally measured the length of the uninterrupted HTT exon 1 CAG repeat using an Illumina MiSeq platform for 496 individuals from our HD cohort and corrected the HD age at onset residuals. Some individuals who had age at onset residual above the threshold Rthr shifted to the region with |R| below the threshold Rthr after correction. Conversely, some individuals who would have corrected age at onset residual above the threshold Rthr, were not selected because their uncorrected |R| were below the threshold Rthr. The correction has therefore widened the selection function, corresponding to a non-zero selection width ∆.

The probability density function p(R) of our HD group with corrected residuals can be modelled in the same way as described above, but with finite selection width ∆.

We estimated parameters of the selection function S(R) by minimising maximum absolute difference D between the expected \(F_e\left( \right)\) and observed \(F_o\left( \right)\) cumulative probabilities

$$D = \mathop }}}}}\limits_ \left| \right) - F_o\left( \right)} \right|.$$

(4)

The observed cumulative probability \(F_o\left( \right)\) and expected one \(F_e\left( \right)\) with optimal parameters \(\sigma = 7.02,R_}}}} = 17.6,\Delta = 3.30\) are shown in Supplementary Fig. 11a. The one-sample Kolmogorov–Smirnov p-value is 0.81. The selection and probability densities are shown in Supplementary Fig. 11b, c.

Likelihood function

In the linear regression, the errors

$$\varepsilon _i = R_i - (\beta _0 + \beta _1 \cdot x_i)$$

(5)

are normally distributed \(\varepsilon _i\sim (0,\sigma ^2)\) and are independent across individuals. The likelihood \(}}}_}}}}\left( \right)\) is

$$}}}_}}}}\left( \right) = p_\left( \right),$$

(6)

where \(p_\left( \right)\) is the probability density function of normal distribution, Eq. (1). Here, σ, β0, and β1 are unknown standard deviation, intercept, and effect size, respectively; Ri and xi are age at onset residual and sum of TCERG1 QTR lengths of a specific individual.

In the regression with selection, the distribution of the errors differs between individuals. The likelihood \(}}}\left( \right)\) is

$$}}}\left( \right) = \frac\left( \right) \cdot S(R_i)}} } \left( \right) \cdot S\left( R \right)}R}}.$$

(7)

Note, the integral in the denominator depends on the individual’s sum of QTR lengths xi. Here, S(R) is the selection function, Eq. (2).

We estimated the unknown parameters σ, β0, and β1 by maximising the likelihood function of all observations

$$}}}\left( \right) = \mathop \limits_i }}}\left( \right)$$

(8)

and used the likelihood-ratio test (comparing the likelihood maximised over σ, β0, β1 to that maximised over σ and β0, holding β1 = 0) to obtain the significance of the association.

Analyses to test for correlation between genetically predicted expression and age at onset

FUSION22 was used to perform TWAS analyses on the PsychENCODE data using pre-computed predictors downloaded from http://resource.psychencode.org/. Summary Mendelian Randomisation was used to perform TWAS analyses on eQTLGen blood expression using cis-eQTL data downloaded from https://www.eqtlgen.org/cis-eqtls.html . Co-localisation analyses to test if eQTL and age at onset signal share the same causal SNV were performed using COLOC54.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

留言 (0)

沒有登入
gif