Rare tandem repeat expansions associate with genes involved in synaptic and neuronal signaling functions in schizophrenia

We used WGS data from 1159 schizophrenia cases and 936 ancestry-matched population controls from Sweden that were previously generated by our group [19] (Fig. 1). To estimate population frequency of tandem repeats, we included WGS data for 2504 genomes sequenced by the 1000 GP [27]. All WGS data were sequenced on Illumina platforms using 150 bp paired-end reads to a similar mean coverage per sample.

Fig. 1: Study overview.figure 1

This flowchart summarizes the study design and analytic workflow.

Identification of novel tandem repeats expanded in schizophrenia

Using EHdn [28] we performed genome-wide detection of tandem repeats that have motifs between 2 and 20 bp and total length greater than the read length (150 bp) across all the samples. Seven samples (5 schizophrenia cases and 2 controls) were removed because they were outliers in terms of genome-wide tandem repeat count (Supplementary Fig. 1), resulting in 4592 genomes for subsequent analyses (1154 schizophrenia cases, 934 controls, and 2504 samples from 1000GP).

We defined a tandem repeat-containing region as a genomic location containing tandem repeats with one or more different motifs overlapping by at least 1 bp. Across all 4592 samples, we identified 21,153 unique tandem repeat motifs in 16,723 distinct regions (Supplementary Table 1). The technical characteristics of our tandem repeat data, including the distributions of motif, motif size, and tandem-repeat-containing regions, were consistent with those reported previously [16, 18] (Supplementary Figs. 24). Tandem-repeat-containing regions were enriched in GC-rich regions (odds ratio [OR] = 1.04, P < 2.2e-16) but depleted in conserved DNA sequences defined by phyloP [31] (OR = 0.015, P < 2.2e-16) and phastCons [32] (OR = 0.208, P < 2.2e-16, Supplementary Table 2, Supplementary Fig. 5). We compared our data to the known simple sequence repeat regions in the human reference genome and to tandem repeat loci reported in ASD [16]. Of the 16,723 tandem-repeat-containing regions reported here, 3447 (20.6%) of them have not been previously reported (Supplementary Fig. 6).

Pathogenic TREs are typically significantly longer than what is observed in the general population. For example, patients with fragile X syndrome typically carry >200 CGG repeats in the 5’ UTR of FMR1, while unaffected individuals generally have 6 to 53 repeats. Following Trost et al. [16], we defined a TRE as a tandem repeat that is much larger than most other members of the study cohort. We applied DBSCAN [29], a non-parametric clustering algorithm, to tandem repeat calls across the 4592 post-QC samples. Outliers for repeat length of each tandem repeat motif were deemed to be TREs. A total of 2890 TREs were identified, and 1559 (53.9%) of those were novel in comparison to the TREs reported in Trost et al. [16].

We deemed TREs in the schizophrenia case-control cohort to be rare when found in less than 0.1% of the 1000 GP samples. This resulted in 603 rare TREs for subsequent burden testing (Supplementary Table 3). We examined the distribution of the count of rare TREs per sample using a stratified histogram (Supplementary Fig. 7). We did not observe any samples that were outliers based on rare TRE count (Supplementary Fig. 7).

Contribution of rare tandem repeat expansions in schizophrenia

To assess the possible functional effects of rare TREs, we used burden testing to evaluate whether rare TREs are enriched in different genomic annotations in schizophrenia cases versus controls. Only autosomal TREs were retained for burden analysis. Our power calculation suggested that we had ≥80% power to detect association signals with burden testing when the aggregated minor allele frequency was 0.01 (i.e. aggregated minor allele count of 20), the genotypic relative risk was ≥4.9, and assuming a type I error level of 1×10−5 (Supplementary Fig. 8). Burden testing was performed using logistic regression models that allowed us to correct for confounding factors that may cause spurious association signals as described in Supplementary Methods. To identify potential confounding variables, we carried out a principal component (PC) analysis of the normalized anchored in-repeat-read counts which suggested the inclusion of PC2, PC3, and PC8 as covariates in the logistic regression models (Supplementary Fig. 9 and Supplementary Table 4). After correcting for these PCs along with sex, we did not find evidence of inflation based on the estimated effect measured by the burden of rare TREs in intergenic regions (see below Genome-wide burden), which is consistent with a prior report [16]. Furthermore, for burden testing in target regions (i.e. global genic regions, gene parts, gene sets, epigenomic annotations), we additionally included global intergenic burden as a covariate in the logistic regression models to correct for any confounding factors that may have not been accounted for by the PC2, PC3, and PC8, and to increase the specificity of the tests in target regions.

Genome-wide burden

We first compared the total number of rare TREs in schizophrenia cases versus controls in three ways: genome-wide, in genic regions only, and in intergenic regions only. When a rare TRE overlapped with any gene/transcript by at least 1 bp, it was defined as genic; otherwise, it was intergenic. We observed a significantly increased burden in schizophrenia for rare TREs (OR = 1.403, P = 0.004) and genic rare TREs (OR = 1.549, P = 0.003; Fig. 2). No statistically significant difference between cases and controls was detected for rare TREs in intergenic regions (OR = 1.182, P = 0.232; Fig. 2). To understand how repeat motifs may influence case-control burden comparison, we stratified genic TREs into CpG-containing (any rare genic TRE that contains the sequence of CG, GC, CGC, CGG, CCG, GCC, GGC or GCG) or non-CpG-containing, and performed burden testing for each group separately. There were 249 CpG-containing rare genic TREs and 694 non-CpG-containing rare genic TREs across 1,154 schizophrenia and 934 control samples. The burdens of both CpG-containing and non-CpG-containing genic rare TREs were significantly increased in schizophrenia cases compared to controls. (Supplementary Table 5).

Fig. 2: Genome-wide burden of rare TREs in schizophrenia.figure 2

The error bars denote 95% confidence interval.

Given our finding that tandem repeats were depleted in conserved DNA sequences, we next compared the total conserved base pairs affected by rare TREs in cases versus controls and observed a modestly elevated burden in schizophrenia (OR = 1.005, P = 0.021; Supplementary Table 6). We further performed burden analysis for coding conserved base pairs and non-coding conserved base pairs separately and found that the elevated burden is significantly contributed by conserved base pairs in non-coding regions (Supplementary Table 6).

We estimated the proportion of samples carrying rare TREs using the residuals of rare TRE counts after controlling for confounding factors. Using this approach, the estimated sample proportions were 11.35% in schizophrenia cases and 4.39% in controls, i.e., a 6.96% excess in schizophrenia (Wilcoxon ranked sum test P = 8.83e-9).

Burden in different parts of genes

Previous studies found that, in ASD, rare exonic TREs and rare TREs affecting splicing were enriched, while de novo tandem repeat mutations were enriched in brain regulatory regions [16, 17]. Motivated by these examples, we first compared the total number of rare TREs in schizophrenia cases versus controls in different parts of protein coding genes (Fig. 3 and Supplementary Table 7). Interestingly, we found a higher burden of rare intronic (OR = 1.436, P = 0.030) and rare splicing TREs (OR = 2.174, P = 0.024), although the excess was not statistically significant after multiple testing correction (BH- corrected P = 0.105 for both).

Fig. 3: Network of genes with rare TREs that are differentially expressed in schizophrenia.figure 3

Interactions between genes were extracted from GeneMANIA. Top 50 scoring connected genes were pulled from pathways or physical interactions from the network data with automatic weighting. Node color: gray for known schizophrenia genes (defined in Methods), red for eQTL genes with rare TREs that are differentially expressed in schizophrenia (CMC genes with eQTLs and rare TREs); and blue for additional connected genes (top 50 genes) pulled from GeneMANIA. Node size is proportional to the strength of predictions for a given gene function.

Burden in brain epigenomic annotations

We then compared the total number of rare TREs in schizophrenia cases versus controls within functional annotations experimentally derived from human brain tissue known to affect gene expression (Supplementary Table 8). These annotations include open chromatin regions from ATAC-seq [33], chromatin binding factor CTCF from ENCODE [34], boundaries of topologically associating domains (TADs) [35] including level 1 sub-TAD boundaries and level 2 sub-TAD boundaries, differential neuronal cell specific histone modifications (H3K27ac and H3K4me3) peaks [36], neuronal frequently interacting regions (FIRE, superFIRE) [37], and neuronal chromatin interactions [37]. The bin size was 40 kb for sub-TADs and FIREs, 10 kb for enhancer-promoter annotations, and for the remaining annotations ranged from 0.126 kb to 880 kb. We leveraged neuronal annotations when available as previous work has indicated neurons as the central cell type harboring genetic risk for schizophrenia [5, 36,37,38]. We observe a higher burden of rare TREs in sub-TAD boundaries – level 1 (OR = 4.977, P = 0.012), sub-TAD boundaries – level 2 (OR = 4.476, P = 0.022), and enhancer-promoter anchors (OR = 1.506, P = 0.026) in schizophrenia cases, but they were not statistically significant after multiple testing correction (BH-corrected P = 0.105, Supplementary Table 8).

Burden in gene sets previously implicated in schizophrenia and neurodevelopmental disorders

In fragile X syndrome, both abnormally expanded CGG repeats and point mutations in FMR1 have been reported [39]. Motivated by this example, we hypothesized that schizophrenia-associated TREs may be enriched in genes known to increase risk for schizophrenia, previously identified via common variant association [5], copy number variation [4], exome sequencing [6], or gene expression studies [40, 41]. Given the known genetic overlap between schizophrenia and ASD [22, 23], we also included risk genes previously implicated in neurodevelopmental disorders via exome sequencing [42,43,44], copy number variation [45,46,47] or tandem repeats studies [6]. Burden testing compared the total number of rare TREs in cases versus controls within each of the 21 gene sets considered (Methods, Supplementary Table 9). We found an excess of rare TREs in schizophrenia cases in brain-expressed genes that are differentially expressed (DEGs) between schizophrenia and controls as determined by the Common Mind Consortium (i.e. CMC brain DEGs; OR = 6.63, P = 0.005, BH-corrected P = 0.063), the genes with expression quantitative trait loci (eQTLs) in human brain as identified by PsychENCODE Integrative Analysis (OR = 1.73, P = 2.14e-3, BH-corrected P = 0.063), as well as in the SynGO ontology category postsynapse process (OR = 27.94, P = 0.004, BH-corrected P = 0.063). These are further supported by the fact that many of the CMC brain DEGs with rare TREs and the genes with brain eQTLs overlapping with CMC brain DEGs are highly connected with other known schizophrenia genes and they are involved in similar functions in synaptic or neuronal signaling (Fig. 3). While examining specific genes within the three significant gene sets, we found that the rare TREs mostly affected introns (Supplementary Table 10).

To further explore the potential mechanisms by which TREs may regulate the underlying genes, we compared the level of mutational constraints (determined by the observed over expected number of loss of function variants in gnomAD [48]) within CMC DEGs, within brain eQTL genes and in all genes with and without rare TREs found in schizophrenia cases. We found that rare TREs impacting CMC brain DEGs were significantly more constrained than other CMC brain DEGs (Fig. 4a) and rare TREs impacting genes with eQTLs are significantly more constrained than other CMC brain DEGs (Fig. 4b). We also found that the genes with rare TREs found in schizophrenia cases were significantly more constrained than genes without rare TREs found in schizophrenia cases (Fig. 4c).

Fig. 4: Constraint scores for genes with and without rare TREs.figure 4

Constraint scores (extracted from gnomAD’s observed/expected (o/e) upper bound LOEUF values) of genes with rare TREs are compared against those of the other genes without rare TREs in schizophrenia. A Only genes differentially expressed between schizophrenia and controls as determined by the Common Mind Consortium are compared, B only genes with eQTL are compared and (C) comparison is done for all protein coding genes. Box plots show Q1-1.5×IQR, Q1, median, Q3 and Q3 + 1.5×IQR. P-values reported were calculated from one-sided Wilcoxon rank-sum test assuming lower gnomAD o/e upperbound in genes with rare TREs. In all three categories assessed, the genes with rare TREs found in schizophrenia were more constrained (i.e., had on average lower LOEUF values) than the genes without rare TREs found in schizophrenia.

Our study is underpowered to detect individual loci of rare TREs associated with schizophrenia at genome-wide significance (Supplementary Fig. 8). We compared the individual rare TRE loci from our data to the 39 top-ranking rare TRE loci identified by Mojarad et al. [18]. Nine of the 39 loci were found in the schizophrenia cases from this study (Supplementary Table 11). To explore whether pleiotropic TRE loci may exist across ASD and schizophrenia, we compared the individual rare TRE loci from our data to the top 57 rare TRE loci identified by Trost et al that were in the top two gene sets enriched in ASD and had a higher frequency in individuals with ASD than siblings without ASD [16]. 14 of the 57 ASD loci were found in the schizophrenia cases from this study (Supplementary Table 11).

Confirmation of EHdn detected tandem repeats

The detected tandem repeats from EHdn were validated using multiple complementary methods by Trost et al, which showed that the validation rate of detected tandem repeats was 77% compared to long-read sequencing [16]. In this study, we used an identical pipeline as Trost et al. [16], so we expected the same rate of validation in our calls. Furthermore, we selected two TRE regions for validation based on (1) high odds ratio from burden testing, (2) located close to genes of interest (i.e. loss of function intolerant genes and/or known risk genes for schizophrenia), and (3) absence of complex repetitive sequences nearby. We attempted to confirm TREs in the chosen regions by Integrative Genomics Viewer (IGV) visualization of WGS reads and gel electrophoresis size separation of PCR products of the regions of interest. All TREs detected were confirmed (Fig. 5 and Supplementary Fig. 10).

Fig. 5: Confirmation of EHdn detected tandem repeats.figure 5

A, B, Images of the gel electrophoresis showing bands that correspond to the expanded alleles in schizophrenia (SCZ) cases and the unexpanded allele in the reference sample NA12878. A 100 bp ladder is shown for size reference.

Replication

We sought replications for significant associations detected in our Swedish data in an independent dataset from Canada that consisted of 252 schizophrenia cases of European ancestry and 222 ancestry-matched controls [18]. We excluded post-synapse genes from final testing because the number of rare TREs observed in this gene-set in the replication samples was too small (<5). For each of the remaining significant associations, we conducted an association analysis in the replication samples and then a meta-analysis of the Swedish and independent Canadian samples for the total sample size of 1,436 schizophrenia cases and 1,156 controls. As shown in Supplementary Table 12, there is a high concordance between discovery and replication samples and all significant associations detected in our discovery samples have been replicated.

留言 (0)

沒有登入
gif