This study is based on the iPSYCH2015 case-cohort sample29, an expanded version of iPSYCH2012, which has been previously described in detail30. In brief, the base population is defined as all 1,657,449 singleton births that occurred in Denmark between May 1, 1981, and Dec 31, 2008, who were alive and residing in Denmark on their first birthday and had a mother registered in the Danish Civil Registration System31. From the base population all persons who received a diagnosis of a major psychiatric disorder (as specified below) no later than Dec 31, 2015, were included in the case sample, N = 92,531 individuals. Then, a randomly selected population-representative cohort of N = 50,615 individuals was drawn from the base population, including 3030 who overlapped with the case sample. Individual diagnosis sample counts are as follow: SSD (ICD10 F20–F29; n = 16,008), MDD (ICD10 F32–F33 and ICD 8 296.09, 296.29, 298.09, and 300.49; n = 37,555), ASD (ICD10 F84; n = 24,975), or ADHD (ICD10 F90; n = 29,668).
We also assessed three other brain disorders; intellectual disability (ID), epilepsy, and Tourette syndrome (TS), with prior evidence of association with NRXN1 deletions16,17,18, using information on hospital diagnoses that had been obtained through the Danish Psychiatric Central Research Register32 and the Danish National Patient Registry33 for other iPSYCH2015 studies. The diagnostic codes used to identify individuals with these disorders were as follows: ID (ICD10: F70-F79; ICD8: 311-315), epilepsy (ICD10: G40; ICD8: 345 (excluding 345.29)), TS (ICD10: F95.2).
Supplementary Table 2 provides carrier count for each diagnosis, as well as a subset by subcohort (iPSYCH2012 or iPSYCH2015i) and gender.
Genotyping was performed using Illumina microarrays and has been described elsewhere30. Notably, the genotyping was performed on dried blood spot samples taken at birth. iPSYCH2012 and the additional extension (iPSYCH2015i) were genotyped using two different arrays, PsychArray version 1.0 and Global Screening Array version 2 (GSA), respectively. B allele frequency (BAF) and logR ratio (LRR) values were extracted using GenomeStudio and samples with a genotyping call rate below 95% were excluded.
CNV calling and pre-processingCNVs were called using PennCNV34 as described in our previously published CNV calling and processing protocol35. All steps of the calling pipeline were run using the Singularity container provided in the protocol. In brief, the intensity files were filtered to include only biallelic autosomal SNPs mapping uniquely to the Haplotype Reference Consortium (HRC) hg19 reference map36, with a minor allele frequency of at least 0.1%, which yielded 280,700 and 509,754 probes for the PsychArray and GSA, respectively. Next, PennCNV calls were obtained with the script “detect_cnv.pl” setting a minimum number of probes (--minsnp) at 5, and the minimum length (--minlength) at 1000 bp. We then merged adjacent calls, with the PennCNV script “clean_cnv.pl” using the settings “--fraction 0.2 --bp” whereby two calls are merged if the gap between them corresponds to less than 20% of the combined length (in base pairs) of the calls. After CNV calling, we excluded samples with high levels of noise from the analysis. Thus, samples were excluded if they had either a LRR standard deviation value ≥ 0.35, BAF drift ≥ 0.005 or |GCWF | ≥ 0.02.
The locus of interest was defined as the NRXN1 gene in Ensembl8 GRCh37 (https://grch37.ensembl.org/Homo_sapiens/Info/Index) plus 0.5 Mbp upstream and downstream of the gene boundaries (chr2:49 645 643-51 759 674). Any CNV call overlapping the region by at least 0.1% of its length was selected for visual validation using the function “select_stich_calls()” from the R package QCtreeCNV35; this step also removed CNV smaller than 10 SNPs. Visual inspection was performed independently by two analysts as already described35. The boundaries of true CNVs were manually adjusted if necessary and any discordant call between the analysts was re-evaluated in a final joint session.
CNV analysisThe genomic coordinates of NRXN1 exons and transcripts were extracted using Ensembl8 GRCh37 (https://grch37.ensembl.org/Homo_sapiens/Info/Index). We decided to focus on protein-coding transcripts only and thus selected all 9 transcripts with a protein match in UniProt37 (https://www.uniprot.org/), yielding a total of 41 unique exons.
Under the assumption that exons mapping close to each other are likely to be deleted by the same CNVs, we investigated if any larger pattern was present at the level of the whole gene. We computed a genomically ordered correlation matrix across all exons, defined as an N × N matrix where N is the number of exons and the cell xy is the number of times a CNV affecting exon x also affects exon y.
CNVs are not equally distributed across the locus. We explored this topic using an IOU matrix, defined as an NxN matrix where N is the number of CNVs (381) and the cell xy is the IOU (Intersection Over the Union) score for CNVs x and y. IOU is 1 for two identical segments and ranges between 0 and 1 for any two overlapping segments, while non-overlapping segment pairs have an IOU range from 0 to approaching an asymptote at −1 the farther apart the two segments are. We then subgrouped exons in “alpha” and “beta” regions, based on Fig. 1d and previous literature38, corresponding to exons ENSE00001682911 to ENSE00002460080 (beta), and exons ENSE00002453754 to ENSE00001547151 (alpha). For the purpose of the secondary analysis (Table 1), deletions affecting exons from both groups were assigned to “alpha”.
Fig. 1: NRXN1 deletions similarity matrices, and NRXN1 correlation matrix. Note that the NRXN1 gene is encoded on the reverse strand, meaning the alpha promoter region (5′ of the gene) is shown on the right in this figure (see panel c for a breakdown of the gene structure).a Similarity heatmap for all deletions in the neurexin locus. Similarity is measured as IOU (intersection over the union), as described in the methods. Each row represents a deletion. Deletions are ordered on the x-axis based on the genomic position of the respective centre. Note that the scale is not linear as CNVs are not distributed equally across the locus. b Positional similarity for intronic deletions. This makes more evident the large group of very homogeneous deletions (marked with the orange bar on the x-axis). This group is referenced as segregating in the main text. c Distribution of the centre position for all exonic deletions in the NRXN1 gene locus. A schematic of the main gene isoform is aligned below the x-axis. The green and red bars mark the two exonic groups described in (d). d Exon correlation matrix. Exons are ordered based on genomic location. Note that the scale is not linear as exons are not distributed equally across the locus (see c). The red bar marks the exons in the “alpha” group and the green in the “beta” group. e A different view on the NRXN1 gene, the top blue graph shows all exons used in the study, while the bottom shows the top isoform.
Table 1 NRXN1 deletions and associated risk of psychiatric disordersSegregating deletion analysisThe coordinates of the segregating NRXN1 deletion found in Rujescu et al.25 were lifted over from hg18 to hg19 using the online tool LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver).
To identify SNPs in high linkage disequilibrium (LD) with the segregating deletion, we performed an association analysis (using the “--assoc” command in PLINK39,40 with default settings, Supplementary Fig. 2) where we compared the 100 identified carriers with 5000 randomly drawn non-carriers, across all SNPs with MAF > 0.01 and info >0.95 mapping on the entire chromosome 2, using an imputed genotype dataset of the iPSYCH201541. We then pruned the resulting SNPs with the following settings --clump-p1 0.00001 --clump-r2 0.8 --clump-kb 1000000.
The phased genotypes of the top 10 SNPs (shown in Supplementary Table 1) were imported in R. Here we constructed all possible haplotypes of length between two and five SNPs and tested their association with the deletion carriers using the R function fisher.test(). The haplotypes with an OR ≥ 2 and a p-value ≤ 0.0001 were further tested using the function roc() from the R package pROC42 to get the AUC (Area Under the Curve) value.
Statistical analysisWe derived population-based prevalence (with CI95%) for the different subgroups of NRXN1 deletions using the svydesign() and svyciprop() functions from the R package survey43, with finite population correction (FPC) to account for oversampling of cases in iPSYCH2015.
Briefly, we divided the post-QC number of cases (77,655) and individuals from the random population subcohort (43,311) with the total number of corresponding individuals in the source population (90,218 and 1,657,449) to derive the sampled population fractions; 0.85068 (100% of cases minus the ones failing genotype or excluded in QC) and 0.02613, respectively. Samples from overlapping individuals (cases-in-subcohort) were assigned the case population fraction (0.85068).
We calculated the corresponding prevalence of exonic NRXN1 deletions in the UKB directly from carrier counts provided by Crawford et al.44 and derived CI95% as follows (R pseudocode): CI95% = qbeta(c(0.05/2,1-0.05/2), nCarrier + 0.5, nTotal-nCarrier + 0.5), where nCarrier and nTotal indicate the number of carriers and the total number of assessed samples (421,268), respectively.
We compared the prevalence of exonic deletions in iPSYCH2015 and UKB with Welch’s test of the difference between two means assuming unequal variance. Briefly, we defined the difference; d = (|log(piPSYCH/pUKB)|), the standard error of the difference; SEd = √(SEiPSYCH2 + SEUKB2), and the p-value; P = 2*(1-pnorm(d/SEd)), where piPSYCH and SEiPSYCH, and pUKB and SEUKB, indicate the prevalence and standard error of prevalence in iPSYCH2015 and UKB, respectively.
To estimate the risk of index psychiatric disorders associated with NRXN1 deletions we ran a logistic regression analysis using gam() from the R package mgcv45. We used age, sex (at birth) and SNP array type as covariates, with a smoothed function to model the effect of age using the mgcv function s(). In each association, we included all cases for the phenotype of interest and all controls, defined as individuals not having any of the index diagnoses. For the later-onset disorders SSD, MDD and SCZ, we only included those controls who were at least as old as the youngest case. Multiple testing correction was applied to the table containing the results of all three analyses (Table 1) using the R function p.adjust(method = “fdr”). We then compared risk estimates with those reported in published case-control studies (in each case the study applying the largest case-control sample size for the respective disorder; only considering studies that controlled for genotyping array, when including samples genotyped on different arrays) using a Welch’s test in a similar way as described above for prevalence estimate comparison. We performed two additional sensitivity analyses, we ran the first model on the phenotype schizophrenia (ICD10, F20) instead of SSD, and we ran the last model on the European unrelated subset of iPSYCH201541.
To estimate the risk of the three other brain disorders associated with NRXN1 deletions we fitted a logistic regression model using case status for each of the four iPSYCH disorders (ADHD, ASD, MDD and SSD) as covariates in addition to age, sex (at birth) and SNP array type.
SoftwareAll analyses were performed on HPC running CentOS Linux 7. PLINK39,40 version 190b6.21, R46 version 4.0.5 and VCFtools47 0.1.17 were installed via the conda package manager (https://anaconda.org/). PennCNV34 version 1.0.5, bcftools48 version 1.14, htslib49 1.14 are a part of the container we used for the CNV calling described in the previous section, available on Docker Hub (https://hub.docker.com/r/sinomem/docker_cnv_protocol). For the analysis and the figures, we used the following R packages: data.table50, pROC42, survey43, mgcv45 and ggplot251.
Ethics statementThis study is in full compliance with all relevant ethical regulations including the Declaration of Helsinki. Access to the data and its use for research purposes was granted by The Danish Scientific Ethics Committee, the Danish Health Data Authority, the Danish Data Protection Agency, and the DNSB Steering Committee. For this study, the Danish Scientific Ethics Committee has, in accordance with the Act on Research Ethics Review of Health Research Projects (in Danish: Komitéloven), waived the need for informed consent in biomedical research based on existing biobanks.
留言 (0)