Data for this cohort, comprising of individuals with autism and their families, was collected from three major hospital sites in Korea: Seoul National University Bundang Hospital (SNUBH) served as the primary center, overseeing the study at Soon Chun Hyang University Hospital Bucheon (SCHBC) and Seoul Child Hospital (SCH). All recruited families were approved by the ethics committee of SNUBH, SCHBC, and SCH Institutional Review Boards (IRB) (SNUBH: B-1703–388-303 and B-2108–700-107; SCHBC: SCHBC 2018–04-020 and SCHBC 2022–04-016; SCH: P01-201908-BM-02 and P01-202111–21-003). This study adhered to the ethical standards of the Helsinki Declaration and informed consent was obtained from all study participants. We aimed to recruit participants with diverse clinical features and to achieve a balanced sex ratio among individuals with autism.
A note on terminologyIn this paper, we used the term “individuals with autism”, “autistic individuals”, or “autism cases/probands” for individuals clinically diagnosed with autism, according to Rolland et al. 2023 [19]. We used the term “non-autistic siblings and parents” to refer to siblings and parents of individuals with autism, who do not have clinical features which meet the diagnostic criteria. For the general population, we used the term “control population.” We must acknowledge that several of these individuals in the general population may also present with autism.
SamplesFor the Korean autism cohort, we collected data from Korean families with at least one child diagnosed with autism by clinicians. We collected DNA samples from whole blood and clinical phenotypes from participants. All phenotype information was cross validated by clinical specialists. The collected data were fully anonymized and handled in accordance with the biorepository’s standard operating procedures. A total of 1400 families (n individuals = 3730) were collected and their clinical data were used for phenotype analysis. Of these, WGS data was generated for 673 families, including 696 individuals with autism, 213 non-autistic siblings and 1346 parents (Additional file 2: Table S1) used for de novo and polygenic score analysis. Compared with the previous release [20], 39 families were newly added in the Korean autism WGS dataset. Of the 673 families, 21 (3.1%) were multiplex families with more than two autistic children.
For the Korean general population WGS data, we accessed the Korean Genome and Epidemiology Study (KoGES; The National Project of Bio Big Data) genomic resource. The KoGES study had collected data from clinically non-diagnosed adults, aged > 40 years from Ansan and Ansung [21]. We downloaded a joint VCF file and clinical data from The National Project of Bio Big Data (www.nih.go.kr/biobank/). A total of 2500 participants (1272 female and 1228 male individuals) were used for polygenic score analysis in this study (Additional file 2: Table S1).
For the SSC, and SPARK cohorts, we downloaded a joint VCF file and clinical data from SFARI Base (https://sfari.org/sfari-base). For the SSC cohort [14], we excluded twin and ancillary collection and employed only the simplex collection. We used total of 1855 families of European ancestry (n individuals = 6976) for genetic burden test and all fully phasable 4318 trios for de novo gene discovery analysis from the WGS data (v2019-05–12). We used phenotype data (v15.3) from a total of 2644 families (n individuals = 10,456), for subsequent phenotype analysis (Additional file 2: Table S1). Similarly, for the SPARK cohort [15], we used total of 2434 families of European ancestry (n individuals = 8863) for genetic burden test and all fully phasable 5683 trios for de novo gene discovery analysis from the WGS data (v1.1). We additionally utilized fully phasable 25,325 trios for de novo gene discovery analysis from SPARK WES data (v2). For phenotype analysis, 108,266 families (n individuals = 149,547) from SPARK phenotype data (v9) were used (Additional file 2: Table S1).
Sequencing dataFor Korean autism cohort, DNA was obtained from whole blood of the subjects and sequenced on Illumina Hiseq X at sequencing read depth 30x. We processed WGS data using the Illumina DRAGEN germline pipelines (v4.0.3) [22], and variant calling for the human reference genome version GRCh38. Multi-sample joint genotyping was conducted using iterative gVCF genotyper.
For KoGES, DNA was extracted from whole blood of subjects and sequenced on Illumina NovaSeq 6000 with an average of 30 × read depth. Sequencing reads were aligned to human reference genome GRCh38. Subsequent processing followed GATK Best Practices (v4.2.4.1). After joint genotyping of individual gVCF, variant quality scores were recalibrated by VQSR.
For SSC, DNA was obtained from whole blood of the subjects and sequenced on Illumina Hiseq X10. Sequencing reads were aligned to human reference genome GRCh38. Subsequent processing of the alignments followed GATK Best Practices (v3.5). After joint genotyping of individual gVCF, variant quality scores were recalibrated by VQSR and low-quality genotypes (GQ < 20; DP < 10) were converted to missing genotypes. Only variants with “PASS” entries in the FILTER column were used for the downstream analysis. For SPARK, DNA was obtained from saliva of the subjects and prepared with PCR-free methods and sequenced on Illumina NovaSeq 6000. Sequencing reads were aligned to human reference genome GRCh38. Subsequent processing of the alignments followed GATK Best Practices (v3.5). Joint genotyping of individual gVCF was conducted by GLnexus (v1.4.1).
Quality control for samples and variantsQuality control (QC) for samples and high-quality (HQ) variants was conducted by Hail 0.2 (https://hail.is/) and Peddy [23]. For sample QC, we checked the distribution of SNPs, INDELs, transition/transversion (ti/tv) ratio, genotype quality, and genotype depth in a sample level to see if there are outliers. We also calculated relatedness between individuals in a dataset and imputed biological sex and ancestry. For Korean autism and KoGES datasets, all samples passed the sample QC (Korean autism n = 2255; KoGES = 2500). For SSC and SPARK datasets, we excluded sex/pedigree mismatched samples and used only European ancestry.
For Korean autism dataset, we retained variants labelled as “PASS” in the DRAGEN hard-filter and excluded those occurring within low complexity regions (LCR). Additionally, multiallelic sites were split into biallelic sites. Prior to this, local allele expressions, including allele depth (AD) and localized phred-scaled genotype likelihood (LPL), were transformed into a global format. For homozygous reference calls, AD were substituted with an array filled with read depth (DP) for the reference alleles, with zeros for other alleles, and LPL were replaced with “NA.” Localized alternative alleles (LAA) were converted to local alleles array (LA) by adding zero to the first element of LAA. Furthermore, the maximum LPL value was added to the empty elements of LPL to convert them into PL. Following the multi-allele split, PL values for homozygous reference calls, initially marked as “NA,” were annotated with an array consisting of 0, genotype quality (GQ), and DP multiplied by 3. We also removed variants with allele length ≥ 50.
For detecting HQ rare variants, quality metrics in the whole filtering pipeline were optimized, according to the previously established practice [24]. We filtered variant calls with following cutoffs; heterozygous SNPs with QUAL ≥ 7.5, GQmean ≥ 36, and DPmean ≥ 34, 0.275 ≥ allele balance (AB) ≥ 0.725; heterozygous indels with QUAL ≥ 10.51, gDP ≥ 3, 0.214 ≥ AB ≥ 0.786; homozygous alternative SNVs with QUAL ≥ 20.3, AN ≥ 4312, AB ≥ 0.905, GQmean ≥ 15.7, DPmean ≥ 12, GQ ≥ 9, gDP ≥ 11; homozygous alternative indels with QUAL ≥ 24.78, AN ≥ 3504, GQmean ≥ 29, DPmean ≥ 11.55, GQ ≥ 1, gDP ≥ 5, AB ≥ 0.905. We further filtered variant calls with call rate < 10% and a Hardy–Weinberg equilibrium P < 1 × 10−12.
For detecting HQ common variants, we filtered variant calls using following criteria; GQ ≥ 20, DP ≥ 10, AB 0.2–0.8 for heterozygous calls and AB ≥ 0.95 for homozygous calls, call rate ≥ 95%, Hardy–Weinberg equilibrium P ≥ 1 × 10−6. We then utilized variants with internal unrelated AF more than 0.05.
For SSC and SPARK dataset, we excluded variants in LCR, split multi-allelic sites, and removed INDELs with allele length ≥ 50. To filter HQ common variants, we used the same filtering pipeline with Korean autism dataset. For SSC, we applied further filtering for HQ rare variants with QC metric cutoffs, referring to the previous work [24].
Identification of de novo variantsDe novo variants (DNVs) were called by the Hail built-in de_novo() function in the annotated variants with the internal allele frequency (AF) less than 0.001 and gnomAD (v3.1) AF less than 0.001 in the non-psychiatric disease subset. We used the default cutoffs of Hail de_novo() function for further filtering: ABparent ≤ 0.1, ABchild < 0.3, DPchild/(sum of DPparents) ≥ 0.3 and GQ ≥ 20.
For the Korean autism data, we modified the method to calculate the de novo probability [25] in Hail de_novo() function considering the partial origin in which DNV occurred. The modified approach calculates the probability that a DNV has occurred, together with the probability that it was inherited from parents. We obtained the de novo probability, for each of the following five conditions:
DNV was from mother. … a
DNV was from father. … b
The genotype of the mother was not homozygote, and the alternative allele was inherited. … c
The genotype of the father was not homozygote, and the alternative allele was inherited. … d
The genotype of the child was not heterozygote. … e
Among the probabilities obtained, the ratio of the probability that the variant represents a true DNV (referred to as the de novo probability) was calculated as follows:
$$De\ novo\ probability=\frac$$
Another consideration was the GQ scale of DRAGEN. The DRAGEN uses an algorithm that can reduce errors in actual data where correlations between reads are observed [22]. Therefore, DRAGEN genotypes have lower distribution of GQ than GATK. As such, we tried to adjust the threshold of de novo probability to rescue false-negative calls whose confidence was low due to the lower GQ distribution. To determine the optimal cutoff, we measured the number of obtained DNVs lowering the threshold from 0.5 to 0.05 and set the de novo probability threshold to 0.1. We further filtered DNVs found in less than five individual cases. This step identified 47,269 autosomal DNVs in individuals with autism (n = 696) and 14,309 in non-autistic siblings (n = 213) (Additional file 3: Table S2). Consistent with the previous genetic studies on autism [26,27,28,29], there was a positive correlation between paternal age and the number of DNVs for each sample (0.13 DNVs per paternal age month, P < 2.2 × 10−16) (Additional file 1: Fig.S1A).
For the SSC and SPARK data, we filtered high/medium confidence DNVs using the original cutoffs for the de novo probability. We excluded samples that presented with excessive number of DNVs due to pedigree errors. We filtered DNVs with internal AC = 1, unless the DNVs were identified in monozygotic twins. For SPARK, we further filtered DNVs with AB < 0.8. In the SSC data, we identified 119,785 autosomal DNVs in autism cases (n = 1838) and 95,874 in non-autistic siblings (n = 1504). In SPARK, we identified 167,623 autosomal DNVs in individuals with autism (n = 2532) and 103,861 in non-autistic siblings (n = 1591). We observed a positive correlation between paternal age and the number of DNVs per sample in both datasets (SSC – 0.13 DNVs per paternal age month, P < 2.2 × 10−16; SPARK – 0.13 DNVs per paternal age month, P < 2.2 × 10−16) (Additional file 1: Fig. S1A).
Variant annotationHQ variants were annotated with Hail vep() function with Ensembl variant effect predictor (VEP) version 109.3. With the most severe consequence term annotated by VEP, we classified variants into three categories, PTV, missense variant (MIS), and synonymous. PTV included the “frameshift_variant,” “splice_acceptor_variant,” “splice_donor_variant,” and “stop_gain” variants with high confidence by LOFTEE plugin [30] with no LOFTEE flags other than “SINGLE_EXON.” The “missense_variant,” “stop_lost,” “start_lost,” and “protein_altering_variant” were labelled as missense. Lastly, we defined the “synonymous_variant,” “stop_retained_variant,” and “incomplete_terminal_codon_variant” as synonymous.
Computation of PSUsing HQ common variants, we calculated the PS for autism. For autism, we used two European-ancestry Genome Wide Association Study (GWAS) summary statistics, one from Grove et al. [31], which includes SSC and iPSYCH data (n = 46,350), and the other which includes only the iPSYCH dataset (unpublished) (n = 58,948). We used the former base data for calculating PS in the Korean and SPARK cohorts and the latter for calculating PS in SSC, as the overlaps between the target data and the base data could inflate the polygenic signal.
To match the SNP format between the input data used for PS calculation, we conducted SNP matching for the target data and GWAS summary statistics. Prior to SNP matching, we carried over the target data from GRCh38 to GRCh37 to match the genome build of GWAS summary statistics. Then SNP matching was conducted as described below. We united the allele representation as “1st to 13th nucleotide + (length of allele – 13)” when the allele was longer than 13 bp. The INDELs which localized in the same locus but were reverse of each other, as well as ambiguous SNPs, , , , , were excluded. We then matched those with the SNPs list from the linkage disequilibrium (LD) reference comprised of HapMap3 SNPs from UKBB European individuals, provided from PRScs [32]. We used the European LD reference as it more closely matched the ancestry with the GWAS summary statistics. For target data which had different ancestry other than European, Korean autism, and KoGES, we harmonized AF by the chi-square test. If the AF difference of one SNP between the target and the reference was significantly different from the mean by more than 1 SD of all matched SNPs, the SNP was excluded.
We computed the PS using four different calculation methods: PRScs [32], SBayesR [33], LDpred2 [34], and PRSice [35]. The PRScs, SBayesR, and LDpred2 calculate PS by implementing Bayesian shrinkage of beta effect size of SNPs weighed by LD, whereas PRSice calculates PS by using several SNPs that pass the optimal P-value cutoffs. Given that the adjustment for polygenic risk using PRScs improved the prediction the most [36], we used PRScs as the primary calculation tool. Parallel computation of 22 autosomes was performed with default parameter: gamma distribution for local shrinkage (1, 0.5) and phi value for global shrinkage 1.0 × 10−2. The results of PS were consistent across four different methodologies (Additional file 1: Fig. S2). The correlation coefficients were especially high between shrinkage methods (PRScs, SBayesR, and LDpred2), on average 0.8, but between shrinkage methods and P-value cutoff method the correlations were lower than that, 0.6.
Sex-specific gene analysisTo identify autism-associated genes, we ran transmitted and de novo association gene discovery (TADA) analysis, Bayesian association algorithm [6, 37]. We used de novo PTVs and damaging MIS in all genes from full phasable trios with autistic child in Korean WGS (n = 696), SSC WGS (n = 2380), SPARK WGS (n = 3496), and WES (not overlapped with WGS samples; n = 17,473). We performed TADA in females (total 4885) and males (total 19,160) respectively and identified female genes and male genes. Next, we conducted Gene Ontology set enrichment test and visualized network of enriched pathways using clusterProfiler package (v4.11.1) in R.
Clinical phenotype dataTo investigate sex differences in clinical features, we assessed core symptoms including total symptom severity (summed score of social communication deficits and restricted/repetitive behaviors), social communication deficits, and restricted/repetitive behaviors and cognitive/adaptive function. Higher phenotypic scores of autism core symptoms reflect a clinical outcome with more distinct features of autism, whereas lower phenotypic scores of cognitive/adaptive functions indicate more impaired cognitive/adaptive ability.
Total symptom severityOverall severity of autism-related symptoms was assessed using a variety of instruments. The Autism Diagnostic Observation Schedule-2 (ADOS-2) [38] was administered to children and their siblings, utilizing different modules tailored to each participant’s age and verbal language ability. For comparison across the different modules, the total calibrated severity scores (CSS) were used for analysis. For the Korean autism cohorts, we used the Korean-translated version [39] of ADOS-2 that was approved by the Western Psychological Services. Additionally, caregivers completed the social responsiveness scale (SRS) [40] and the Social Communication Questionnaire (SCQ) [41, 42], which measure the overall severity of autistic symptoms. For the SRS, T-scores were used, while for the SCQ, scores from both the current and lifetime versions were included in the analysis. Parents also completed self-reported questionnaires, including the Autism Quotient [43] and the broad autism phenotype questionnaire (BAPQ) [44]. Across all instruments, higher scores were indicative of greater severity in autism-related symptoms.
Social communication deficitsTo evaluate social communication skills, social affect CSS scores from the ADOS-2 (ADOS SA) were utilized, alongside the autism diagnostic interview, revised (ADIR) [41] social interaction (ADIR A) and communication domains (ADIR B). The ADI-R, consisted of a semi-structured interview with caregivers, was administered by trained professionals who rated each question item. Based on the diagnostic algorithm, four domain scores were aggregated. Considering each participant’s developmental trajectory, the communication domain within the ADI-R was further divided into subdomains, specifically catering to individuals with and without fluent verbal communication (ADIR B verbal and non-verbal). Consistent with the scoring approach of the ADOS-2 CSS, higher scores on the ADI-R indicated greater difficulties in social communication.
Restricted interest and repetitive behaviorRestricted interest and repetitive behaviors (RRB) were measured using both the RRB CSS from the ADOS-2 and the RRB subdomain of the ADI-R (ADIR C).
Cognitive abilityIntelligence was assessed using the Wechsler Preschool and Primary Scale of Intelligence (WPPSI) [45], Wechsler Intelligence Scale for Children (WISC) [46], and Wechsler Adult Intelligence Scale. For individuals with limited verbal language abilities, the Leiter international Performance Scale-Revised (non-verbal IQ) [47] was administered to measure non-verbal IQ.
Adaptive behaviorsThe Vineland adaptive behavior scale (VABS)-II [48] was utilized to assess adaptive functioning in children and their siblings. The VABS-II is a caregiver-report questionnaire covering various domains, with each item inquiring whether the child can perform the specified task. Four domain scores—socialization, communication, daily living, and motor skills—along with a composite score, were calculated and standardized based on age-matched normative control individuals. Lower scores in each of these domains were indicative of developmental delays.
Statistical analysesFor the DNV association tests, we prioritized DNVs with loss-of-function observed over expected upper bound fraction (LOEUF) score [30] for PTV and missense badness, PolyPhen-2, and constraint (MPC) score [49] for MIS. The de novo PTV with LOEUF < 0.37, and MIS with MPC ≥ 2 were used for association test. We estimated the power in DNV association test for de novo PTV and MIS in Korean autism cohort and compared the result with that from a previous report [7] in more than 20,000 samples in European-ancestry autism cohorts. Although we lacked the statistical significance threshold during the DNV analysis for the Korean autism cohort, power estimation revealed that this was likely due to a limited sample size (Additional file 1: Fig. S3; Additional file 3: Table S2). The burden of de novo PTV and MIS was compared between individuals with autism and non-autistic siblings, and between autistic females and males, using a one-sided binomial test. For comparing the percentage of samples with de novo PTVs between individuals with autism ± ID, and siblings, we used the Fisher’s exact test.
For polygenic burden association tests, we compared the PS across groups and sexes, using two-sample t tests. To assess the relative difference of polygenic burden, we compared the OR and P-value from the logistic regression as follows:
Group (cases with certain phenotype severity versus siblings) ~ PS.
We compared clinical phenotypes across groups and sexes, using two-sample t tests and two-way ANOVA tests, followed by Tukey’s test for multiple comparisons. To ensure the sex differences in clinical phenotypes in cases, siblings, and parents, we adjusted P-values from two-sample t tests with the number of clinical features for each domain.
留言 (0)