Patterns and distribution of de novo mutations in multiplex Middle Eastern families

Cohort description and QC

A total of 146 families (n = 645 individuals) were enrolled in this study, of which 47 (32%) reported a history of consanguinity (first- or second-degree parental relatedness). Parental ages at conception varied as follows: fathers (median: 34 years old, range: 21–50) and mothers [29, 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44]. Family sizes differed across the cohort (median: 2 offspring, range: 1–10), with more than 70% of families being multi-offspring (Supplementary Fig. 1). All participants underwent WGS to an average depth of 31.6X (Supplementary Fig. 2), with variants aligned and called as described in the Methods. After QC, a total of 353 unique trios could be established from the cohort (one child plus both parents), which were selected for DNM calling and annotation. The trios included 190 males and 163 females, 45.3% of whom had an underlying rare disorder. A summary of these statistics is provided in (Table 1).

A combinatorial approach to calling DNMs and calculating DNM rate

To ensure the comprehensive ascertainment of variants and to improve sensitivity and specificity, we used a combination of three different approaches to identify de novo mutations with a three-step workflow (details in Methods, Supplementary Fig. 3). In total, we identified 24,808 high-quality de novo single-nucleotide variants (SNVs) and 2360 INDELs in 353 trios (Table 1 and Supplementary Fig. 4), with the median genome containing 70 de novo SNVs (average = 70.3) and 6 de novo INDELs (average = 6.7). Taking into consideration the effective genome coverage of around 2.797 billion base pairs (see Methods) and genomic diploidy, we calculated a median DNM rate of 1.25 × 10−8 and 1.07 × 10−9 per base per generation for SNVs and INDELs, respectively. These rates are consistent with previous reports [4, 5, 30, 36].

Effect of parental age on DNM count and differences across families

To determine the parental contribution to DNMs, we sought to determine the parent-of-origin where possible using read-based phasing. Given the 150-bp read length, we were able to phase 13% (range: 4.2–25%) of the de novo variants on average (Supplementary Fig. 5). Among the 3537 variants phased, 2817 were paternal in origin and 720 were maternal. This corresponded to a paternal-to-maternal DNM phasing ratio of ~3.91:1, in line with prior estimates [30].

We then checked for correlations between parental age and DNMs across the cohort (Fig. 1A). Although we observed a significant increase in DNMs by 1.36 per year of paternal age (Pearson correlation; 95% CI: 1.11–1.61, p = 1 × 10−22), we observed a weak correlation with maternal age, with an increase of 0.33 DNMs per year (Pearson correlation; 95% CI: 0.11–0.56, p = 3.8 × 10−3). These results agree with previous findings that show parental age effects on the DNMs found in offspring [5, 6, 9, 37, 38].

Fig. 1figure 1

Parental age effects on DNM counts. A Correlation between parental age at conception and number of phased DNMs normalized to the total number of phased DNMs in each individual, performed across all families. The blue regression line (slope = 1.36, 95% CI = 1.11–1.61) shows paternally phased DNMs, while the red line (slope = 0.33, 95% CI = 0.11–0.56) shows maternally phased DNMs. B Paternal age is plotted against the number of total autosomal DNMs for individuals in large families (number of offspring ≥4, total = 21 families), with each family analyzed separately. Families were plotted in order of ascending correlation for easier visualization. Slopes of the regression lines range from −0.54 (95% CI: − 7.06–5.97) to +7.74 (95% CI: 3.14–12.33). C A Poisson regression for each large family. The plot shows the slope of each regression ± 95% confidence intervals. The vertical line indicates the average paternal age effect for all families in this model

Given that DNMs count can be affected, in addition to paternal age, by other factors such as family membership, number of offspring, and ancestral population, we applied a Poisson regression analysis that integrates these factors into the model. Initially, we built a null model focusing on the paternal age effect on the DNM count but using only families with more than three offspring (n = 21). This showed an estimated paternal age effect of 1.57 DNMs per year (95% CI: 1.29–1.85, p < 2.2e−16) (Supplementary Table 1). Then we fitted a Poisson regression model by adding family membership to the model that incorporated paternal age and DNM count, and found that the paternal age effect significantly varies between families, ranging from −0.39 (95% CI: −1.66–0.87) to 7.8 (95% CI: 4.28–11.40) additional DNMs per year (Fig. 1B, C). This interaction model fits better than the null model and gives an improved regression model (p = 0.002). This model shows an average increase of 2.1 (95% CI: 1–3.2) DNMs per year of paternal age. We also examined two more Poisson models that test the paternal age effect on DNM counts and separately add number of offspring and population, but both factors had no significant effect on the relationship of paternal age and DNM count (Supplementary Table 1).

Effect of additional siblings on DNM detection accuracy

We stratified the cohort based on the number of offspring per family to investigate if additional offspring reduced the DNM counts in each “index” child (Supplementary Fig. 1). We found the average number of DNMs per individual to be lower in larger families compared to smaller families, with a reduction of around 1.15 DNMs per added sibling (Fig. 2), suggesting that sequencing additional family members can significantly improve the ability to discriminate true de novo variants from rare inherited ones.

Fig. 2figure 2

Effect of family size on DNM counts. Number of offspring per family is plotted against DNM count in individuals. The red line represents the regression line (slope = −1.15) with 95% confidence intervals shown in gray

De novo mutation load and consanguinity

Given the high level of consanguinity (~32%) in our cohort, we explored whether there was evidence for a correlation between consanguinity and DNM count (Fig. 3, Supplementary Fig. 6). Rather than relying solely on reported parental consanguinity, we computed each child’s relatedness score using KING (see methods) [31]. Although the number of DNMs was not expected to be affected by consanguinity, the offspring of consanguineous marriages appeared to have fewer DNMs (p value = 0.033) (Fig. 3A). To rule out confounders impacting this correlation, we also examined the relationship between relatedness score and both the father’s age at conception (Fig. 3B) and family size (Fig. 3C). Notably, consanguineous parents in our cohort appeared to have had children at younger ages than non-consanguineous parents, as well as larger family sizes, which, as explained earlier, reduces the DNM count due to sibling sharing. As expected, correcting for these factors (the father’s age in particular) uncoupled the relatedness score from the DNM counts (Fig. 3D, E), providing a rational explanation for why the offspring of consanguineous parents appeared to have fewer DNMs than non-consanguineous trios.

Fig. 3figure 3

De novo mutation load and consanguinity. Parents in each trio in the dataset were categorized into 1st degree cousins (blue), 2nd degree cousins (green), and unrelated (red). Boxplots show the median and interquartile range, and p values are shown above brackets. Plots show the correlation between relatedness scores and (A) DNM count, (B) father’s age at conception, (C) family size, (D) DNM count after correcting for father’s age, and (E) DNM count after correcting for family size

DNM spectra and mutational signature

We next examined the distribution of DNMs in relation to base changes (Supplementary Fig. 7A). Consistent with previous reports [39], we found a nearly 2-fold enrichment in transitions versus transversions, with 35% of DNMs being C > T. To drill deeper into the genomic context of the DNMs, we examined all possible DNA sequence triplets at the DNM sites which, together, make up the mutational signature of de novo mutations (Supplementary Fig. 8). Among the highest proportion of DNMs (i.e., C > T substitutions), CpG sites were found to contribute to a large fraction of the DNM events. The same mutational signature was also discovered previously in three different trio datasets [40]. We then followed the above approach to compare the fractions of phased DNMs from both parents (Supplementary Fig. 7B). We found no statistically significant difference in the DNM spectra by parent-of-origin, likely due to the relatively limited number of phased DNMs per individual.

We further examined the effect of local GC content on mutability. Using a sliding window approach (with windows ranging from 10–1000 bases), we extracted the genomic sequence from around each DNM from the GRCh37 reference genome and calculated the GC content surrounding SNVs or INDELs (Supplementary Fig. 9). We found a higher GC content near SNVs and a lower content near INDELs compared to the average genomic GC content of 41% [41].

We next calculated the mutation rates of both transition and transversion variants with respect to CpG site (Supplementary Table 2). We found that CpG dinucleotides had much higher mutation rates compared to non-CpGs, and the difference was clear.

CpG methylation as a driver for DNM development

The high correlation between DNMs and CpG sites suggested that methylation levels play a role in the genesis of DNMs in parental gametes. To assess this, we compared the mutation rates at CpG sites with respect to the level of methylation (i.e., percentage of reads containing a methyl group) across human tissues. First, when we examined adult spermatogonial stem cells (SSCs) [33], we observed a total of 3,801 variants in our DNM catalog at CpG sites, 475 of which were paternal in origin. Surprisingly, for the paternally phased variants, we found that the highly methylated CpG sites (i.e., >50% of reads methylated) were 2.03 times (binomial p value = 6.62 × 10−11) more likely to have DNMs than the low-methylation sites (Table 2). To improve the specificity of this observation, we performed the same analysis using the methylation profiles of two other human tissues as controls: liver cells and PBMCs [34, 35]. We found much smaller fold-change differences between the methylation levels in terms of mutation rate (binomial p values = 0.03 and 0.004 for liver cells and PBMCs, respectively). These results provide further evidence for the key role of CpG methylation in the development of de novo mutations.

Table 2 Fold difference in the fraction of DNMs based on methylation levelsDNM localization and count in different populations and disease phenotypes

We next examined the genomic localization of the 27,168 DNMs in our cohort. Among these variants, 459 were in coding regions (average per child = 1.3, median = 1). This represents 1.7% of the total number of variants, which is consistent with the proportion of coding bases in the human genome. We also found 43 (0.16 %) loss-of-function de novo variants, of which 13 (30%) were predicted to cause nonsense mediated decay.

We stratified our cohort based on ethnicity and disease phenotype to assess if there were differences in the DNM counts in these categories. African and South-Asian populations seemed to have a significantly higher number of DNMs compared to Middle Eastern and Caucasian populations, as shown in Fig. 4A. However, when we accounted for differences in paternal age between these populations (Fig. 4B), we found that the fathers’ ages at conception could be the factor driving the differences in DNM counts across the populations. After correcting for the father’s age, none of the populations remained significantly different from the others, although the African population showed a higher trend across populations.

Fig. 4figure 4

DNM counts by sub-population and disease phenotypes. Boxplots show the median and interquartile range, and p values (Bonferroni corrected in A and B) are shown above brackets. Plots show the (A) DNM counts in different populations, (B) DNM counts in the populations normalized to the father’s age, and (C) DNM counts with regard to disease phenotypes

After looking into the differences among phenotypes in terms of DNM counts, we found that probands with neurogenetic disorders had, on average, more DNMs compared to healthy subjects, although that difference did not achieve statistical significance. We also found no significant differences between any of the other phenotype groups and healthy subjects (Fig. 4C). To delve deeper into the variants in our dataset, we used the annotated files to compare several metrics between the different phenotypes. We compared the percentage of variants with certain thresholds of combined annotation-dependent depletion (CADD) scores (Supplementary Fig. 10A), genomic evolutionary rate profiling (GERP) scores (Supplementary Fig. 10B), combined CADD and GERP scores (Supplementary Fig. 10C), and predicted loss-of-function intolerance (pLI) scores (Supplementary Fig. 10D), with no significant differences between the phenotypes.

Finally, we tested if the variants with different functional effects were enriched in certain phenotypes (Supplementary Fig. 11). To do this, we first compared the phenotypes using the normal functional impact annotation of “LOW”, “MODERATE”, and “HIGH” (Supplementary Fig. 11A). We then sub-categorized the variants, based on their functional impact, into protein non-disrupting (“MODIFIER” + “LOW”) and protein-disrupting (“MODERATE” + “HIGH”) variants (Supplementary Fig. 11B). We also failed to find any significant differences between the DNM distributions across the different phenotype groups in this regard.

留言 (0)

沒有登入
gif