Identifying the neurodevelopmental and psychiatric signatures of genomic disorders associated with intellectual disability: a machine learning approach

Participants

We defined ND-GCs as conditions associated with increased risk of neurodevelopmental symptoms [22] and caused by a genetic variant which was either pathogenic or likely pathogenic, according to American College of Medical Genetics and Genomics guidance [23]. We aimed to recruit a population of participants with a range of ND-GCs that represented a “snapshot” of presentations to UK Child and Adolescent Mental Health Services, Intellectual Disability, Clinical Genetics or Community Paediatrics clinics.

Families of children with a confirmed ND-GC, aged over 4 years, were recruited through UK Medical Genetics clinics, word of mouth and the charities UNIQUE (https://rarechromo.org) and Max Appeal (https://www.maxappeal.org.uk), as part of ongoing cohort studies at Cardiff University including the ECHO study (https://www.cardiff.ac.uk/cy/centre-neuropsychiatric-genetics-genomics/research/themes/developmental-psychiatry/copy-number-variant-research-group) and the IMAGINE study (https://imagine-id.org) [22, 24]. Detailed information regarding the cohort inclusion criteria is available in the IMAGINE study protocol https://imagine-id.org/healthcare-professionals/study-documents-downloads-page/.

Siblings closest in age to individuals with an ND-GC, who did not have a known ND-GC themselves, were recruited to the study as controls; siblings were not excluded if they had any neurodevelopmental or physical health-related conditions.

In total, 589 individuals (441 individuals with an ND-GC and 148 siblings) were included in the study, of whom 493 were included in our machine learning analysis after initial data preparation (Additional file 1: Methods). Participant demographic characteristics are shown in Table 1. Our sample size was the maximum number of participants in our dataset who had all the required variables.

Table 1 Demographic information about the sample of children affected by an ND-GC and sibling controls

Informed, written consent was obtained prior to recruitment from the carers of participants and recruitment was carried out in agreement with protocols approved by relevant NHS and university research ethics committees. Individual ND-GC genotypes were established from medical records and in-house genotyping at the Cardiff University Centre for Neuropsychiatric Genetics and Genomics using microarray analysis. The ND-GCs of participants are shown in Table 2.

Table 2 Counts of the genotypes of all study participants

Assessments

Primary carers of participants completed a battery of assessments to collect comprehensive information on physical and mental health problems through semi-structured interviews with trained research staff and questionnaires. Assessments were carried out between January 2011 and December 2019.

Our goal was to generate a set of discriminating items that could be quickly, easily and conveniently completed by a carer or community clinician either on paper or online, and which could serve as the basis for the development of an instrument screening for the most likely domains in which young people with ND-GCs can experience difficulties. Therefore, measures which involved complex or prolonged assessments, such as IQ or motor co-ordination, or potentially intrusive testing, such as blood tests, although important for a full and in-depth assessment of phenotype in some settings, were not included in the current analysis.

Psychiatric symptoms were measured using the Child and Adolescent Psychiatric Assessment (CAPA, [25]), the Strengths and Difficulties Questionnaire (SDQ, [26]) and the Social Communication Questionnaire (SCQ, [27]). The CAPA assesses a broad set of psychopathological domains including ADHD, anxiety disorders, oppositional defiant disorder, obsessive compulsive disorder, psychosis and psychotic experiences, tic disorders, mood disorders, and substance abuse. The SDQ is a dimensional measure of psychopathology that includes measures of hyperactivity, emotional problems, peer problems, and prosocial behaviour. The SCQ measures ASD-associated symptoms and was used because the CAPA and SDQ lack coverage of ASD symptoms.

Difficulties with coordinated movement are also an important symptom in individuals with ND-GCs [10, 24, 28, 29]; therefore, we assessed motor coordination using the Developmental Coordination Disorder Questionnaire (DCDQ, [30]).

Information about physical health problems and development was collected through a detailed questionnaire covering developmental history including pregnancy and birth and health problems in all major organ systems. A full list of all gathered variables is available on the IMAGINE ID study website https://imagine-id.org/wp-content/uploads/2019/04/Online-Data-dictionary-16.04.19-v2.pdf.

Included items were selected to cover a wide set of domains, including neurodevelopmental disorders, psychopathology more broadly, general health and development, motor development, social and communication skills and areas of strength and prosocial skills.

After filtering variables for excessively similar responses and missing data, all variables but one (birth weight in kg) were either binary or ordinal. We therefore did not perform any transformation on our variables.

Statistical analysis and data availability

All statistical analyses were carried out in R version 4.2.1 [31]. An overview of the analysis workflow is presented in Fig. 1. Code used in the project is provided in a GitHub repository: https://github.com/NADonnelly/nd_cnv_ml and fitted models are presented as an interactive Shiny app: https://nadonnelly.shinyapps.io/cnv_ml_app/. Data from the IMAGINE study are available via the IMAGINE ID study website: https://imagine-id.org/healthcare-professionals/datasharing/. Analysis is reported in line with the TRIPOD guidelines [32] (Additional file 1: Table S1). An early version of this manuscript was deposited as a preprint: https://doi.org/10.1101/2022.12.16.22283581.

Fig. 1 Flowchart of analysis workflow including variable and participant selection and machine learning model fitting. CV: cross-validation; ML: machine learning; PCA: principal components analysis; PLSDA: partial least squares discriminant analysis

Dimensional structure assessment

We applied principal components analysis (PCA) followed by partial least squares discriminant analysis (PLSDA, where the outcome was ND-GC status) to explore the dimensional structure of our dataset, using the mixOmics package [33]. A cross-validation process was used to find the optimal number of components and variables for the PLSDA (Additional file 1: Methods).

Machine learning model fitting

We prepared our data for machine learning (ML) model fitting by splitting participants into a training dataset of 393 (80% of the dataset) and a test set of 100 (20% of the dataset), stratifying by ND-GC status, sex and age (categorised into quintiles). The distribution of demographic characteristics in the test and training sets was reasonably balanced (Additional file 1: Table S2).

Our outcome was binary classification of ND-GC status (with ND-GC vs control), and we evaluated model performance using the area under the receiver operator characteristic curve (AUROC) and Brier Score (mean squared error between predicted probability and true ND-GC status, where controls were scored as 0 and individuals with an ND-GC as 1).
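
As a minimal illustration of these two metrics (a Python sketch, not the R code used in the study), the AUROC can be computed as the proportion of case-control pairs in which the case receives the higher predicted probability, and the Brier Score as the mean squared error against the 0/1 outcome. The function names and the toy data are assumptions made for the example.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def auroc(y_true, p_pred):
    """Share of (case, control) pairs where the case is ranked higher; ties count 0.5."""
    cases = [p for y, p in zip(y_true, p_pred) if y == 1]
    controls = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                wins += 1.0
            elif p_case == p_control:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data: 1 = individual with an ND-GC, 0 = control (hypothetical values).
y = [1, 1, 1, 0, 0]
p = [0.9, 0.8, 0.4, 0.5, 0.1]
example_auroc = auroc(y, p)
example_brier = brier_score(y, p)
print(example_auroc, example_brier)
```

Higher AUROC indicates better discrimination, whereas lower Brier Score indicates better-calibrated and sharper probabilities.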

We used penalised logistic (elastic net) regression (using the glmnet package [34]), random forests (using the ranger package [35]), radial basis function support vector machines (SVMs, using the kernlab package [36]) and single layer artificial neural networks (using the nnet package [37]) to create models capable of capturing linear and nonlinear relationships.

Models were fit using nested cross-validation (CV), with 20 outer folds and 20 inner folds. Outer folds were generated by splitting the data into 5 folds, repeated 4 times. Inner folds were generated from the outer fold analysis set using bootstrapping with replacement.
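
The resampling structure described above can be sketched as follows. This is an illustrative Python sketch rather than the study's R implementation: the function names and seeds are hypothetical, the folds are not stratified here, and treating the out-of-bag rows as the inner assessment set is an assumption about how the bootstrap resamples are evaluated.

```python
import random


def repeated_kfold(n, k=5, repeats=4, seed=1):
    """Yield (analysis, assessment) index lists: k folds, repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for assessment in folds:
            held_out = set(assessment)
            analysis = [i for i in idx if i not in held_out]
            yield analysis, assessment


def bootstrap_resamples(analysis, times=20, seed=1):
    """Yield (in_bag, out_of_bag) index lists drawn with replacement from `analysis`."""
    rng = random.Random(seed)
    for _ in range(times):
        in_bag = [rng.choice(analysis) for _ in analysis]
        chosen = set(in_bag)
        out_of_bag = [i for i in analysis if i not in chosen]
        yield in_bag, out_of_bag


# 393 is the training set size reported above: 5 folds x 4 repeats = 20 outer folds.
outer = list(repeated_kfold(393))
# 20 bootstrap resamples of the first outer analysis set form its inner folds.
inner = list(bootstrap_resamples(outer[0][0]))
print(len(outer), len(inner))
```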

Within each outer fold missing data were imputed using bagged tree models [38], and the same model was used to impute missing data in the analysis set.

Grid search (30 elements) was used to optimise hyperparameters for ML models across inner folds. Model performance was evaluated by fitting the model with the best performing set of hyperparameters in the inner fold data to the (previously unseen) outer fold assessment dataset. This process was then repeated for all outer folds (Additional file 1: Methods).

As an additional analysis, because our dataset was imbalanced with regard to ND-GC status, we also trained and evaluated machine learning models after either downsampling individuals with ND-GCs to be equal in number to controls, or upsampling control individuals to be equal in number to those with ND-GCs, using random resampling with replacement.
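
The two rebalancing strategies can be illustrated with the short Python sketch below. It is a simplified illustration, not the study's code: the function names are hypothetical, cases are assumed to outnumber controls, and the downsampling step draws cases without replacement, which is an assumption.

```python
import random


def downsample_cases(labels, seed=1):
    """Keep all controls (0) and an equally sized random subset of cases (1)."""
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    return sorted(rng.sample(cases, len(controls)) + controls)


def upsample_controls(labels, seed=1):
    """Keep all rows and add controls drawn with replacement until classes match."""
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(controls) for _ in range(len(cases) - len(controls))]
    return sorted(cases + controls + extra)


# Toy labels: 8 individuals with an ND-GC and 3 controls (hypothetical).
labels = [1] * 8 + [0] * 3
down = downsample_cases(labels)
up = upsample_controls(labels)
print(len(down), len(up))
```

In a nested CV setting, resampling of this kind would normally be applied only to the data used for fitting, so that assessment data keep their original class balance.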

Following nested CV, we selected models with the highest AUROC, and evaluated the importance of all included variables for model prediction using permutation testing [39]. We selected the top 30 variables for each of the four ML models and generated two further variable sets: all variables which were included in the top 30 most important for more than one ML model, and those variables included in the top 30 for at least 3 models, to give a total of 6 sets of variables.
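
The construction of the 6 variable sets can be sketched as below. This is an illustrative Python sketch: the function name, model labels and variable names are hypothetical, and the toy example uses a top 2 rather than a top 30 to stay readable.

```python
from collections import Counter


def build_variable_sets(ranked_by_model, top_n=30):
    """Return one top-N set per model plus two overlap-based sets."""
    top = {model: ranked[:top_n] for model, ranked in ranked_by_model.items()}
    # Count in how many models' top-N lists each variable appears.
    counts = Counter(v for variables in top.values() for v in set(variables))
    sets = dict(top)
    sets["more_than_one_model"] = sorted(v for v, c in counts.items() if c > 1)
    sets["at_least_three_models"] = sorted(v for v, c in counts.items() if c >= 3)
    return sets


# Hypothetical importance rankings (most important first) for four models.
ranked = {
    "elastic_net": ["a", "b", "c"],
    "random_forest": ["a", "c", "b"],
    "svm": ["a", "d", "b"],
    "neural_net": ["b", "d", "a"],
}
variable_sets = build_variable_sets(ranked, top_n=2)
print(len(variable_sets), variable_sets["more_than_one_model"])
```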

We extracted 30 variables for each model because we wanted to achieve a balance between accurate prediction, including a wide set of variables for exploration of dimensional structure and limiting the number of items to that which could be realistically completed by young people’s carers and/or clinicians as a brief screening tool to be used in a clinical setting.

We repeated our nested CV process with the same four ML models on each of the 6 sets of most-predictive variables, giving a total of 24 combinations of models and predictor variables, and selected the best performing combinations of variables and ML model based on AUROC.

We evaluated the performance of the final models using the held-out test data. Missing data in the test dataset were imputed using a model fit to the full training dataset, and the ND-GC status of each participant in the test dataset was predicted using the best ML models.

Model performance was evaluated by drawing 2000 bootstrap samples from the test dataset and estimating performance (AUROC and Brier Score) for each bootstrap sample. This produced a distribution of values from which a median value and a 95% confidence interval were calculated.
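
A bootstrap summary of this kind can be sketched as follows. This Python sketch is illustrative only: it uses the Brier Score as the example metric, assumes a simple percentile interval (the interval method is not specified above), and uses hypothetical function names and toy data.

```python
import random
import statistics


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def bootstrap_interval(y_true, p_pred, metric, n_boot=2000, seed=1):
    """Return (median, lower, upper) of `metric` over bootstrap resamples of the rows."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        values.append(metric([y_true[i] for i in idx], [p_pred[i] for i in idx]))
    values.sort()
    lower = values[n_boot * 25 // 1000]       # approximate 2.5th percentile
    upper = values[n_boot * 975 // 1000 - 1]  # approximate 97.5th percentile
    return statistics.median(values), lower, upper


# Toy test-set predictions (hypothetical values).
y = [1, 1, 1, 0, 0, 1, 0, 1]
p = [0.9, 0.8, 0.4, 0.5, 0.1, 0.7, 0.3, 0.6]
median, lower, upper = bootstrap_interval(y, p, brier_score)
print(median, lower, upper)
```

For AUROC, a resample that happens to contain only one class has no defined value, so such resamples would need to be redrawn or skipped.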

Model calibration, i.e. the relationship between true and model-predicted probability of ND-GC status, was estimated by binning model predictions by predicted probability of ND-GC status and plotting this against true ND-GC status. Model performance was also estimated for male and female participants separately, and after binning participants by age quintile.
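
The binning step behind a calibration plot can be sketched as below. This is a minimal Python illustration in which the number of bins, the use of equal-width bins and the function name are all assumptions; each returned row could then be plotted as mean predicted probability against observed proportion.

```python
def calibration_table(y_true, p_pred, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns rows of (bin index, n, mean predicted probability, observed proportion)
    for every non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b].append((y, p))
    rows = []
    for b, members in enumerate(bins):
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            rows.append((b, len(members), mean_p, observed))
    return rows


# Toy predictions (hypothetical values), using 5 bins.
y = [0, 0, 1, 1, 1, 1]
p = [0.05, 0.12, 0.18, 0.85, 0.95, 0.91]
rows = calibration_table(y, p, n_bins=5)
for row in rows:
    print(row)
```

For a well-calibrated model, the observed proportion in each bin lies close to the mean predicted probability.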

The importance of each variable in the best fitting model was evaluated using a permutation-based approach, as above.
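
The general idea of permutation-based importance can be sketched as follows: each variable is shuffled in turn, and the resulting deterioration in a performance metric is recorded. This Python sketch is not the procedure from [39] as implemented in the study; the function names, the choice of the Brier Score as the metric, the number of repeats and the toy model are assumptions.

```python
import random


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def permutation_importance(predict, X, y, metric, n_repeats=10, seed=1):
    """Mean increase in `metric` (an error score) after shuffling each column."""
    rng = random.Random(seed)
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and the outcome
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = metric(y, [predict(row) for row in permuted])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy model that only uses the first variable (hypothetical).
def toy_predict(row):
    return 0.9 if row[0] == 1 else 0.1


X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
importances = permutation_importance(toy_predict, X, y, brier_score)
print(importances)
```

A variable the model ignores, such as the second column here, has an importance of zero because shuffling it leaves every prediction unchanged.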

The optimal threshold for converting model predicted probability of ND-GC status into a binary classification was estimated by finding the threshold which maximised the j-index (sensitivity + specificity – 1, [40]).
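
Threshold selection by maximising the j-index can be sketched as below. This Python sketch is illustrative: the function names are hypothetical, the candidate thresholds are taken to be the observed predicted probabilities, and classifying a participant as having an ND-GC when the predicted probability is at or above the threshold is an assumed convention.

```python
def j_index(y_true, p_pred, threshold):
    """Sensitivity + specificity - 1 when predicting positive for p >= threshold."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, p_pred) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, p_pred) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, p_pred) if y == 0 and p >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def best_threshold(y_true, p_pred):
    """Return the observed predicted probability that maximises the j-index."""
    candidates = sorted(set(p_pred))
    return max(candidates, key=lambda t: j_index(y_true, p_pred, t))


# Toy predictions (hypothetical values).
y = [1, 1, 1, 0, 0]
p = [0.9, 0.8, 0.4, 0.5, 0.1]
threshold = best_threshold(y, p)
best_j = j_index(y, p, threshold)
print(threshold, best_j)
```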

Exploratory graph analysis

Bootstrap exploratory graph analysis (EGA) was used to investigate the dimensional structure of the best performing variable set. EGA has been shown to be as accurate as, or more accurate than, traditional factor analytic methods such as parallel analysis [41]. Bootstrap EGA estimates and evaluates dimensional structure in a set of variables by first applying a network estimation method (EBICglasso as applied using the qgraph package [42]), followed by a community detection algorithm for weighted networks (Walktrap community detection algorithm [43]). Nonparametric bootstrapping is then used to generate bootstrap samples (n = 9999) from the input dataset, and EGA is applied to each replicate sample to form a sampling distribution. The median value of each edge across the replicate networks is then taken, resulting in a single network. The stability of the network can be assessed by measuring the proportion of bootstrapped networks where a given variable is included in each putative dimension [44], and the number of variables included can be adjusted to improve the stability of dimension representations. We therefore fit an EGA model to the full set of variables, then repeated the analysis with the variables with the most consistent relationship to our dimensions (item stability > 0.75; this left 19 variables), generating a stable and consistent EGA model.

To provide an additional assessment of the fit of the proposed dimensional structure to the data, confirmatory factor analysis was carried out on the typical dimension structure identified by bootstrap EGA, with fit assessed using the comparative fit index (CFI) and root mean square error of approximation (RMSEA).
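
For reference, commonly used textbook definitions of the two fit indices are given below. The notation is an assumption introduced here: M denotes the fitted model, B the baseline (independence) model, df the degrees of freedom and N the sample size. Software implementations can differ in detail (for example, using N rather than N - 1, or robust variants for ordinal data), so these formulas are a guide rather than a statement of the exact computation used.

```latex
\mathrm{RMSEA} = \sqrt{\frac{\max\left(\chi^2_M - df_M,\ 0\right)}{df_M\,(N - 1)}},
\qquad
\mathrm{CFI} = 1 - \frac{\max\left(\chi^2_M - df_M,\ 0\right)}
                        {\max\left(\chi^2_M - df_M,\ \chi^2_B - df_B,\ 0\right)}
```

Values of CFI closer to 1 and values of RMSEA closer to 0 indicate better fit.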

Finally, we repeated the above model fitting process using the most important variables in each of the five dimensions identified by EGA.
