Profiles of autism characteristics in thirteen genetic syndromes: a machine learning approach

Participants

This study used retrospective baseline data from one of the largest cross-syndrome databases in the UK (held at a UK-based university). The total sample included 1702 individuals with genetic syndromes associated with intellectual disability (ID) and 264 autistic individuals with varying levels of adaptive skills. The database was first set up in 2003, and the last follow-up was completed in 2018. The first wave of data collection included eight behavioural and health measures as well as diagnostic information (i.e., presence/absence of a genetic syndrome). As part of the follow-ups, more measures and groups (including autistic individuals without a genetic syndrome) were added to the database. Currently, this database represents the largest longitudinal dataset on individuals with genetic syndromes associated with ID in the UK.

Each of the thirteen genetic syndromes was included in this paper due to its reported increased likelihood of autism compared to the general population [16]. We also used opportunity sampling based on the data available at the time of analysis. In total, 1582 individuals with genetic syndromes and 258 autistic individuals who did not have a known genetic syndrome, all over four years of age, were included in the analysis. The genetic syndrome groups had varying sample sizes and included: Angelman (AS, n = 154), Cri du Chat (CdCS, n = 75), Cornelia de Lange (CdLS, n = 199), fragile X (FXS, n = 297), Prader–Willi (PWS, n = 278), Lowe (LS, n = 89), Smith–Magenis (SMS, n = 54), Down (DS, n = 135), Sotos (SS, n = 40), Rubinstein–Taybi (RTS, n = 102), 1p36 deletion (n = 41), Phelan–McDermid (PMS, n = 35) syndromes and tuberous sclerosis complex (TSC, n = 83). Individuals in these groups were included in the analysis irrespective of the presence or absence of an autism diagnosis.

Due to missing data (> 30%) or unsuitable age for assessment of social and communication skills with the SCQ (3 years or younger), 120 individuals with genetic syndromes (AS n = 3; FXS n = 21; PWS n = 25; RTS n = 3; CdLS n = 25; DS n = 9; CdCS n = 6; 1p36 n = 6; LS n = 7; SMS n = 6; TSC n = 4; SS n = 10; PMS n = 1) and 6 autistic individuals without a genetic syndrome were excluded from the study. For demographic characteristics of the entire sample, refer to Table 1.
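As a quick arithmetic check, the reported group sizes and totals can be reconciled in a few lines. This is only an illustrative sketch: the numbers are copied from the text above, and the variable names are introduced here for convenience.

```python
# Group sizes included in the analysis, as reported in the Participants section.
included = {
    "AS": 154, "CdCS": 75, "CdLS": 199, "FXS": 297, "PWS": 278,
    "LS": 89, "SMS": 54, "DS": 135, "SS": 40, "RTS": 102,
    "1p36": 41, "PMS": 35, "TSC": 83,
}

total_syndrome_db = 1702   # individuals with genetic syndromes in the database
total_autistic_db = 264    # autistic individuals in the database
excluded_syndrome = 120    # reported total of excluded syndrome cases
excluded_autistic = 6      # reported total of excluded autistic cases

# Sum of the thirteen group sizes, and database totals minus exclusions.
analysed_syndrome = sum(included.values())
remaining_syndrome = total_syndrome_db - excluded_syndrome
analysed_autistic = total_autistic_db - excluded_autistic

print(len(included), analysed_syndrome, remaining_syndrome, analysed_autistic)
```

Both routes to the syndrome total (summing the thirteen groups, or subtracting the reported exclusions from the database total) should give the same figure as the analysed sample stated in the text.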

Table 1 Sample description of the thirteen genetic syndromes and the autistic group

Recruitment

Potential participants and their parents/carers were invited to take part in a questionnaire survey evaluating the behavioural characteristics associated with a range of genetic syndromes. Questionnaire responses were collected from 2003 to 2018. Participants were recruited via syndrome support groups/associations (e.g., Fragile X Society UK, CdLS Foundation UK and Ireland and the National Autistic Society). The recruitment strategy was agreed between the former research centre and the relevant associations/charities to maximise recruitment success and minimise potential burden on the participants. Favourable ethical approval was granted by the Coventry Research Ethics Committee (REC, 10/H1210/1), and the current study underwent institutional governance review. Individuals with genetic syndromes were included in the study if they reported receiving a diagnosis of the genetic syndrome from an appropriate professional (i.e., a paediatrician, clinical geneticist, general practitioner, psychiatrist, or neurologist). Parents and caregivers were also invited to share genetic confirmation letters (where such a record was available and families consented to sharing it). Autistic individuals without a genetic syndrome were included in the analysis if they (1) reached the suggested threshold for autism or autism spectrum disorder (ASD) on the SCQ, (2) indicated that an autism diagnosis had been made by an appropriate professional (i.e., a clinical psychologist, psychiatrist, educational psychologist, speech and language therapist, paediatrician, or general practitioner) and (3) confirmed the absence of a genetic syndrome diagnosis.

Inclusion criteria for the current study were: (1) presence of a rare genetic syndrome or autism, or both, (2) age 4 years or older, (3) ability of the caregiver/child/adult to provide informed consent or assent to participate in the study, as appropriate to their capacity to consent/assent, and (4) fluency in English of the informant/participant. All participants who met the inclusion criteria outlined above were included in the analysis.

Measures

Social Communication Questionnaire (SCQ) [30]

The SCQ is a widely used screening tool that focuses on autistic characteristics [30]. This parent/caregiver-report questionnaire is based on the Autism Diagnostic Interview-Revised (ADI-R), which is a well-established diagnostic interview [31]. The SCQ has also been used to understand autism-related behavioural phenotypes in populations with genetic syndromes and in genetic population studies of autism [15].

The SCQ consists of 40 items with a binary response (yes/no). The measure is suitable for individuals who are 4 years or older. There are two versions of the SCQ, lifetime and current. The lifetime version assesses the entire developmental history of the participant and is used to support diagnosis or to indicate that a diagnosis should be considered. The current version focuses on the participant’s behaviour in the past 3 months and is suitable for assessing current autistic traits for support and educational plans. SCQ items are scored as 0 or 1; 0 reflects the absence of the relevant behaviour, and 1 reflects its presence. A cut-off score of 15 or greater is suggested by the authors of the measure to indicate the presence of autism spectrum disorder. In the current study, the lifetime version of the SCQ was used.
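The scoring rule described above can be sketched as follows. This is a minimal illustration, assuming the responses have already been converted to item scores (1 = relevant behaviour present, 0 = absent); the published scoring key that maps each yes/no answer to a score is not reproduced here, and the function names are hypothetical.

```python
SCQ_ITEM_COUNT = 40
SCQ_CUTOFF = 15  # cut-off suggested by the authors of the measure


def scq_total(item_scores):
    """Sum the binary item scores (1 = behaviour present, 0 = absent)."""
    if len(item_scores) != SCQ_ITEM_COUNT:
        raise ValueError(f"expected {SCQ_ITEM_COUNT} item scores")
    if any(score not in (0, 1) for score in item_scores):
        raise ValueError("item scores must be 0 or 1")
    return sum(item_scores)


def meets_cutoff(item_scores, cutoff=SCQ_CUTOFF):
    """Return True when the total reaches the suggested cut-off."""
    return scq_total(item_scores) >= cutoff


# Hypothetical respondent with 15 items scored as present.
example = [1] * 15 + [0] * 25
print(scq_total(example), meets_cutoff(example))
```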

Wessex questionnaire [32]

The Wessex questionnaire quantifies self-help skills in children and adults with intellectual disability and is commonly used in individuals with genetic syndromes [15, 33,34,35]. The items enquire about a variety of adaptive skills, forming five subscales: self-help skills, speech, vision, hearing and mobility. For the current study, the self-help total score, with a maximum of 9, was used. Self-help scores of 6 and over are classified as a moderate level of skill. The term ‘typical’ has been used to replace the original questionnaire term ‘normal’, in keeping with current terminology.

Statistical analysis

Standard principal component analysis (PCA) based on the SCQ items was used to investigate the extent of overlap between the autism profiles across the thirteen genetic groups (Additional file 1: Fig. S1). We conducted PCA to confirm previous findings indicating that PCA, as an unsupervised analysis, is not well suited to generating autism-related profiles for genetic syndromes [1]. For Additional file 1: Fig. S1, the first two components (PC1 and PC2), which explained the largest amount of the variance, were selected. The PCA revealed two main clusters, mostly separating individuals who use a few words or more from individuals who use no words (Additional file 1: Fig. S1, panel A). Following this, language items were excluded from the analysis, resulting in a single cluster (Additional file 1: Fig. S1, panel B).
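To make the idea of selecting the first two components concrete, the sketch below computes PC1 and PC2 and their share of the total variance for a small item-by-respondent matrix. It is an illustration only: it uses power iteration with deflation on the covariance matrix rather than whatever PCA routine the study used, the toy data are invented, and all function names are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column (item) mean from its values."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already centred data."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def leading_eigenpair(matrix, iterations=1000):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    vector = [1.0 + 0.01 * i for i in range(d)]  # arbitrary starting guess
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0, vector
        vector = [x / norm for x in product]
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


def first_two_components(rows):
    """Return [(variance share, loadings)] for PC1 and PC2."""
    cov = covariance_matrix(center_columns(rows))
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    value1, vec1 = leading_eigenpair(cov)
    # Deflation: remove PC1 so the next power iteration finds PC2.
    deflated = [[cov[i][j] - value1 * vec1[i] * vec1[j] for j in range(d)]
                for i in range(d)]
    value2, vec2 = leading_eigenpair(deflated)
    return [(value1 / total_variance, vec1), (value2 / total_variance, vec2)]


# Invented toy data: 8 respondents by 4 binary items.
demo = [
    [1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1],
    [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
]
for share, loadings in first_two_components(demo):
    print(round(share, 3), [round(x, 3) for x in loadings])
```

Plotting each respondent's scores on these two components, coloured by group, is what reveals whether groups overlap or separate.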

The lifetime version of the SCQ was used, and all 40 items were included in the analysis. However, the traditional binary scoring (1 = Yes, 0 = No) was not reflective of the heterogeneity of language ability across the sample (e.g., language delay vs no language use across the lifespan). For this reason, an additional score of 2 was introduced for the six language-related items for all participants to indicate the absence of language use, rather than the absence of autism-related characteristics, for these items across the lifespan. This new score allowed us to include all items for all participants in the classification analysis and capture language heterogeneity at the same time. The coding of all items, including the language items, is treated as categorical (rather than ordinal) by the model. As a result, the model generates a pattern of responses for each syndrome group. A score of 2 is therefore not weighted as more important/influential than a score of 1 or 0 by the model. However, if a group tends to score 2, 1 or 0 on most language items, this type of scoring pattern can help create a unique profile for this group.
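The recoding and the categorical treatment described above can be illustrated with a short sketch. Several things here are assumptions made for illustration: the positions of the six language-related items are placeholders (the text does not list them), the `uses_no_words` flag stands in for however absence of language was determined, and one-hot encoding is shown as one common way of making a model treat codes as unordered categories, not necessarily the mechanism used in the study.

```python
# Placeholder 0-based positions of the six language-related items.
# These are hypothetical; the source does not say which items they are.
LANGUAGE_ITEMS = (1, 2, 3, 4, 5, 6)


def recode_language_items(item_scores, uses_no_words,
                          language_items=LANGUAGE_ITEMS):
    """Return a copy in which language items are set to 2 when the
    participant does not use language; other items are left unchanged."""
    recoded = list(item_scores)
    if uses_no_words:
        for index in language_items:
            recoded[index] = 2
    return recoded


def one_hot(item_scores, levels=(0, 1, 2)):
    """Encode each score as indicator columns, one per level, so that
    2 is a separate category rather than 'twice as much' as 1."""
    encoded = []
    for score in item_scores:
        if score not in levels:
            raise ValueError(f"unexpected score: {score}")
        encoded.extend(1 if score == level else 0 for level in levels)
    return encoded


# Hypothetical participant who does not use words.
raw = [0] * 40
recoded = recode_language_items(raw, uses_no_words=True)
features = one_hot(recoded)
print(recoded.count(2), len(features))
```

With indicator columns, the distance between codes 0 and 2 is the same as between 0 and 1, which matches the statement that a score of 2 carries no extra weight.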

Similarly to previous studies [29], the support vector machine (SVM) approach was adopted to provide better predictive accuracy when classifying individuals into the thirteen genetic syndrome groups based on their SCQ scores. In essence, an SVM training algorithm uses training exemplars (in this case, each exemplar being the list of item-level SCQ responses for a given individual) with their category classes (in this case, syndrome group membership) to build a model which can then be used to classify novel exemplars (cases). To build the model, the SVM maps training exemplars to a high-dimensional space in a manner which maximises the separation between exemplars of different categories, and a hyperplane or set of hyperplanes is constructed to separate the categories.

Technical specifications were also consistent with previously identified optimal parameters (e.g., the use of a radial kernel, the choice of cross-validation method and the approach to generating gamma and cost parameters) [29]. The n-fold (leave-one-out) cross-validation leaves one observation out of the whole sample and builds the SVM classifier on the remaining n − 1 observations. Previously, this method has allowed for an independent estimate of the accuracy of the SVM model on the entire sample [29]. Building the model requires determining the optimal values of the gamma and cost parameters. These were optimised using a random search of gamma and cost values (up to 100 combinations): the accuracy of the SVM classifier was evaluated for each combination, and the combination giving the highest accuracy was chosen. To identify the best parameters, the entire dataset was split into two halves, one serving as the training set and the other as the test set. To deal with unequal sample sizes across syndrome groups, at least fifteen participants from each syndrome group featured in the training set at the stage of identifying the parameters with the highest accuracy. Once the best combination of gamma and cost parameters had been identified, the model was trained and tested on the entire sample, which further helped to reduce over- or underestimation of classification accuracy for certain groups.
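The parameter search described above can be sketched in outline. This is not the study's code: the split is done within each group as one possible way to guarantee the minimum of fifteen training cases per group, the search ranges for gamma and cost are assumed, and `evaluate` is a hypothetical callable standing in for "train an SVM with these parameters on the training half and return its accuracy on the test half".

```python
import random
from collections import defaultdict


def split_half(labels, min_per_group=15, seed=0):
    """Split case indices into training and test halves, keeping at
    least min_per_group cases from every group in the training set."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)
    train, test = [], []
    for members in by_group.values():
        rng.shuffle(members)
        n_train = min(len(members), max(min_per_group, len(members) // 2))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)


def random_search(evaluate, n_combinations=100, seed=0):
    """Try random (gamma, cost) pairs and keep the most accurate one.

    evaluate(gamma, cost) must return an accuracy; it is a placeholder
    for fitting and scoring a classifier."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_combinations):
        gamma = 10 ** rng.uniform(-4, 1)  # assumed range, log-uniform
        cost = 10 ** rng.uniform(-2, 3)   # assumed range, log-uniform
        accuracy = evaluate(gamma, cost)
        if best is None or accuracy > best[0]:
            best = (accuracy, gamma, cost)
    return best


# Hypothetical usage with invented labels and a stand-in scoring function.
labels = ["A"] * 40 + ["B"] * 20 + ["C"] * 100
train_idx, test_idx = split_half(labels)
best_accuracy, best_gamma, best_cost = random_search(
    lambda gamma, cost: 1.0 / (1.0 + abs(gamma - 0.1)))
print(len(train_idx), len(test_idx), best_gamma, best_cost)
```

Sampling gamma and cost on a logarithmic scale is a common choice because useful values often span several orders of magnitude.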

The final SVM model adopted multiclass classification, which involves training multiple binary classifiers, each mapping data points to a high-dimensional space to obtain linear separation between a pair of classes. The multiclass classification uses a ‘one-to-one’, or ‘one-against-one’, approach in which k(k-1)/2 binary classifiers are trained, where k is the number of classes [29]. The final output of the SVM model assigns each data point to a ‘predicted’ class, which is the class most frequently chosen by the binary classifiers. Apart from classification accuracy, the decision values of the binary classifiers can also be used to generate a predicted probability for each class, which offers an alternative way to assess the confidence of the SVM predictions.
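The one-against-one voting scheme can be illustrated with a small sketch. The pairwise winners below are invented, the function names are hypothetical, and the alphabetical tie-break is an arbitrary choice made for this example rather than a description of any particular SVM library.

```python
from collections import Counter
from itertools import combinations


def pairwise_classifiers(classes):
    """List every unordered pair of classes: k(k-1)/2 pairs for k classes."""
    return list(combinations(classes, 2))


def vote(pairwise_winners):
    """Pick the class chosen most often by the binary classifiers.

    pairwise_winners maps each (class_a, class_b) pair to the class
    that the corresponding binary classifier selected."""
    counts = Counter(pairwise_winners.values())
    top = max(counts.values())
    # Tie-break alphabetically (illustrative assumption).
    return sorted(label for label, n in counts.items() if n == top)[0]


# With 13 classes there are 13 * 12 / 2 pairwise classifiers.
groups = ["AS", "CdCS", "CdLS", "FXS", "PWS", "LS", "SMS",
          "DS", "SS", "RTS", "1p36", "PMS", "TSC"]
print(len(pairwise_classifiers(groups)))

# Invented three-class example: two of three classifiers choose "FXS".
winners = {("AS", "FXS"): "FXS", ("AS", "PWS"): "AS", ("FXS", "PWS"): "FXS"}
print(vote(winners))
```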

We also carried out an item-level analysis, in which we evaluated each of the SCQ items within each group in order of its importance to the classification results on a scale from 0 to 100, indicating lowest to highest importance, respectively. Although the output of this analysis provides data on all items, we reported the five most and the five least important items for each syndrome group for interpretation purposes and for consistency with previous research [29]. For the selection of the items, the following criteria were used: a score of 20 or lower for the five least important items and a score of 50 or higher for the five most important items. However, it is important to note that the variability across groups was substantial: some groups evidenced scores as high as 100 for some items (e.g., RTS group), while other groups evidenced scores of 50 for most items (e.g., FXS group). This suggests that for some groups, most SCQ items might have been equally important for their classification results, while for others only certain items might have been particularly important. The selection of the cut-off criteria was data-driven and exploratory. In line with previous work [29], this study adopted libSVM, via the svm function in the e1071 package in R [36, 37].
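The item selection rule described above (up to five items scoring 50 or higher, and up to five scoring 20 or lower) can be sketched as follows. The importance scores in the example are invented, the function name is hypothetical, and how the study handled ties or groups with fewer than five qualifying items is not stated, so this sketch simply returns however many items qualify.

```python
def select_items(importance, n=5, high=50, low=20):
    """Return (most_important, least_important) lists of (item, score).

    importance maps item names to scores on the 0 to 100 scale."""
    ranked = sorted(importance.items(), key=lambda pair: pair[1], reverse=True)
    most = [(item, score) for item, score in ranked if score >= high][:n]
    least = [(item, score) for item, score in reversed(ranked)
             if score <= low][:n]
    return most, least


# Invented importance scores for one hypothetical group.
scores = {
    "item1": 100, "item2": 80, "item3": 55, "item4": 50,
    "item5": 49, "item6": 20, "item7": 10, "item8": 0,
}
most_important, least_important = select_items(scores)
print(most_important)
print(least_important)
```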
