A systematic review of the effectiveness of machine learning for predicting psychosocial outcomes in acquired brain injury: Which algorithms are used and why?

Background

The variation in psychosocial outcomes after an acquired brain injury (ABI, an injury to the brain sustained after birth, including stroke and traumatic brain injury [TBI]) challenges health and social care services to provide advice and guidance to the person and their family, and to manage the socioeconomic implications. Currently, ‘evidence-based practice’ relies almost exclusively on the results of parametric analyses of group-level central tendency derived from randomized clinical trials, which offer very little guidance for individualized care. Clinical prediction rules that accurately predict an individual’s psychosocial outcome at a future time point after ABI would support timely resource allocation and risk management, as well as allowing interventions to be adapted for known risk factors to maximize the likelihood of more favourable outcomes.

Machine learning (ML) is an evolving methodology in clinical research, offering a possible solution to the limitations of traditional modelling methods and potentially better applicability of research findings to individualized clinical decisions through the development of clinical prediction rules. Supervised ML learns from the data how best to predict the outcome in question (Hastie, Tibshirani, & Friedman, 2009; Ch 2). Whilst ML was predominantly the preserve of data scientists and statisticians, clinicians and clinical researchers are increasingly considering its use for tackling the large and complex data sets typical of routine clinical data.
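To make the supervised paradigm concrete, a minimal sketch follows, using the open-source scikit-learn library and simulated data (not data from any study reviewed here): the algorithm is given predictors and known outcomes for a set of training cases, and its performance is then checked on unseen cases.

```python
# A minimal sketch of supervised ML with scikit-learn, using simulated
# data in place of real clinical records: the algorithm learns from
# cases with known outcomes, then predicts outcomes for unseen cases.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated predictors and a binary outcome (e.g., returned to work: yes/no).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set so performance reflects unseen cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)         # the 'training experience'
print(model.score(X_test, y_test))  # accuracy on unseen cases
```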

The clinical applications of ML have expanded from medical and genetic research to psychological research questions. Psychosocial outcomes, such as the likelihood of developing a mood disorder or returning to work after an ABI, typically involve a higher degree of subjectivity than medical outcomes, and their measurement can include higher proportions of noise (Mascolo, 2016). Despite ML’s growing popularity, how well it performs at predicting such outcomes in ABI is unknown.

To date, there has been no review of, or guidance for, using ML to predict psychosocial outcomes in ABI; however, a previous systematic review showed superior performance of ML methodologies in predicting neurosurgical outcomes (Senders et al., 2018). Unfortunately, because no risk of bias (ROB) assessment was completed for that review, the applicability of its findings is greatly limited. In recent years, guidance has been developed for prediction research (e.g., Moons et al., 2015; Wolff et al., 2019), allowing thorough evaluation of prediction models. Without such guidance, common data mistakes can lead to biased results. By evaluating psychosocial ABI research, clinicians will benefit from being able to understand the effectiveness of ML algorithms across ABIs, consider the suitability of ML for the data sets commonly available within services, and work towards developing accurate prediction tools to assist clinical decision-making.

Objectives

This systematic review aimed to evaluate research employing ML to develop models for the prediction of psychological, social, and/or functional outcomes after ABI.

In particular, this review set out to answer:

1. How effective is ML for making psychosocial predictions for people with ABI?
2. Which ML algorithms are most commonly used?
3. What is the rationale for the choice of ML algorithms, as stated by the study authors?

Method

Protocol and registration

The protocol of this systematic review was written in accordance with PRISMA-P (Moher et al., 2015) and registered on PROSPERO on 15/July/2019, registration number CRD42019140546 [available from: https://www.crd.york.ac.uk/PROSPERO/display_record.php?RecordID=140546]. This review has been written in accordance with PRISMA (Liberati et al., 2009).

Eligibility criteria

Research reports were included if an English-language version was available in a peer-reviewed journal. All reports up until the search date of 22/July/2019 were initially considered for the review. Due to the large number of eligible studies identified, studies were then limited to those published between 1st January 2016 and 22nd July 2019, to cover articles published after the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidance (Moons et al., 2015).

Participants

Studies included participants with a diagnosis of ABI, such as TBI (mild, moderate, or severe) or stroke. This review included people of any age, gender, or geographical location. Studies which included conditions other than ABI (e.g., other types of physical trauma or neurodegenerative conditions) in the same analysis with people with ABI were excluded.

Exposures and comparators

Studies were included with at least one psychosocial predictor in the final model. Psychosocial was defined as a measure of psychological or behavioural factors (e.g., cognition, mental health, challenging behaviours) or social factors (e.g., participation, accommodation status, employment). Studies were excluded where all predictors were biological (e.g., physical measurements, vital signs, or neuroimaging) or primarily impairment-based (e.g., Glasgow Coma Scale [GCS]; Teasdale & Jennett, 1974). The comparator was the absence of the exposure (predictor), or lower levels of the exposure where measured on a dimensional scale.

Outcomes of interest

Studies predicting a psychosocial outcome were included, with psychosocial defined as above. Studies were excluded where predictors and outcomes were measured at the same time point (e.g., questionnaire items predicting questionnaire outcome). This review excluded outcomes designed specifically for disciplines other than psychology (e.g., speech and language therapy measures, physiotherapy measures), measures which are primarily impairment-based (e.g., GCS) or neurological (e.g., neuroimaging, cerebrospinal fluid).

Study design

Studies were required to be observational designs which reported the development of a supervised ML model. ML was defined as ‘algorithms [which search] through a large space of candidate programs, guided by training experience, to find a program that optimizes the performance metric.’ (Bzdok, Krzywinski, & Altman, 2017 p. 1119). An ML technique is ‘supervised’ if it uses known outcome data as part of model learning. Studies reporting the application of a previously developed model and which did not include model development results were excluded.

Search and study selection

Published literature was reviewed from MEDLINE (PubMed), Web of Science, EMBASE (OVID interface, 1990 onwards), CINAHL, and PsycINFO (EBSCOhost interface, 1990 onwards), up until the date of 22/July/2019. The full search strategy is presented in Appendix S1. The search results were managed in the author’s EndNote library (www.myendnoteweb.com). Duplicates were removed during database extraction, and then titles were screened to remove papers that were not eligible. This screening process was repeated for abstracts and lastly full texts. A second reviewer independently repeated this process for 50 records at the title/abstract stage and 10 records at the full-text stage to check for consistency, showing 100% concordance.

Data collection process

A data extraction template was developed to extract relevant data from eligible studies, combining items from the Joanna Briggs Institute critical appraisal checklist for cohort studies (Briggs, 2017), TRIPOD (Moons et al., 2015), and additional items specific to the review questions. A full list of extracted data items is available in Appendix S2. The data extraction template was piloted by the primary author on five studies and then amended with two additional items. The final data extraction template was used by the primary author for all studies, and independently by the second reviewer for three studies, giving an inter-rater agreement of 93.1% (calculated as the percentage of items on which the raters agreed), with discrepancies resolved by discussion.
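As an illustration of this agreement calculation (simple percentage agreement across extraction items), a hypothetical helper follows; the function and the example numbers are illustrative, not taken from the review itself.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items on which two raters gave the same rating."""
    assert len(rater_a) == len(rater_b)
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)

# Hypothetical example: two raters agree on 27 of 29 extraction items.
rater_a = ["Y"] * 27 + ["Y", "N"]
rater_b = ["Y"] * 27 + ["N", "Y"]
print(f"{percent_agreement(rater_a, rater_b):.1f}%")  # 93.1%
```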

Risk of bias in individual studies

The Prediction model Risk Of Bias ASsessment Tool (PROBAST; Wolff et al., 2019) was used at study level to evaluate bias for each presented ML model in each article, completed by the first author for all included articles and by the second reviewer independently for three records to check for consistency. The PROBAST assesses risk of bias across four domains in prediction studies (participants, predictors, outcomes, and analysis), rated by 20 items for ROB and 3 items for applicability. Examples of PROBAST items include the appropriateness of inclusion and exclusion criteria, or whether overfitting, underfitting, and model optimism have been considered in the performance of the model. Inter-rater agreement was 91.7%, indicating high consistency. Differences in opinion were discussed until consensus was reached.

Summary measures and synthesis of results

A narrative synthesis was performed, presented in text and tables. To address the first review question, performance metrics are reported for both the internal validation models and, if applicable, the external validation model, with the area under the receiver operating characteristic curve (AUC, also known as the c-index) as the primary metric of choice. Alternative metrics are reported for some studies. Performance metrics of models were then evaluated as reliable or unreliable depending on the ROB ratings of the models. To address the second review question, the frequency of the algorithms used by researchers is reported. For the third review question, the authors’ stated rationale for their choice of methodology was summarized. The findings for these three questions are then used to provide considerations for future researchers designing ML studies to predict psychosocial outcomes in ABI.
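For readers less familiar with the c-index, the sketch below shows how an AUC is computed from predicted probabilities; it uses scikit-learn with illustrative numbers only, not values from any included study.

```python
from sklearn.metrics import roc_auc_score

# Illustrative true outcomes (1 = event) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# AUC/c-index: the probability that a randomly chosen case with the
# event is ranked above a randomly chosen case without it (0.5 = chance,
# 1.0 = perfect discrimination).
print(roc_auc_score(y_true, y_prob))
```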

Results

Study selection

Figure 1 shows the flow diagram of the search procedure and the results.


Figure 1. PRISMA flow diagram of the study selection process. Abbreviations: ABI = acquired brain injury; ML = machine learning.

Study characteristics

A total of nine studies were included for the systematic review, with brief abstracts available in Appendix S3. Six were from the United States (Bergeron et al., 2019; Cnossen et al., 2017; Gupta et al., 2017; Hirata, Ovbiagele, Markovic, & Towfighi, 2016; Stromberg et al., 2019; Walker et al., 2018), one from Finland (Huttunen et al., 2016), one from Japan (Nishi et al., 2019), and one from Iran (Shafiei et al., 2017). A brief review of study design and analysis by study is included in Table 1.

Table 1. Characteristics of studies included in the systematic review

| Study | ABI population | Outcome | Sample size | Analysis design | ML methodology | Validation procedures |
|---|---|---|---|---|---|---|
| 1. Bergeron et al. (2019) | Concussion | Time to symptom resolution | 1,611 concussive incidents | Classification | NB, SVM, KNN, DTs (C4.5D and C4.5N), RF (with 100 and 500 trees), ANNs (multilayer perceptron and radial basis function network) | 10-fold cross-validation, 1 segment reserved for internal validation |
| 2. Cnossen et al. (2017) | Mild TBI (GCS 13–15) | Post-concussive symptoms (cognitive, somatic and psychological subscales, and severity) | 277 | Regression | RLR (lasso) | Bootstrap with 100 samples |
| 3. Gupta et al. (2017) | Intracerebral haemorrhage | Functional outcome at 3 and 12 months | 365 (3 months); 321 (12 months) | Classification and regression | RF for feature selection, then traditional linear and logistic regression | External validation |
| 4. Hirata et al. (2016) | Stroke | Depression | 17,132 | Classification | RF | Out-of-bag estimation (embedded within random forest), but no cross-validation |
| 5. Huttunen et al. (2016) | Aneurysmal subarachnoid haemorrhage | Antidepressant use | 940 | Classification | DT | None |
| 6. Nishi et al. (2019) | Acute stroke from large vessel occlusion, treated with mechanical thrombectomy | Good clinical outcome | 387 development; 115 external validation | Classification | RLR, SVM and RF | 10-fold nested cross-validation and external validation |
| 7. Shafiei et al. (2017) | Mild TBI (GCS 13–15) | Psychological symptoms | 100 | Classification | ANN (backpropagation algorithm) | 50/50 train/test cross-validation, repeated 300 times |
| 8. Stromberg et al. (2019) | TBI (moderate to severe) | Current competitive employment at 1, 2 and 5 years | 7,867 (1 year); 6,783 (2 years); 4,927 (5 years) | Classification | DT | 85/15 train/test split with no cross-validation |
| 9. Walker et al. (2018) | Non-penetrating TBI (moderate to severe) | Global outcome at 1, 2 and 5 years | 10,125 (1 year); 8,821 (2 years); 6,165 (5 years) | Classification | DT | 85/15 train/test split with no cross-validation |

ABI = acquired brain injury; ANN = artificial neural network; DT = decision tree; GCS = Glasgow Coma Scale; KNN = K-nearest neighbours; ML = machine learning; NB = naïve Bayes; RF = random forest; RLR = regularized logistic regression; SVM = support vector machine; TBI = traumatic brain injury.
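The validation procedures listed in Table 1 estimate out-of-sample performance in different ways. The sketch below, a loose illustration using scikit-learn and simulated data rather than any study’s actual pipeline, contrasts a single train/test split (as in studies 8 and 9), 10-fold cross-validation (studies 1 and 6), and bootstrap resampling (study 2 used a more formal 100-sample bootstrap).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import resample

# Simulated predictors and a binary outcome.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# (a) Single 85/15 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          random_state=0)
print("85/15 split:", model.fit(X_tr, y_tr).score(X_te, y_te))

# (b) 10-fold cross-validation: average performance across 10 held-out folds.
print("10-fold CV:", cross_val_score(model, X, y, cv=10).mean())

# (c) Bootstrap: refit on resamples drawn with replacement and score on
# the full original sample (a simplified gauge of model optimism).
scores = [LogisticRegression(max_iter=1000)
          .fit(*resample(X, y, random_state=s))
          .score(X, y)
          for s in range(100)]
print("Bootstrap mean:", np.mean(scores))
```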

One study predicted outcomes after concussive incidents (1,611 incidents, with some individuals contributing multiple concussions; Bergeron et al., 2019), and the remaining eight predicted outcomes for 64,325 people with ABI in total, including cerebrovascular accident (Gupta et al., 2017; Hirata et al., 2016; Huttunen et al., 2016; Nishi et al., 2019), mild TBI (Cnossen et al., 2017; Shafiei et al., 2017), and moderate to severe TBI (Stromberg et al., 2019; Walker et al., 2018). Two studies used the same database (Stromberg et al., 2019; Walker et al., 2018), and therefore the same participants were likely included in both. Outcomes included post-concussive symptoms (Bergeron et al., 2019; Cnossen et al., 2017), functional outcome (Gupta et al., 2017; Nishi et al., 2019; Walker et al., 2018), indicators of mood and psychological symptoms (Hirata et al., 2016; Huttunen et al., 2016; Shafiei et al., 2017), and employment (Stromberg et al., 2019).

Across the nine studies, a total of 11 types of ML were used: regularized logistic regression (RLR), support vector machine (SVM), decision trees (DT), naïve Bayes (NB), K-nearest neighbours (KNN), random forest (RF), artificial neural networks (ANNs, including multilayer perceptron, backpropagation, and radial basis function network), lasso regularization with linear regression, and random forest used for feature selection with logistic regression. Algorithm descriptions can be found in Table 2. Two studies compared more than one type of ML algorithm (Bergeron et al., 2019; Nishi et al., 2019), and five studies examined more than one time point or outcome (Bergeron et al., 2019; Cnossen et al., 2017; Gupta et al., 2017; Stromberg et al., 2019; Walker et al., 2018), giving a total of 75 ML models analysed.

Table 2. Machine learning algorithm definitions

| Machine learning algorithm | Definition |
|---|---|
| **Classification** | |
| Regularized logistic regression | A classification algorithm whereby coefficient weights are learned using an iterative method, with adjustments within a linear algorithm, before being transformed to predict a binary outcome using the sigmoid or logistic function (Nadkarni, 2016) |
| Support vector machine | Most commonly used as a classification algorithm whereby vectors are mapped into a high-dimensional space to construct a linear decision surface (Cortes & Vapnik, 1995), with the goal of separating two decision categories |
| Decision trees | Decision trees classify predictors by their values among a series of decision branches, until ending with a fairly homogeneous class of the target variable (Rokach & Maimon, 2008) |
| Naïve Bayes | A probability model based on Bayesian theory, where features are ‘naïve’ in the sense that they are assumed to be independent of other features in a given class (Rish, 2001) |
| K-nearest neighbours (KNN) | Commonly used as a classification algorithm where new values are predicted based on the results of other, similar instances (or neighbours). It is common to take the results of more than one neighbour (k) for class determination (Cunningham & Delany, 2020) |
| Random forest | An ensemble algorithm where a large number of decision trees are grown, each on a random sample of the training data drawn from the original data with replacement, using random feature selection for node splits. Each tree then votes for the most popular class for a given input (Breiman, 2001). The goal is to produce a stronger model than single decision trees alone |
| Artificial neural networks | Non-linear classification methods which make no underlying assumptions to limit their fit to the data (Zhang, 2000). A series of interconnected nodes are linked between predictors and output, in a similar way to a neural network in the human brain |
| **Regression** | |
| Least absolute shrinkage and selection operator (lasso) regularization with linear regression | In the regression equation, lasso sets certain coefficients to 0, with the goal of increasing prediction accuracy whilst maintaining interpretability (Tibshirani, 1996) |
| Random forest feature selection, used with linear regression | Features identified by random forest (as described above) are used to enhance the performance of statistical regression algorithms |
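To make these algorithm families concrete, the following sketch instantiates several of the classifiers from Table 2 with scikit-learn defaults on simulated data and scores each with 10-fold cross-validated AUC; the hyperparameters and data are illustrative, not those of the reviewed studies.

```python
# Illustrative instantiation of the Table 2 classification families
# (scikit-learn defaults on simulated data; not any study's pipeline).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

classifiers = {
    "Regularized logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "K-nearest neighbours (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Random forest (100 trees)": RandomForestClassifier(n_estimators=100,
                                                        random_state=0),
    "Multilayer perceptron": MLPClassifier(max_iter=2000, random_state=0),
}

# Score each family with 10-fold cross-validated AUC.
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: mean 10-fold AUC = {auc:.2f}")
```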

Quality of the evidence

Quality ratings of the 75 models were aggregated by study, since each model received the same score within each study (reported in Table 3), with the rationale for ROB scores in Table 4. Across the studies reviewed, every one of the 75 ML models scored as high ROB, with the main source of bias being the analysis. Every study failed to appropriately evaluate the developed models with calibration metrics, meaning each model’s performance for individual probabilities is unknown. One study reported no model evaluation statistics for performance, discrimination, or calibration (Huttunen et al., 2016). Other common causes of high ROB were improper handling of missing data, not using appropriate techniques to account for model optimism and overfitting (such as internal nested cross-validation or bootstrapping), and poor reporting of how models performed after post-hoc refinement.
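For context, calibration asks whether predicted probabilities match observed event rates, and it is straightforward to assess with standard tools; a minimal sketch using scikit-learn’s calibration_curve and the Brier score on simulated data (not a re-analysis of any reviewed model) follows.

```python
# A minimal calibration check with scikit-learn on simulated data
# (illustrative only; not a re-analysis of any reviewed model).
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Brier score: mean squared error of the predicted probabilities
# (lower is better; complements discrimination metrics such as AUC).
print("Brier score:", brier_score_loss(y_te, prob))

# Calibration curve: observed event rate within bins of predicted
# probability; a well-calibrated model tracks the diagonal.
obs_rate, mean_pred = calibration_curve(y_te, prob, n_bins=10)
for p, o in zip(mean_pred, obs_rate):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```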

Table 3. Summary of aggregated risk of bias ratings using PROBAST (Wolff et al., 2019) by study (n = 75 total risk of bias ratings)

| Study | Models evaluated (n) | Participants (items 1.1–1.2; overall) | Predictors (items 2.1–2.3; overall) | Outcome (items 3.1–3.6; overall) | Analysis (items 4.1–4.9; overall) | Overall ROB conclusion |
|---|---|---|---|---|---|---|
| 1. Bergeron et al. (2019) | 60 | Y, PY; Low | N, NI, Y; High | PN, PY, N, NI, PN, PY; High | Y, NI, NI, NI, Y, N/A, N, Y, N/A; High | High |
| 2. Cnossen et al. (2017) | 1 | Y, Y; Low | Y, Y, Y; Low | Y, Y, Y, Y, Y, Y; Low | PY, Y, N, Y, Y, N/A, N, Y, PY; High | High |
| 3. Gupta et al. (2017) | 2 | Y, Y; Low | Y, Y, Y; Low | Y, Y, Y, Y, Y, Y; Low | PY, Y, N, N, Y, Y, N, N, PY; High | High |
| 4. Hirata et al. (2016) | 1 | Y, PY; Low | Y, NI, Y; Low | Y, Y, Y, Y, PY, Y; Low | Y, PY, Y, N, Y, N/A, N, N, N/A; High | High |
| 5. Huttunen et al. (2016) | 1 | Y, Y; Low | PY, PY, Y; Low | Y, Y, Y, Y, Y, Y; Low | Y, NI, Y, PY, Y, N/A, N, N, PY; High | High |
| 6. Nishi et al. (2019) | 3 | Y, Y; Low | PY, Y, Y; Low | Y, Y, Y, Y, PY, Y; Low | PY, Y, Y, N, Y, N/A, N, Y, NI; High | High |
| 7. Shafiei et al. (2017) | 1 | Y, Y; Low | PY, Y, Y; Low | Y, Y, Y, Y, NI, Y; Unclear | PN, NI, PY, PY, Y, N/A, N, PN, N/A; High | High |
| 8. Stromberg et al. (2019) | 3 | Y, Y; Low | Y, NI, Y; Unclear | PY, Y, Y, … | … | … |

Y = yes; PY = probably yes; PN = probably no; N = no; NI = no information; N/A = not applicable; ROB = risk of bias.
