Craving for a Robust Methodology: A Systematic Review of Machine Learning Algorithms on Substance-Use Disorders Treatment Outcomes

This systematic review synthesized the findings of 28 publications that explored the application of ML algorithms to the prediction and analysis of treatment outcomes in SUDs. The increasing number of studies published in this area underscores a growing recognition of the potential of ML models within the addiction research community. This emerging movement may offer novel insights that could reshape our understanding of effective SUD interventions. However, we identified several major problems, chiefly inconsistent reporting of the data analyzed, model performance, data leakage prevention (or lack thereof), code transparency, and dataset characteristics. These gaps highlight the need for standardization and rigor in ML applications within SUD research to ensure reproducibility and clinical relevance.

Our findings are in accordance with the broader evidence identified in recent literature, which delineates the emerging integration of ML within various health domains (Andaur Navarro et al., 2022, 2023; Dhiman et al., 2022, 2023; Collins et al., 2014; Farimani et al., 2024). This collective body of work presents a confluence of promising results as well as shortfalls in methodological rigor. Notably, in the addiction research field, similar explorations (Chhetri et al., 2023; Mak et al., 2019) parallel our observations, albeit without the specific focus on methodological critique and clinical outcome orientation that our review prioritizes. To present our findings and guide readers to points of interest, we break down the reviewed studies according to the clinical outcome analyzed. For studies adhering to best practices in ML and medical research, we highlight their significant contributions and insights. Conversely, for studies demonstrating methodological biases and oversights, we scrutinize the implications for their findings and propose viable methodological alternatives.

Outcomes

ML was used to predict several SUD-related clinical outcomes, including identifying SUD, assessing treatment adherence, evaluating severity, identifying disease subtypes or trajectories, and predicting relapse, cessation, and readmission. The distribution of these outcomes across the reviewed studies is synthesized in Fig. 3.

Identifying SUD

An SUD diagnosis relies on the presence of specific symptoms indicative of substance misuse, as outlined in the DSM (American Psychiatric Association, 2022) and the ICD (World Health Organization, 2024). The number of symptoms determines the severity (mild, moderate, or severe). The diversity in SUD symptomatology, influenced by the broad spectrum of diagnostic criteria, is a challenge reflected in predictive modeling. Accurate representation of this clinical variability in the data used to train ML models is critical for their effectiveness. Four studies in this review shed light on ML’s ability to distinguish individuals at high risk from those unlikely to develop such disorders. Annis et al. (2022) and Kang et al. (2022) sought to differentiate healthy controls from individuals with SUD, whereas Jing et al. (2020) aimed to forecast SUD risk in offspring of SUD-affected adults. Additionally, Houghton et al. (2023) sought to classify individuals with SUD as having significant comorbid PTSD symptoms based on symptoms of anxiety and depression, perseverative/intrusive thought patterns, and low tolerance of aversive body sensations, effectively differentiating PTSD-comorbid patients from those with only SUD.

Annis et al. (2022) used a distinct definition of opioid use disorder (OUD) based on multiple ICD codes, which include other conditions, such as specific adverse effects of opioids and opioid poisoning. While this assumption is clinically acceptable for ML model training, some limitations should be noted. The reliance on treatment records is susceptible to data leakage through repeated patient admissions. Furthermore, the dataset’s imbalance, with “no OUD” cases vastly outnumbering “OUD” cases, suggests that model accuracy may reflect dataset skew rather than genuine predictive capability. The model’s performance corroborates this: while accuracy is close to 99% in all experiments, the area under the curve (AUC) never exceeds 0.5. Another concern, given the chronic nature of SUD (American Psychiatric Association, 2022), is the inclusion of “previous diagnosis of OUD” as a predictive feature, since the model relied mainly on the previous diagnosis rather than making its own prediction.
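The gap between accuracy and AUC under class imbalance can be illustrated with a minimal sketch. The cohort below is hypothetical (10 positive cases among 1,000 records, not the data of Annis et al.), and the "model" is a deliberately degenerate one that assigns every record the same score; the helper names are ours.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 10 "OUD" cases among 1,000 records (1% prevalence).
y_true = [1] * 10 + [0] * 990

# Degenerate model: the same risk score, hence the label "no OUD", for everyone.
constant_scores = [0.0] * len(y_true)
constant_labels = [0] * len(y_true)

acc = accuracy(y_true, constant_labels)
auc = roc_auc(y_true, constant_scores)
print(f"accuracy = {acc:.3f}, AUC = {auc:.3f}")  # accuracy = 0.990, AUC = 0.500
```

A model that never identifies a single case still reaches 99% accuracy here, while its AUC sits at chance level, which is why accuracy alone is uninformative on heavily skewed datasets.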

The model of Kang et al. (2022) distinguishes well between healthy controls (HC) and individuals with OUD. However, the recruitment contexts of the OUD and HC groups should be highlighted. OUD individuals were recruited in a compulsory psychiatric facility, which would select for more impulsive individuals, while HC were recruited from the local community. It is possible that the model learned to differentiate these two contexts instead of identifying the presence of SUD. This is further corroborated by the model’s reliance on impulsivity-related features as predictors, which are absent from SUD diagnostic criteria, calling into question the model’s clinical relevance. This would also explain how they reached an unusually high AUC of 0.99 and 96% accuracy.

Jing et al. (2020) stand out as one of the few longitudinal studies included, monitoring a cohort of children from families affected by SUDs from infancy to adulthood, with the objective of forecasting the development of SUD. The model’s AUC ranges from 0.7 to 0.8, increasing with age, which indicates reasonable accuracy in predicting SUD onset and suggests the model could be used to identify high-risk individuals for early intervention. Although early childhood behaviors (such as swearing) are not traditionally regarded as SUD risk factors, their findings suggest that such behaviors might provide a deeper understanding of developmental psychopathology and may offer early signs of developmental trajectories that predispose to SUD.

Treatment Adherence and Dropout

The chronic nature of many disorders requires prolonged engagement with therapeutic interventions to ensure effective treatment and care continuity. Consequently, treatment adherence/completion is crucial for effective treatment of SUDs. Due to their design, treatment modalities, and limitations, the adherence criteria vary throughout SUD research, including program completion, dropout rates, and appointment attendance. Nine studies (Acion et al., 2017; Bailey & DeFulio, 2022; Burgess-Hull et al., 2023; Eddie et al., 2024; Gottlieb et al., 2022; Steele et al., 2018, 2014; Symons et al., 2020; Nasir et al., 2021) focused exclusively on adherence metrics, while two others also included secondary analyses: relapse (Cavicchioli et al., 2021) and treatment prescription (Baucum et al., 2023). Table 2 sheds light on the input features and models employed, which range from traditional logistic regression (LR) to more advanced algorithms like XGBoost and convolutional neural networks (CNN).

Table 2 Characteristics of the studies focused on predicting treatment adherence

Four studies utilized the TEDS-D dataset to train and evaluate models to predict treatment completion. They used similar input features, and three of them reached \(>0.8\) AUC. Yet, methodological oversights should be considered.

The study with a lower AUC (Bailey & DeFulio, 2022) investigated the use of transfer learning. They trained a deep learning model on an OUD dataset and then fine-tuned it (continued training) on another OUD dataset. This is a promising approach that, if successful, could enhance SUD-related predictions. However, the execution faced critical limitations, particularly the premature cessation of the model’s training phase, with no proper justification, and the model’s evaluation. While the accuracy is 76%, this derives from the dataset imbalance. Specifically, the model exhibited a pronounced bias toward predicting treatment completion, correctly identifying only 37 instances of Treatment Failure against a ground truth of 310 out of 1,161 patients.
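The figures reported above can be turned into two simple reference values. The sketch below uses only the three counts quoted in the text and assumes a binary outcome, so that every patient who is not a Treatment Failure counts as a completion; the variable names are ours.

```python
total_patients = 1161     # reported cohort size
true_failures = 310       # ground-truth Treatment Failure cases
detected_failures = 37    # failures the model correctly identified

# Sensitivity (recall) for the minority class.
failure_recall = detected_failures / true_failures

# Accuracy of a trivial rule that always predicts "treatment completion",
# assuming the remaining patients all completed treatment.
majority_baseline = (total_patients - true_failures) / total_patients

print(f"recall on Treatment Failure: {failure_recall:.1%}")               # 11.9%
print(f"accuracy of always predicting completion: {majority_baseline:.1%}")  # 73.3%
```

Under these assumptions, the reported 76% accuracy sits only a few points above what a rule with no predictive content would achieve, while roughly nine in ten treatment failures go undetected.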

The TEDS-D dataset variable “length of stay” (LOS) indicates the number of days an individual stayed in treatment. LOS is directly related to the treatment outcome, as low values indicate dropout or other reasons for treatment termination. However, it is unavailable for individuals entering treatment. Accordingly, LOS should not be used to predict treatment adherence. Still, two studies (Acion et al., 2017; Nasir et al., 2021) used it as an input feature to the ML model in order to predict treatment completion. Unsurprisingly, LOS was the most important variable in both studies’ models, which unfortunately skews their whole analysis. Their models’ high AUC is therefore misleading if taken as representative of performance in a clinical setting. Baucum et al. (2023) aimed to predict SUD treatment completion and assign treatment modality using both ML and reinforcement learning. They reached an AUC of 0.80 for predicting treatment completion and 0.87 for predicting the optimal treatment plan as determined by each individual’s LOS. Because the study shows no major limitations, it provides evidence that ML might be used to identify individuals at high risk of not completing treatment.
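One way to guard against the LOS problem described above is to make availability at treatment entry an explicit property of every candidate feature. The following is a minimal sketch under our own assumptions: the feature names and the availability flags are illustrative and are not taken from the TEDS-D codebook or from any reviewed study.

```python
# Hypothetical feature catalogue: names and flags are illustrative only.
FEATURE_AVAILABLE_AT_ADMISSION = {
    "age": True,
    "primary_substance": True,
    "prior_treatment_episodes": True,
    "length_of_stay": False,  # only known once the episode has ended
}


def select_admission_features(record, availability=FEATURE_AVAILABLE_AT_ADMISSION):
    """Keep only the fields that exist when the patient enters treatment.

    Fields without an availability flag are rejected rather than passed
    through, so a newly added column cannot leak outcome information by default.
    """
    unknown = set(record) - set(availability)
    if unknown:
        raise KeyError(f"no availability flag for: {sorted(unknown)}")
    return {name: value for name, value in record.items() if availability[name]}


episode = {
    "age": 34,
    "primary_substance": "opioids",
    "prior_treatment_episodes": 2,
    "length_of_stay": 12,
}
model_input = select_admission_features(episode)
print(sorted(model_input))  # ['age', 'primary_substance', 'prior_treatment_episodes']
```

The design choice here is to fail loudly on unflagged fields: a silent pass-through is how outcome-dependent variables typically end up among the predictors.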

An additional six studies used other datasets to predict adherence-related outcomes. Symons et al. (2020) compared the predictive accuracy of ML models with that of experienced clinical psychologists in estimating successful alcohol use disorder (AUD) treatment. They used several models for this comparison and appropriately provided both the psychologists and the ML models only with information available at treatment initiation. ML models, especially LR and radial basis function networks, provided more accurate predictions overall. Burgess-Hull et al. (2023) integrated longitudinal urinalysis and clinical data to predict whether OUD patients would attend their next appointment. Their strong AUC of 0.87 underscores the value of integrating data collected up to the most recent appointment and showcases the methodological diversity in harnessing ML for insights into SUD applications. Cavicchioli et al. (2021) aimed to predict dropout or relapse in individuals with a primary diagnosis of AUD. They used a regularized LR model, achieving an AUC of 0.76 for predicting dropout. Higher ASI alcohol scores were the most important feature for prediction.

Steele et al. (2014) used EEG features from a small cohort of 144 incarcerated individuals to identify patients who discontinued their SUD treatment. EEG features were collected during a Go/NoGo task, specifically stimulus-locked, response-locked, and event-related potentials. They achieved high performance, with P2, ERN/NE, and Pe amplitudes (associated with how the brain processes information, responds to errors, and exerts attention and cognitive control) as the strongest predictors of treatment discontinuation, as hypothesized by the authors. Their more recent study (Steele et al., 2018) used several clinical features, including fMRI-derived ones, in another Go/NoGo task. The high performance of both studies’ models showcases the relevance of EEG and fMRI features in the context of adherence. Gottlieb et al. (2022) aimed to identify individuals with OUD at higher risk of dropout after 90 and 120 days. Despite imputing features with excessive missing data (50%) and not reporting AUC values, the results were promising, with high sensitivity and specificity. The most important feature for their model’s prediction was the patients’ quality of life. However, the lack of detailed reporting on SUD diagnoses in these three studies (Steele et al., 2014, 2018; Gottlieb et al., 2022) is a significant methodological shortfall that should be considered in future research, as it impacts the interpretability and generalizability of their findings.

Despite the significant methodological shortcomings identified in certain studies (Nasir et al., 2021; Acion et al., 2017; Bailey & DeFulio, 2022), which ultimately undermine the reliability of their findings, it is important to recognize that there is a body of work in which only minor discrepancies, or none at all, are observed. These studies offer substantial evidence to support the efficacy of predictive modeling in the context of SUD treatment adherence. Accurately identifying patients at heightened risk of dropout provides a valuable foundation for developing targeted interventions and support mechanisms aimed at bolstering adherence rates among SUD patients.

Relapse, Cessation, and Readmission

In the realm of SUDs, relapse, cessation, and readmission are complex and important phenomena of recovery and treatment. Relapse, clinically defined as a return to substance use after a period of abstinence, underscores the chronic nature of SUD (National Institute on Drug Abuse (NIDA), 2023). It highlights the challenges in maintaining long-term recovery in the face of potential stressors. Cessation, or the successful discontinuation of substance use, represents a critical goal of SUD treatment. Readmission reflects recurrent hospitalizations or treatment enrollments, often referred to as the “revolving door phenomenon.” It points toward the need for more effective or sustained intervention strategies. Predicting these outcomes has significant clinical implications. It enables tailoring treatment plans to individual patient needs, potentially improving care continuity and outcome efficacy. Such predictive efforts can inform the allocation of healthcare resources, guiding timely and appropriate interventions for patients at heightened risk. Of the eight papers that fall under this category, five predicted relapse (Cavicchioli et al., 2021; Davis et al., 2021, 2022; Roberts et al., 2022; Costello et al., 2021), while the others predicted treatment readmission (Morel et al., 2020), abstinence (Yip et al., 2019), and cessation of use (Cox et al., 2020).

Aside from dropout, Cavicchioli et al. (2021) also used regularized LR to predict relapse, but their AUC of 0.51 shows that the model was unreliable. Davis et al. (2021) aimed to identify relevant features for predicting relapse independently for women and men. While the authors extensively discuss their findings according to the features with the highest impact on the model outcome, the model’s performance is alarmingly poor. The reported accuracy is 56.8% for women and 62.3% for men, which is only marginally better than the 50% accuracy of random choice. This renders any subsequent analysis of the most important features meaningless, given that the model is probably relying on random features unrelated to the outcome. Their next study (Davis et al., 2022) explores the prediction of post-treatment opioid and/or psychostimulant use. The choice of hazard ratio and concordance index as evaluation metrics is unusual and limits the interpretability and applicability of their model. Their focus remains on analyzing the most important features, with little attention given to model performance.

Roberts et al. (2022) developed models using routinely collected clinical data to predict heavy drinking in patients completing a structured outpatient treatment program, the COMBINE trial (Anton et al., 2006). This study stands out for its innovative application of the leave-site-out (LSO) validation method alongside traditional 10-fold cross-validation. Employing LSO effectively mimicked a real-world scenario in which predictive models should perform well not only in the institutions that provided the training data but also on data from other sources. The AUCs of 0.71, 0.70, and 0.84 for predicting heavy drinking during the first month, during the final month, and between sessions, respectively, showcase the generalizability and applicability of their model. Costello et al. (2021) used conventional and ML propensity score-based methods to examine the effectiveness of 12-step group involvement in reducing relapse following SUD treatment. They demonstrated that high involvement in the group indeed reduces the likelihood of relapse.
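The leave-site-out idea can be sketched in a few lines. This is an illustrative helper written for this review, not the procedure or code of Roberts et al.; the site labels are hypothetical, and model fitting is left out so that only the splitting logic is shown.

```python
from collections import defaultdict


def leave_site_out_splits(site_ids):
    """Yield (held_out_site, train_indices, test_indices), once per site.

    Every record from the held-out site goes to the test set, so the model
    is always evaluated on an institution it has never seen.
    """
    by_site = defaultdict(list)
    for index, site in enumerate(site_ids):
        by_site[site].append(index)
    for held_out in sorted(by_site):
        test_idx = by_site[held_out]
        train_idx = sorted(
            i for site, idxs in by_site.items() if site != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical site labels for eight patients treated at three clinics.
sites = ["A", "A", "B", "B", "B", "C", "C", "A"]
splits = list(leave_site_out_splits(sites))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
# A train: [2, 3, 4, 5, 6] test: [0, 1, 7]
# B train: [0, 1, 5, 6, 7] test: [2, 3, 4]
# C train: [0, 1, 2, 3, 4, 7] test: [5, 6]
```

The same grouping logic applies when the unit is a patient with repeated records rather than a site: keeping all records of one group on the same side of the split is what prevents leakage. In practice, scikit-learn’s `LeaveOneGroupOut` splitter provides this behavior for arbitrary group labels.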

Morel et al. (2020) analyzed readmission within 30 days as a key outcome for both mental and substance use disorders. The training and testing datasets were appropriately divided and described. They reached an AUC of 0.74 with XGBoost, showcasing the potential to identify patients in need of extra care for effective treatment. The most important predictor was length of stay, but several other features also significantly impacted the model prediction. Yip et al. (2019) aimed to identify a brain-based predictor of cocaine abstinence, using fMRI and clinical information. They used connectome-based predictive modeling and achieved an accuracy of 0.71. However, the dataset comprises only 53 incarcerated individuals and cannot be considered representative of a real-world scenario. Moreover, the study does not state the diagnostic criteria used to select the participants.

Cox et al. (2020) aimed to identify the different factors associated with drug use cessation in African-American (AA) and European-American (EA) samples. They defined cessation as having last used opioids \(>12\) months before the interview and non-cessation as \(<6\) months before the interview, with intermediate cases excluded from the analysis. This practice is an oversight that leads to artificially improved performance, because the cases closest to the decision boundary are removed. A more reliable approach would be to use a single intermediate cut-off point for cessation. Still, this exclusion dropped only a few instances, and their SVM model reached an accuracy of 0.75 and 0.79 on AA and EA, respectively. They found that for AA, drug-related predictors were mostly cocaine-related, whereas in EA, stimulants were also relevant.
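The two labeling strategies discussed above can be contrasted in a short sketch. The thresholds of 6 and 12 months come from the text; the single cut-off of 9 months and the example values are hypothetical choices made here for illustration only.

```python
def label_with_exclusion(months_since_last_use):
    """Two-threshold rule described in the text: 1 for cessation (> 12 months),
    0 for non-cessation (< 6 months), None for the excluded band in between."""
    if months_since_last_use > 12:
        return 1
    if months_since_last_use < 6:
        return 0
    return None


def label_with_single_cutoff(months_since_last_use, cutoff_months=9):
    """Single-threshold alternative. cutoff_months=9 is a hypothetical
    midpoint, not a value taken from the study."""
    return 1 if months_since_last_use > cutoff_months else 0


# Hypothetical months since last opioid use for seven interviewees.
months = [1, 3, 7, 10, 12, 14, 30]

excluded = [m for m in months if label_with_exclusion(m) is None]
single_cutoff_labels = [label_with_single_cutoff(m) for m in months]

print("excluded under the two-threshold rule:", excluded)     # [7, 10, 12]
print("labels under a single cut-off:", single_cutoff_labels)  # [0, 0, 0, 1, 1, 1, 1]
```

The two-threshold rule discards exactly the borderline cases, which are the hardest to classify, whereas a single cut-off keeps every participant and forces the model to be evaluated on them.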

Severity

The assessment of SUD severity is essential for tailoring treatment approaches to individual patients’ needs. Clinically, the severity of SUD is quantified based on diagnostic criteria outlined in the DSM (American Psychiatric Association, 2022) and the ICD (World Health Organization, 2024). Similarly, craving severity is inherently tied to the individual’s difficulty in managing or reducing substance use despite a desire to do so. The heterogeneity in conceptualizing and measuring SUD severity presents challenges for both clinical practice and applying ML in SUD treatment research. Across the reviewed studies, varying approaches to quantifying severity are evident, with three studies concentrating on the severity of cravings (Heberle et al., 2024; Shrestha et al., 2023; Koban et al., 2023) and one on SUD severity (Suchting et al., 2019). This diversity underscores the complexity of SUD as a clinical condition and highlights the need for precise and adaptable metrics in ML applications.

Shrestha et al. (2023) employ a wrist-worn sensor in a relatively small cohort of 60 individuals who own smartphones, analyzing craving and stress. Despite the innovative approach and the several performed analyses, data from the same participants were included across different cross-validation folds, indicating data leakage. This artificially improved performance and the absence of the ML model description raise doubts regarding the study’s applicabi
