Machine learning and deep learning for classifying the justification of brain CT referrals

Justification is the cornerstone of radiation protection and is legally mandated in many countries [4], and it is likewise essential for the appropriate use of valuable healthcare resources while optimising patient diagnostic and treatment pathways. Current evidence, including the outcomes of our justification audit, points to substantial rates of inappropriate imaging, warranting further effort, especially in CT, given its large contribution to population dose. Sites A and B, the two private institutions included in our justification analysis, had lower rates of justified scans than the public site C. This raises questions about the vetting processes, particularly at site A, and supports claims that private facilities may prioritise a higher volume of outpatient scans for financial reasons, resulting in more unjustified imaging [23].

In general, the results of the study show that our ML- and DL-based methods can automate the iGuide justification analysis of CT referrals and generalise across multi-institutional data. The prediction models used the unstructured clinical indications as input, with the justification outcomes serving as target features for multi-class classification. More efficient approaches, involving ML- and DL-based NLP, can help streamline the current referral vetting process, which is manual, highly unstructured, costly, and, most importantly, inefficient. In addition, since prediction models can analyse thousands of referrals within seconds, large-scale retrospective auditing would become feasible for the majority of stakeholders. It is worth noting that the referrals included in our study pertain to CT scans that had already been performed and were vetted by referring clinicians and radiographers. The fact that ML and DL outperformed both the clinical staff and human experts in vetting referrals suggests that a “second pair of eyes” could prove beneficial when manually interpreting unstructured patient presentations.
Leveraging NLP and prediction models could, therefore, assist in justifying medical exposures, as well as in interpreting unstructured clinical indications. As a result, the implementation of retrospective, and potentially prospective, CDS would enable better integration of referral guidelines (e.g., iGuide) into clinical practice, ensuring consistency across clinical sites.
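To make the workflow described above concrete (free-text clinical indications in, one of three justification classes out), the pipeline can be sketched with a TF-IDF vectoriser feeding a standard classifier. The referral texts, labels, and choice of logistic regression below are illustrative assumptions for the sketch, not the study's actual data or final models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative referral texts and labels (invented, not study data).
referrals = [
    "acute headache after fall with loss of consciousness",
    "chronic headaches, no new features",
    "migraine with aura, worsening episodes",
    "collapse with head trauma, patient on anticoagulants",
    "long-standing tension-type headache",
    "query basilar-type migraine",
]
labels = [
    "justified", "unjustified", "potentially justified",
    "justified", "unjustified", "potentially justified",
]

# Unstructured text in, a multi-class justification label out.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigrams and bigrams
    LogisticRegression(max_iter=1000),
)
model.fit(referrals, labels)

# A new, unseen clinical indication is scored in milliseconds,
# which is what makes large-scale retrospective auditing feasible.
pred = model.predict(["headache after collapse and head injury"])[0]
```

In a retrospective-audit setting, `model.predict` would simply be applied to the full list of historical referrals at once.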

The structure, style, and quality of radiology referrals vary significantly amongst referrers and institutions [16, 24]. They may contain slang, uncommon abbreviations, misspellings, and either very brief or overly detailed clinical histories. Expressions that convey doubt or uncertainty, or that query multiple pathologies, are also present, making referrals ambiguous and complex to interpret. This variability was demonstrated during the retrospective justification audit, where both radiographers and consultants disagreed on a statistically significant portion of the referrals analysed. The ambiguous referrals in Table 2 illustrate this complexity and the possible interpretations. A first consideration is determining whether “collapse” corresponds to syncope/fainting with or without a transient or total loss of consciousness. Furthermore, the combination of headache with collapse makes the interpretation more challenging. iGuide contains three structured indications that are consistent with the patient presentation:

1. A headache due to collapse and subsequent head trauma—justified.
2. Migraine (basilar type)—no recommendations, potentially justified.
3. Altered level of consciousness without known cause—justified.

Likewise, in the second referral, it is unclear whether the CT scan was requested due to vertigo, presyncopal episodes, or a subjective altered facial sensation. The clinical history indicates that the patient had hypertension and hyperlipidaemia, suggesting a potential underlying stroke. Interestingly, in both cases, the radiographers disagreed, and so did the consultants.

Imbalanced datasets are a recognised problem in classification tasks, as class imbalance degrades classification performance [25]. Several considerations arise regarding the BoW- and TF-IDF-based models outperforming the Word2Vec and DL approaches. Given that our dataset was limited in size due to undersampling, our Word2Vec and DL models were more prone to overfitting, yet still achieved > 90% accuracy on the test set. MLPs with more than three hidden layers overfit the training data significantly. A similar observation was made with the Bi-LSTM classifier when building a deep architecture, indicating a natural limitation of our dataset. Initially, when developing the model on the imbalanced dataset of 2958 referrals, the bias towards the majority class needed to be addressed. Undersampling to the size of the second-largest class failed to yield significantly better results; further undersampling to the minority class overcame the bias. Although one reasonable approach to addressing class imbalance would be synthetic oversampling, there is no unanimity on the best, most efficient, and most representative technique for clinical text classification. For example, a word-swapping technique was used to produce synthetic radiology referrals [17], which lack context and are not sufficiently realistic. The authors of that study report nearly 100% training and test accuracy when classifying MRI lumbar spine referrals as justified or unjustified, which is difficult to achieve on real data. Similarly, the synthetic minority oversampling technique (SMOTE) can generate synthetic examples via interpolation between the text features of an original example and its random k-nearest neighbours [26]. SMOTE tends to be less effective when the feature distributions of the classes overlap, which is the case in unstructured referring practices, as discussed below. There are several other synthetic oversampling techniques demonstrating promising results, including generative adversarial networks [27]. However, their usability in the referring context and the similarity of their output to real data need to be investigated.
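The SMOTE interpolation step can be sketched in a few lines: each synthetic point lies on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is a minimal illustration on assumed toy data, not the imbalanced-learn implementation, and it uses numeric feature vectors as SMOTE would see them after text vectorisation.

```python
import numpy as np

def smote_sample(X, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = int(rng.integers(len(X)))          # pick a minority sample
        d = np.linalg.norm(X - X[i], axis=1)   # distances to all samples
        neighbours = np.argsort(d)[1:k + 1]    # k nearest, excluding itself
        j = int(rng.choice(neighbours))
        lam = rng.random()                     # interpolation factor in [0, 1]
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.vstack(synthetic)

# Toy minority class: three feature vectors forming a triangle.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_syn = smote_sample(X_min)
```

Because every synthetic point is a convex combination of two real minority samples, overlapping class distributions produce synthetic points inside the overlap region, which is why SMOTE struggles on ambiguous referral text.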

The BoW-based gradient-boosting classifier misclassified three justified referrals as unjustified. Only two referrals deemed justified in our undersampled dataset contained the word “diplopia”, and both were included in the test set. Additionally, only six referrals contained the word “seizures”, which is associated with all three classes, making it less discriminative; four of these were in the training set. This suggests that the model effectively assigned a label to the first referral at random. The same pattern was observed for the remaining two referrals, which had a combination of ambiguous features, highlighting the need for contextual predictions. For instance, a bruise from a fall typically does not require imaging; however, if the patient is on anticoagulation therapy, it is reasonable to consider the possibility of an intracranial bleed.

The prediction model classified one unjustified referral as potentially justified. In our dataset, the term “headaches” is often surrounded by terms suggesting long-term, migraine-like features, which potentially justify CT imaging. In this particular case, there was no additional information provided, leading the human raters to assume that the patient was experiencing chronic headaches without new features, which does not warrant imaging.

There was one unjustified referral classified as justified. The word “haematoma” is strongly associated with “cerebral” and “subdural” in our dataset and justifies CT imaging in nearly all cases. However, because static word embeddings lack context, the model failed to distinguish the nature of the haematoma; the term “superficial” appeared only once in our undersampled dataset.
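This limitation is easy to demonstrate with a toy example: a static embedding assigns “haematoma” the same vector regardless of its neighbours, so averaged phrase vectors for “subdural haematoma” and “superficial haematoma” remain nearly identical. The three-dimensional vectors below are invented for illustration; in practice they would come from a trained Word2Vec model.

```python
import numpy as np

# Invented 3-d "embeddings"; real ones would come from Word2Vec.
emb = {
    "haematoma":   np.array([0.9, 0.1, 0.0]),
    "subdural":    np.array([0.2, 0.8, 0.1]),
    "superficial": np.array([0.3, 0.7, 0.2]),
}

def phrase_vec(words):
    """Average the static word vectors -- context is lost in the mean."""
    return np.mean([emb[w] for w in words], axis=0)

v_deep = phrase_vec(["subdural", "haematoma"])
v_surface = phrase_vec(["superficial", "haematoma"])

# Cosine similarity of the two phrase vectors: close to 1 despite the
# phrases describing clinically very different findings.
cos = float(v_deep @ v_surface
            / (np.linalg.norm(v_deep) * np.linalg.norm(v_surface)))
```

A contextual model would instead produce different representations of “haematoma” in the two phrases, which is the motivation for the transfer-learning direction discussed below.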

In general, our study would have benefited from a larger, balanced, real dataset to demonstrate the advantages of DL and transfer learning. Ideally, model tuning should be performed on a validation set to guard against overfitting and ensure robustness. Nonetheless, all classifiers exhibited impressive overall accuracy and strong iGuide generalisation capabilities. As imaging referrals typically include only a brief synopsis of the patient’s clinical information, it can be difficult to understand the physician’s decision-making without a full review of the patient’s health record. Since our dataset lacked information on the source of the referrals, context comparisons were infeasible. The clinical sites involved request generic brain CT for related referrals, with specific protocolling (e.g., non-contrast, contrast-enhanced, angiography) performed by the radiology team on review of the provided clinical information; hence, training separate prediction models for specific types of brain CT was also infeasible.

Further work is needed to explore transfer learning in more detail and to train a large language model specific to the referring language, to showcase the superiority of contextual embeddings. Large-scale prediction-model training would address the gap that currently exists. Future research should also investigate the value of NLP in interpreting unstructured clinical indications. This would enhance collaboration between human radiologists, who contribute content knowledge and the ability to find near-optimal solutions, and AI, which provides knowledge based on the available training samples. Ideally, these samples should be numerous and diverse enough to cover all variants, with AI augmenting rather than replacing human experts [28]. It is vital to address practical principles prior to clinical implementation in order to overcome the ethical and legal issues associated with such applications [29]. Including other types of CT scans and other modalities would be the next reasonable step.

In conclusion, the overuse of diagnostic imaging, in particular CT, is resulting in a substantial number of unjustified examinations. ML and DL have the potential to streamline the justification process, generalise across multiple clinical sites, and assist referrers in choosing appropriate diagnostic imaging, making large-scale retrospective and prospective vetting of radiology referrals feasible. Consequently, the efficiency and uptake of existing support tools, such as clinical referral guidelines, can be improved through AI approaches.
