ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology

This study evaluated the diagnostic accuracy of GPT-4-based ChatGPT and GPT-4V-based ChatGPT in musculoskeletal radiology. The diagnostic accuracy of GPT-4-based ChatGPT (provided with the patient’s medical history and textual descriptions of the imaging findings) was significantly higher than that of GPT-4V-based ChatGPT (provided with the patient’s medical history and the images themselves). In the comparison with radiologists, GPT-4-based ChatGPT’s diagnostic accuracy was comparable to that of a radiology resident but lower than that of a board-certified radiologist, whereas GPT-4V-based ChatGPT’s diagnostic accuracy was significantly lower than that of both radiologists. The radiologists’ diagnostic accuracy improved with the assistance of GPT-4-based ChatGPT, but not with that of GPT-4V-based ChatGPT. In the per-category analysis, GPT-4-based ChatGPT’s accuracy rate for the final diagnosis was significantly lower in the tumor group than in the nontumor group. Within the tumor group, the accuracy rates for the final and differential diagnoses were higher for bone tumor cases than for soft tissue tumor cases, although the differences were not significant.

To the best of our knowledge, this study is the first in the field of musculoskeletal radiology to investigate the diagnostic capabilities of GPT-4- and GPT-4V-based ChatGPT and to compare them with radiologists’ performance. Although a previous study reported that GPT-3-based ChatGPT can generate coherent research articles in musculoskeletal radiology [20], no study has evaluated the diagnostic performance of GPT-4- and GPT-4V-based ChatGPT in this field. This study provides valuable insights into the strengths and limitations of ChatGPT as a diagnostic tool in musculoskeletal radiology.

While ChatGPT holds promise as a useful tool in musculoskeletal radiology, radiologists should understand its capabilities and exercise caution when incorporating it into clinical practice. This study demonstrated that the diagnostic accuracy of GPT-4-based ChatGPT was significantly higher than that of GPT-4V-based ChatGPT. This result indicates that GPT-4V-based ChatGPT’s ability to process images and extract imaging findings is insufficient. A recent study reported that GPT-4V-based ChatGPT exhibited limited interpretive accuracy in analyzing radiological images [25]. One factor contributing to the underperformance of GPT-4V-based ChatGPT may be insufficient training on medical images. OpenAI has stated that the current GPT-4V is unsuitable for interpreting medical images or replacing professional medical diagnoses because of its inconsistent performance [5]. To further improve the diagnostic accuracy of GPT-4V-based ChatGPT, techniques such as retrieval-augmented generation, fine-tuning with reinforcement learning from human feedback, and training vision models on a wide range of medical images should be explored [26]. Since textual input is the only feasible support option to date, providing an appropriate description of the imaging findings is crucial when utilizing ChatGPT as a diagnostic tool in clinical practice; a sketch contrasting the two input modes is given below. Regarding the comparison between ChatGPT and radiologists, GPT-4V-based ChatGPT’s diagnostic performance was significantly lower than that of the radiologists, and GPT-4-based ChatGPT’s diagnostic performance was comparable to that of a radiology resident but did not reach the level of a board-certified radiologist. ChatGPT may assist radiologists in the diagnostic process; however, it cannot replace their expertise and should be used only as an adjunct tool.
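
To make the two input modes concrete, the following is a minimal sketch of how a single case could be submitted to the OpenAI chat API as text (medical history plus a radiologist-written description of the imaging findings) versus as an image. This study used the ChatGPT interface rather than the API, and the model names, prompt wording, case details, and file path below are illustrative assumptions, not the study’s exact protocol.

```python
# Minimal sketch (not the study's exact protocol): submitting one case as
# text-only (GPT-4 route) versus with an image (GPT-4V route).
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical case details, for illustration only.
history = "A 14-year-old boy with knee pain and a growing mass above the knee."
findings = ("Radiographs show an aggressive osteolytic lesion of the distal femoral "
            "metaphysis with a wide zone of transition and periosteal reaction.")

# Text-based input: history plus a radiologist-written description of the findings.
text_reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (f"Patient history: {history}\n"
                    f"Imaging findings: {findings}\n"
                    "Provide the most likely diagnosis and two differential diagnoses."),
    }],
)
print(text_reply.choices[0].message.content)

# Image-based input: the model must extract the imaging findings itself.
with open("case_radiograph.png", "rb") as f:  # hypothetical image file
    image_b64 = base64.b64encode(f.read()).decode()

vision_reply = client.chat.completions.create(
    model="gpt-4-vision-preview",  # GPT-4V model name in the API at the time of writing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": (f"Patient history: {history}\n"
                      "Based on the attached radiograph, provide the most likely "
                      "diagnosis and two differential diagnoses.")},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(vision_reply.choices[0].message.content)
```

In this sketch, the quality of the `findings` text dominates the text-based route, which is consistent with the study’s conclusion that a careful radiologist-written description of the imaging findings is the key to ChatGPT’s diagnostic performance.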

Although GPT-4-based ChatGPT alone cannot replace the expertise of radiologists, it can enhance diagnostic accuracy and assist radiologists in narrowing down differential diagnoses as part of the diagnostic workflow in musculoskeletal radiology. Furthermore, ChatGPT has been shown to provide valuable assistance to radiologists in various tasks, including supporting decision-making, determining imaging protocols, generating radiology reports, offering patient education, and writing medical publications [26, 27]. Implementing ChatGPT in radiological practice has the potential to streamline the diagnostic process, saving time and reducing radiologists’ workload, thereby increasing overall efficiency.

This study also revealed that the diagnostic accuracy of GPT-4-based ChatGPT may vary with the etiology of the disease; it was significantly lower in the tumor group than in the nontumor group. This lower diagnostic accuracy in neoplastic diseases could be attributed to the challenge of interpreting complex cases, given the wide variety of histopathological types and imaging findings [23, 28]. Rare neoplastic diseases may be particularly challenging for ChatGPT because of the limited literature and the lack of established typical imaging findings. Although no significant difference in diagnostic accuracy rates was observed between bone tumor and soft tissue tumor cases, bone tumor cases showed relatively higher accuracy rates. While benign and malignant soft tissue tumors often share overlapping imaging features [29], bone tumors have grading systems that allow the risk of malignancy to be assessed from their growth patterns [30, 31]. This distinction may be one of the factors contributing to the relatively higher differential diagnostic accuracy for bone tumors. On the other hand, the significantly higher accuracy rates for the final diagnosis in the nontumor group indicate that GPT-4-based ChatGPT may be particularly useful for diagnosing non-neoplastic diseases in musculoskeletal radiology. Among these, cases of congenital/developmental abnormality and dysplasia, traumatic disease, and anatomical variants showed relatively high final diagnostic accuracy, which may be attributable to characteristic keywords in the patients’ medical histories and imaging findings for these conditions.

This study had several limitations. First, ChatGPT’s diagnostic performance was assessed in the controlled environment of the “Test Yourself” cases, which may not represent the broader range of musculoskeletal radiology cases. This selection bias could limit the generalizability of the results and may not capture the full spectrum of diagnostic challenges encountered in real-world clinical practice. Second, the “Test Yourself” cases are a potential source of bias because they may have been included in ChatGPT’s training data, which could lead to an overestimation of its diagnostic accuracy. Third, this study used descriptions of imaging findings written by authors who were aware of the final diagnosis. This may have introduced a bias that inflates GPT-4-based ChatGPT’s diagnostic accuracy; further studies should evaluate its accuracy using descriptions written by radiologists blinded to the final diagnosis. Fourth, the radiologists’ ChatGPT-assisted diagnoses may also be biased, potentially leading to an overestimation of ChatGPT’s value as a diagnostic support tool. Fifth, a per-category analysis of GPT-4V-based ChatGPT’s diagnostic accuracy was not conducted because of the limited number of correct diagnoses, which restricts the statistical power of the analyses. Sixth, no statistical analysis of ChatGPT’s diagnostic accuracy across non-neoplastic etiologies was performed because of the limited number of cases. Finally, this study did not investigate hallucinations, a critical limitation of large language models [25, 26, 32]. Radiologists need to be aware of hallucinations when utilizing ChatGPT as a diagnostic tool in clinical practice, and further studies are needed to characterize them and to develop mitigation strategies for optimal utilization of ChatGPT.

In conclusion, this study evaluated the diagnostic accuracy of GPT-4-based and GPT-4V-based ChatGPT in musculoskeletal radiology. When GPT-4-based ChatGPT was given descriptions of imaging findings written by expert radiologists, its diagnostic performance was comparable to that of a radiology resident but did not reach the level of a board-certified radiologist. In contrast, GPT-4V-based ChatGPT, which had to interpret the images itself, showed poor diagnostic performance. Since textual input is the only feasible support option to date, providing an appropriate description of the imaging findings is crucial when utilizing ChatGPT as a diagnostic tool in clinical practice. While ChatGPT may assist radiologists in narrowing down differential diagnoses and improving the diagnostic workflow, radiologists must understand its capabilities and limitations for optimal utilization.
