Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams

This study is the first to demonstrate that publicly available LLM-driven chatbots can consistently provide accurate responses to postgraduate Ophthalmology specialty examinations, achieving an accuracy of up to 82.9% without prompting or instruction tuning. This performance was largely independent of question topic and difficulty. Notably, most LLMs performed well enough to pass the high standards of these exams, which typically require a pass mark between 58% and 66% [10, 11]. Previous reports have shown that LLMs can achieve accuracies of up to 67.6% in generalist medical examinations with the use of different training data and instruction prompt tuning [7, 12].

We observed variation in response accuracy between LLM chatbots (Fig. 1), although each chatbot produced consistent accuracy across repeated iterations. Curated prompting strategies enhanced performance. LLMs demonstrated equal proficiency in answering basic science and clinical questions and performed similarly across difficulties and topics, except for Part 2 Cornea/External Eye questions, which were answered correctly 96% of the time (Table 1). This outlier may reflect differences in the training data used by each LLM, since our analyses accounted for question difficulty and characteristics. The limited number of officially available questions precluded definitive topic-based comparisons (Supplementary Materials).
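To illustrate the kind of scoring this comparison relies on, the minimal Python sketch below computes per-topic accuracy over repeated chatbot attempts at multiple-choice questions. It is a hypothetical illustration only: the data records, topic labels, and variable names are invented for this example and are not drawn from our study materials.

```python
# Hypothetical sketch: summarising multiple-choice accuracy per topic
# across repeated chatbot iterations. All data below are invented.
from collections import defaultdict

# Each record: (topic, iteration, chatbot_answer, correct_answer)
responses = [
    ("Cornea/External Eye", 1, "B", "B"),
    ("Cornea/External Eye", 2, "B", "B"),
    ("Glaucoma", 1, "C", "A"),
    ("Glaucoma", 2, "A", "A"),
]

# topic -> [number correct, number attempted]
scores = defaultdict(lambda: [0, 0])
for topic, _iteration, given, truth in responses:
    scores[topic][0] += given == truth
    scores[topic][1] += 1

for topic, (correct, total) in scores.items():
    print(f"{topic}: {correct}/{total} = {correct / total:.1%}")
```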

Our study has broad implications for the field of Ophthalmology, where large-scale medical AI models are being developed to aid clinical decision-making through free-text explanations, spoken recommendations, or image annotations [2]. LLMs exceeded the pass standards of our specialist examinations, raising questions about the adequacy of traditional assessments in measuring clinical competence. Alternative assessment methods, such as simulations or objective structured clinical examinations, may be needed to better capture the multifaceted skills and knowledge required for clinical practice.

Medical AI technology has great potential, but it also presents limitations and challenges. Clinicians may hold AI systems to a high standard of accuracy, creating barriers to effective human-machine collaboration. Responsibility for the answers generated by these technologies in a clinical setting is unclear; our testing revealed that LLMs could provide incorrect explanations and answers without the ability to recognise their own limitations [6]. Additionally, the use of LLMs for clinical purposes is restricted by inherent biases in the underlying data and algorithms, raising major concerns [2, 6]. Ensuring the explainability of AI systems is one potential solution to this problem and a promising research direction. Issues related to validation, computational expense, data procurement, and accessibility must also be addressed [2].

AI systems will become increasingly integrated into online learning and clinical practice, highlighting the need for ophthalmologists to develop AI literacy. Future research should focus on building open-access LLMs trained specifically with truthful Ophthalmology data to improve accuracy and reliability. Overall, LLMs offer significant opportunities to advance ophthalmic education and care.
