Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study


Introduction

Nursing licensure examinations are essential for maintaining professional standards, ensuring that health care systems are staffed with qualified professionals, and safeguarding patient safety []. These examinations assess nurses’ clinical judgment, decision-making, and practical skills, ensuring high-quality care and fostering public trust in the profession []. Upholding rigorous standards is critical, as competent health care professionals are crucial for addressing the diverse and complex needs of patients worldwide. The Chinese National Nursing Licensing Examination (CNNLE) plays an important role in maintaining high standards of nursing care in China, ensuring that graduates are well prepared for professional practice []. Serving as a benchmark for nursing competence, the CNNLE confirms that nurses possess the necessary skills and knowledge to provide safe and effective care []. Beyond its impact on health care quality, the CNNLE also influences educational policies, guiding nursing curricula to meet evolving health care demands. As health care becomes more complex, innovative tools are needed to support the development of skilled professionals capable of providing effective patient care.

The integration of artificial intelligence (AI) in education is transforming learning and assessment, particularly in fields such as nursing [-]. ChatGPT, an AI tool that generates content by identifying patterns in its training data, simulates human-like conversations and answers questions across a wide range of topics [,]. Its ability to provide correct answers and offer immediate, detailed feedback makes it a valuable resource for students in simulated test environments and question banks []. This success in examination settings has sparked interest in using ChatGPT as a self-learning tool, suggesting its potential for enhancing examination preparation and knowledge development []. Large language models (LLMs) hold promise for clinical education [], where these models integrate natural language processing with user-friendly interfaces []. In clinics, LLMs are increasingly valuable, particularly in diagnosis [,] and clinical licensing examinations [], where accuracy is crucial. Tools such as ChatGPT are being recognized for their potential to enhance clinical documentation [], improve diagnostic accuracy [], and streamline patient care workflows []. However, the rapid development of LLMs presents significant challenges in assessing their reliability in the CNNLE.

Passing the CNNLE demands not only theoretical knowledge but also clinical decision-making, critical thinking, and practical skills, areas where LLMs often underperform []. While tools such as ChatGPT have demonstrated an overall accuracy of 80.75% in nursing education [], their effectiveness diminishes with complex, context-specific questions requiring nuanced medical knowledge [-]. Moreover, concerns regarding patient privacy [] and biases [] in LLM outputs raise questions about their suitability for high-stakes assessments such as the CNNLE, which emphasize fairness and accuracy [,]. Despite the growing interest in LLMs for medical education, their potential in the CNNLE remains unexplored. Limited understanding exists regarding their ability to handle clinical reasoning, contextual interpretation, and multistep problem-solving in this specific setting. Addressing this gap is crucial to assess their reliability, limitations, and transformative potential in clinical education. Here, this study examines the distribution of question types in the CNNLE from 2019 to 2023 and evaluates the accuracy of 7 LLMs (GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5) in addressing domain-specific nursing knowledge and clinical decision-making. Furthermore, the study explores whether combining their outputs through machine learning techniques can enhance overall accuracy in this context.


Methods

Study Design

This retrospective cross-sectional study evaluated the performance of 7 LLMs on 1200 multiple-choice questions (MCQs) from the CNNLE administered between 2019 and 2023. This design was chosen for its suitability for systematically analyzing preexisting datasets and characterizing the capabilities of LLMs across various question types and levels of complexity. A head-to-head evaluation approach was adopted to compare the LLMs. Each MCQ was independently input into each model under identical conditions, ensuring consistency and fairness in the assessment. This parallel evaluation minimized variability caused by external factors, such as differences in question formats or content, allowing for a direct comparison of performance across all models. By drawing on historical data and a head-to-head evaluation, this study provides an analysis of LLM performance in nursing licensure examinations and insight into their potential applications in nursing education and assessment.

Data Collection

This study analyzed all 1200 MCQs from the CNNLE administered between 2019 and 2023. Each year, 240 MCQs were included, encompassing the 4 question types (A1, A2, A3, and A4), although their proportions varied annually. This comprehensive approach ensured that the evaluation covered diverse question formats and varying levels of complexity, reflecting the full scope of the CNNLE. To ensure the integrity of the evaluation process, 2 researchers (SZ and WH) independently entered each question into the 7 LLMs on separate computers. Each question was input into a new chat session to prevent any influence from prior interactions. The LLMs generated answers and explanations solely based on the input questions, without pretraining instructions or additional prompts.

If inconsistencies were detected in the responses, a third computer was used to reenter the question in a fresh chat session after clearing the LLMs’ memory. In such cases, the models were instructed to provide more detailed explanations. The researchers then collectively reviewed the answers and explanations to determine the most accurate and contextually appropriate response. When LLMs exhibited confusion, failed to provide explanations, produced multiple answers including the correct one, or encountered specialized queries (eg, questions on local policies), additional instructions were provided. These instructions included prompts such as, “This is a single-choice question. Please select the most suitable or probable answer from options 1 to 5,” “Please choose the incorrect option,” “Tell me the reason why,” “In Chinese local policy,” “In Chinese local law,” and “In Chinese society.” All data generated or analyzed during this study are provided in , and the iPython Jupyter notebook code is available in .

Ethical Considerations

The evaluations were conducted between May 15, 2024, and July 17, 2024. All responses were cross-verified against the official CNNLE answer keys. These measures enhanced the reliability and validity of the evaluation process. As this study was purely analytical and did not involve human participants, institutional review board approval and informed consent were not required. All collected data were fully anonymized by removing names, contact details, and other direct identifiers, ensuring no means to reidentify participants.

Measurements

The CNNLE

The CNNLE [] comprises 2 sections: Professional Practice and Practical Skills, each with 120 questions per unit. The Professional Practice section evaluates a candidate’s ability to implement nursing-related knowledge in clinical settings in a safe and effective manner. It covers medical knowledge related to health and disease, basic nursing skills, and the application of social and humanistic knowledge in nursing practice. The Practical Skills section assesses candidates’ capability to apply nursing knowledge and skills in performing nursing tasks. Topics include clinical manifestations of diseases, treatment principles, health assessment, nursing procedures, professional nursing techniques, and health education. The examination format involves objective questions presented in a computer-based format.

The examination includes 4 question types: A1, A2, A3, and A4, all of which are MCQs. A1 and A2 questions are relatively straightforward, focusing on single knowledge points and brief clinical case summaries, respectively. A3 and A4 questions involve shared clinical scenarios, requiring candidates to analyze and synthesize information comprehensively. A3 questions present 2-3 distinct, patient-centered clinical situations, while A4 questions depict more complex scenarios involving a single patient or family, with 4-6 independent questions that may introduce new information sequentially to test clinical integration skills.

LLM Selection

We selected 7 LLMs: GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5. This diverse selection enabled a comprehensive examination of LLM performance under standardized conditions. GPT-3.5, developed by OpenAI and released in March 2022, is known for generating coherent and contextually relevant text. GPT-4.0, released by OpenAI in March 2023, offers significant improvements in accuracy and understanding. GPT-4o, introduced in May 2024, is an optimized version of GPT-4.0, designed for enhanced performance. ERNIE Bot-3.5, created by Baidu and released in June 2023, is tailored for understanding and generating text in Chinese. SPARK, developed by iFLYTEK and launched in May 2023, provides intelligent assistance across a range of productivity tasks. Qwen-2.5, created by Alibaba and launched in May 2024, is optimized for complex language understanding, particularly in shopping and customer support contexts. To ensure consistency and reliability, each question was submitted only once in a new chat session with each LLM, using 2 different computers. This approach aimed to evaluate each model's performance in realistic conditions without the influence of prior responses.

Machine Learning Models

We selected 9 machine learning models, each with recognized performance in classification tasks. Logistic Regression (LR) [] is a fundamental linear model used for binary classification because of its simplicity and interpretability. Support Vector Machine (SVM) [] excels in high-dimensional and complex settings, providing robust classification performance. Multilayer Perceptron (MLP) [] is a neural network model that captures complex patterns through its layered structure. The k-nearest neighbors (KNN) [] algorithm is a straightforward, nonparametric supervised learning method that classifies or predicts data points based on their proximity to neighboring points and is widely recognized for its simplicity and effectiveness in both classification and regression tasks.

Ensemble models improve predictive performance by combining multiple base models to reduce overfitting and improve generalization. Random Forest (RF) [] is valued for its high accuracy and resistance to overfitting, aggregating the predictions of decision trees by majority vote to strengthen predictive robustness. Light Gradient-Boosting Machine (LightGBM) [] is a highly efficient gradient boosting framework that uses a histogram-based technique to bin continuous features, accelerating training, reducing memory usage, and handling large datasets with notable speed. Adaptive Boosting (AdaBoost) [] focuses on hard-to-classify cases, improving classification accuracy by iteratively adjusting sample weights to strengthen the model. Extreme Gradient Boosting (XGBoost) [], a gradient boosting system developed by Chen, iteratively refines models by splitting tree nodes and fitting residuals, demonstrating strong scalability and performance across diverse applications. CatBoost [], introduced in 2018, is a gradient boosting algorithm known for its effective handling of categorical features, reduced training times, and use of a greedy strategy to identify optimal tree splits, thereby improving predictive accuracy.
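
For illustration, the following minimal sketch shows how the 9 classifiers described above could be instantiated in Python with scikit-learn, LightGBM, XGBoost, and CatBoost; the hyperparameter values are placeholders, not the tuned settings used in this study.

```python
# Hypothetical mapping of the 9 classifiers to library implementations.
# Hyperparameters are illustrative defaults, not the study's tuned values.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),  # probability estimates needed for AUC curves
    "MLP": MLPClassifier(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=200),
    "LightGBM": LGBMClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
    "CatBoost": CatBoostClassifier(verbose=0),
}
```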

Statistical Analysis

The statistical analysis was conducted using Python 3.11.5 (Python Software Foundation) within the Microsoft Visual Studio Code environment. In preparing the dataset, responses where the LLMs failed to provide any answer were categorized as missing values and coded as –1. Valid responses (A, B, C, D, and E) were numerically encoded as 1, 2, 3, 4, and 5, respectively. To prepare the data for machine learning algorithms, the dataset was normalized by scaling all features to a range between 0 and 1 using the MinMaxScaler from the Scikit-learn library. Descriptive statistics were used to analyze the distribution of question types within the CNNLE dataset from 2019 to 2023. Accuracy percentages for the LLMs were computed across the 2 subjects and 4 question types. Machine learning models were then applied with the objective of enhancing predictive performance.
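
As a concrete illustration of this preprocessing, the sketch below assumes a hypothetical CSV of per-question LLM answers (the file name and column names such as "GPT-3.5" and "Answer" are placeholders) and applies the encoding and MinMaxScaler normalization described above.

```python
# Minimal preprocessing sketch; file and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

letter_to_code = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
llm_columns = ["GPT-3.5", "GPT-4.0", "GPT-4o", "Copilot",
               "ERNIE Bot-3.5", "SPARK", "Qwen-2.5"]

df = pd.read_csv("cnnle_responses.csv")  # hypothetical file of LLM answers per question

# Encode valid responses A-E as 1-5 and code missing answers as -1.
features = df[llm_columns].apply(lambda col: col.map(letter_to_code)).fillna(-1)
labels = df["Answer"].map(letter_to_code)

# Scale all features to the [0, 1] range with MinMaxScaler.
X = MinMaxScaler().fit_transform(features)
y = labels.to_numpy()
```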

Nine machine learning models, including LR, SVM, MLP, KNN, RF, LightGBM, AdaBoost, XGBoost, and CatBoost, were trained specifically for this task using the processed CNNLE dataset. None of the models were pretrained; instead, they were trained and optimized using hyperparameter tuning tailored to the dataset. For instance, parameters such as the number of trees and maximum depth were adjusted for RF, while learning rates and boosting parameters were optimized for LightGBM and XGBoost. The leave-one-out cross-validation method was used to ensure robustness and reliability. The dataset was split into training (90%) and testing (10%) sets, with the training set further divided into 9 subsets for hyperparameter tuning. This iterative process was repeated until each subset served as a validation set, minimizing overfitting and ensuring robust performance metrics for the models.
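
A minimal sketch of this split-and-tune procedure, shown here for XGBoost under assumed grid values (the exact hyperparameter grids are not reported above), could look as follows; it reuses X and y from the preprocessing sketch.

```python
# Sketch of the 90/10 split with 9-fold tuning on the training set; grid values are assumptions.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y
)

# XGBoost expects 0-based class labels, so re-encode the 1-5 answer codes.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.2],
}

# cv=9 mirrors the division of the training set into 9 subsets for tuning.
search = GridSearchCV(
    XGBClassifier(eval_metric="mlogloss"),
    param_grid,
    cv=9,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train_enc)
best_model = search.best_estimator_
print("Test accuracy:", best_model.score(X_test, y_test_enc))
```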

Model performance was assessed using correlation heatmaps, areas under the curve (AUCs), and 7 evaluation metrics: AUC, sensitivity, specificity, F1-score, accuracy, positive predictive value (PPV), and negative predictive value (NPV). Feature importance was analyzed using Shapley Additive Explanations (SHAP), which quantifies the contributions of individual features. SHAP analysis focused on the relative contributions of the 7 LLMs, highlighting how each LLM's accuracy influenced overall predictions. The analysis used Python packages, including pandas 2.1.4, numpy 1.24.3, scikit-learn 1.3.0, scipy 1.11.4, catboost 1.2, LightGBM 4.1.0, seaborn 0.12.2, SHAP 0.42.1, and matplotlib 3.8.0.
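
The evaluation and SHAP steps could be sketched as follows for the tuned XGBoost model from the previous example (reusing X_test, y_test_enc, and llm_columns from the earlier sketches); macro-averaged one-vs-rest metrics are an assumption about how the per-class values were aggregated, and specificity and NPV would be derived from the per-class confusion matrices.

```python
# Evaluation and SHAP sketch; aggregation choices (macro averaging) are assumptions.
import shap
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)

metrics = {
    "AUC": roc_auc_score(y_test_enc, y_proba, multi_class="ovr", average="macro"),
    "Accuracy": accuracy_score(y_test_enc, y_pred),
    "Sensitivity": recall_score(y_test_enc, y_pred, average="macro"),
    "PPV": precision_score(y_test_enc, y_pred, average="macro", zero_division=0),
    "F1-score": f1_score(y_test_enc, y_pred, average="macro"),
}
print(metrics)

# SHAP summary bar plot: mean |SHAP value| per LLM, aggregated over the 5 classes.
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, feature_names=llm_columns, plot_type="bar")
```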


Results

Distribution of Question Types in the CNNLE Over the Years

Figure 1 illustrates the distribution of question types over the years in both sections of the CNNLE. Figure 1A depicts the Practical Skills section, where A1-type questions decreased from 86 in 2019 to 59 in 2023, while A2-type questions increased from 18 to 43; A3-type and A4-type questions showed smaller fluctuations. Figure 1B shows the Professional Practice section, where A1-type questions fell from 67 in 2019 to 55 in 2023, while A2-type questions increased from 33 to 45; A3-type questions remained relatively stable, with minor variations.

Figure 1. Distribution of question types in CNNLE Professional Practice and Practical Skills sections (2019-2023). (A) Distribution of question types in Practical Skills from 2019 to 2023. (B) Distribution of question types in Professional Practice from 2019 to 2023.

Accuracy of LLMs in Professional Practice

Table 1 presents the accuracy of the LLMs in the Professional Practice section from 2019 to 2023. In 2023, Qwen-2.5 achieved the highest accuracy (0.850), followed by ERNIE Bot-3.5 (0.808) and GPT-4o (0.783). GPT-4.0 consistently outperformed GPT-3.5 in all years, with scores of 0.725 and 0.492, respectively, in 2023. Copilot and SPARK also showed moderate performance improvements over time, reaching 0.775 and 0.692 in 2023. Across the 5 years, Qwen-2.5 demonstrated the best overall accuracy (0.875), followed by GPT-4o (0.803) and ERNIE Bot-3.5 (0.785).

Table 1. Accuracy of large language models in Professional Practice.

Year     GPT-3.5  GPT-4.0  GPT-4o  Copilot  ERNIE Bot-3.5  SPARK  Qwen-2.5
2023     0.492    0.725    0.783   0.775    0.808          0.692  0.850
2022     0.450    0.675    0.833   0.767    0.758          0.667  0.917
2021     0.517    0.683    0.817   0.725    0.783          0.650  0.900
2020     0.500    0.708    0.725   0.733    0.767          0.600  0.850
2019     0.550    0.758    0.858   0.583    0.808          0.600  0.858
Overall  0.502    0.710    0.803   0.717    0.785          0.642  0.875

Accuracy of LLMs in Practical Skills

Table 2 presents the accuracy of the LLMs in the Practical Skills section from 2019 to 2023. In 2023, Qwen-2.5 achieved the highest accuracy (0.908), followed by GPT-4o (0.833) and Copilot (0.792). GPT-4.0 and ERNIE Bot-3.5 both scored 0.775, showing steady improvement compared with earlier years. SPARK and GPT-3.5 performed moderately, with scores of 0.758 and 0.550, respectively. Over the 5 years, Qwen-2.5 consistently outperformed the other models, achieving the highest overall accuracy (0.903). GPT-4o followed with 0.810, while ERNIE Bot-3.5 ranked third with 0.777.

Table 2. Accuracy of large language models in Practical Skills.

Year     GPT-3.5  GPT-4.0  GPT-4o  Copilot  ERNIE Bot-3.5  SPARK  Qwen-2.5
2023     0.550    0.775    0.833   0.792    0.775          0.758  0.908
2022     0.467    0.692    0.800   0.792    0.792          0.675  0.850
2021     0.467    0.708    0.850   0.667    0.750          0.567  0.942
2020     0.475    0.642    0.783   0.592    0.800          0.700  0.933
2019     0.483    0.658    0.783   0.458    0.767          0.592  0.883
Overall  0.488    0.695    0.810   0.660    0.777          0.658  0.903

Accuracy of LLMs for Question Types

Table 3 indicates the accuracy of the LLMs across the 4 question types (A1, A2, A3, and A4) from 2019 to 2023. In 2023, Qwen-2.5 achieved the highest accuracy for A1, A2, and A3 questions (0.860, 0.909, and 0.853, respectively), while all models reached perfect accuracy (1.000) for A4 questions. GPT-4o consistently performed well across all question types, ranking second or third in accuracy. In 2022, Qwen-2.5 maintained high performance across A1, A2, and A3 questions (0.913, 0.810, and 0.963, respectively). From 2019 to 2021, Qwen-2.5 demonstrated steady improvements across all question types. Overall, Qwen-2.5 achieved the highest average accuracy (0.889), followed by GPT-4o (0.807) and ERNIE Bot-3.5 (0.781).

Table 3. Accuracy of large language models for 4 question types.

Question type  GPT-3.5  GPT-4.0  GPT-4o  Copilot  ERNIE Bot-3.5  SPARK  Qwen-2.5
2023
A1             0.526    0.684    0.789   0.763    0.763          0.719  0.860
A2             0.443    0.807    0.830   0.784    0.807          0.705  0.909
A3             0.647    0.794    0.794   0.824    0.824          0.765  0.853
A4             1.000    1.000    1.000   1.000    1.000          1.000  1.000
2022
A1             0.476    0.746    0.881   0.825    0.817          0.698  0.913
A2             0.405    0.582    0.671   0.696    0.671          0.620  0.810
A3             0.519    0.667    0.889   0.778    0.852          0.630  0.963
A4             0.500    0.750    1.000   0.875    0.875          0.875  0.875
2021
A1             0.467    0.660    0.820   0.667    0.727          0.600  0.927
A2             0.538    0.738    0.846   0.738    0.831          0.646  0.938
A3             0.524    0.762    0.857   0.714    0.857          0.571  0.810
A4             0.500    1.000    1.000   1.000    0.750          0.500  1.000
2020
A1             0.478    0.675    0.771   0.650    0.771          0.656  0.879
A2             0.492    0.695    0.763   0.695    0.814          0.678  0.915
A3             0.550    0.600    0.550   0.650    0.750          0.600  0.900
A4             0.500    0.750    1.000   0.750    1.000          0.250  1.000
2019
A1             0.490    0.680    0.804   0.503    0.791          0.601  0.869
A2             0.569    0.745    0.843   0.588    0.784          0.588  0.843
A3             0.556    0.778    0.861   0.500    0.778          0.583  0.917
Overall        0.495    0.703    0.807   0.688    0.781          0.650  0.889

Correlation Heatmap and AUC Curves Using Machine Learning

Figure 2 provides an analysis of the correlation heatmap and AUC curves for the machine learning models. Figure 2A presents the correlation heatmap, where Qwen-2.5 shows the highest correlation with correct answers (r=0.859), while GPT-3.5 shows the lowest correlation (r=0.402). Figure 2B illustrates the AUC scores for each machine learning model in the multiclass classification task. The models achieved the following AUC scores: LR (0.946), SVM (0.980), RF (0.976), KNN (0.930), MLP (0.973), LightGBM (0.963), AdaBoost (0.962), XGBoost (0.961), and CatBoost (0.970).

Figure 2. Correlation heatmap and AUC curves of machine learning models in CNNLE. (A) Correlation heatmap: The heatmap illustrates the relationships between different LLMs. The lower left displays numerical correlation values, while the upper right represents correlation magnitude through circle size. Color gradients range from blue (low correlation) to red (high correlation), providing a visual summary of metric interdependencies. (B) AUC curves: The AUC curves compare the performance of various machine learning and ensemble models, highlighting their classification accuracy across the data set. AdaBoost: Adaptive Boosting; AUC: area under the curve; CatBoost: Categorical Boosting; KNN: k-nearest neighbor; LightGBM: Light Gradient-Boosting Machine; LR: Logistic Regression; MLP: Multilayer Perceptron; RF: Random Forest; SVM: Support Vector Machine; XGBoost: Extreme Gradient Boosting.

Metrics for 5-Class Classification Using Machine Learning

Table 4 presents a comparative analysis of the 9 machine learning models for multiclass classification, evaluated by average metrics including AUC, accuracy, sensitivity, specificity, precision, PPV, F1-score, and NPV. Among these, the SVM and XGBoost models achieved AUC values of 0.980 and 0.961, along with accuracy scores of 0.858 and 0.908, respectively. In contrast, LR and KNN exhibited lower accuracy scores of 0.817 and 0.767.

Table 4. Metrics of machine learning.

Classifier  AUC^a  Accuracy  Sensitivity  Specificity  Precision  PPV^b  F1-score  NPV^c
LR^d        0.946  0.817     0.808        0.953        0.818      0.818  0.809     0.954
SVM^e       0.980  0.858     0.857        0.965        0.861      0.861  0.854     0.964
RF^f        0.976  0.858     0.860        0.965        0.856      0.856  0.854     0.964
KNN^g       0.930  0.767     0.772        0.942        0.787      0.787  0.768     0.941
MLP^h       0.973  0.825     0.823        0.957        0.830      0.830  0.819     0.956
LightGBM^i  0.963  0.900     0.908        0.975        0.895      0.895  0.899     0.974
AdaBoost^j  0.962  0.858     0.859        0.964        0.855      0.855  0.856     0.964
XGBoost^k   0.961  0.908     0.905        0.978        0.901      0.901  0.901     0.977
CatBoost^l  0.970  0.892     0.892        0.974        0.885      0.885  0.885     0.973

aAUC: area under the curve.

bPPV: positive predictive value.

cNPV: negative predictive value.

dLR: Logistic Regression.

eSVM: Support Vector Machine.

fRF: Random Forest.

gKNN: k-nearest neighbor.

hMLP: Multilayer Perceptron.

iLightGBM: Light Gradient-Boosting Machine.

jAdaBoost: Adaptive Boosting.

kXGBoost: Extreme Gradient Boosting.

lCatBoost: Categorical Boosting.

Importance Ranking of SVM and XGBoost Models

Figure 3 presents the SHAP summary bar plots for the SVM and XGBoost models. In Figure 3A, the SVM model ranks the features as follows: Qwen-2.5, ERNIE Bot-3.5, GPT-4o, Copilot, GPT-4.0, SPARK, and GPT-3.5. Figure 3B shows the importance ranking of the XGBoost model, with a slightly different order: Qwen-2.5, GPT-4o, ERNIE Bot-3.5, Copilot, GPT-3.5, SPARK, and GPT-4.0. Qwen-2.5 stands out as the most influential feature in both models. Furthermore, including the other LLMs enhances overall model performance, as evidenced by improvements in AUC and accuracy.

Figure 3. SHAP summary bar plot in Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost) models. (A) Importance ranking of SVM model. (B) Importance ranking of XGBoost model. SHAP: Shapley Additive Explanations.
Discussion

Principal Findings

This study is the first to evaluate the performance of 7 LLMs on the CNNLE dataset (2019-2023), highlighting significant advancements in Chinese LLM development and their applications in nursing education. Among the models tested, Qwen-2.5 demonstrated the highest accuracy (88.92%), significantly surpassing the performance of the other LLMs. This superior accuracy can be attributed to its training on an extensive Chinese dataset and optimized parameters, enabling it to handle domain-specific nursing knowledge and complex clinical decision-making tasks with exceptional precision. These results underline the growing feasibility of deploying advanced LLMs such as Qwen-2.5 to support standardized nursing examinations in China, offering consistent, scalable, and efficient assessments.

Our findings offer a clear pathway for the practical application of Qwen-2.5 in nursing curricula and professional training. For instance, Qwen-2.5 could serve as a virtual tutor, providing personalized feedback and explanations to nursing students in real time. Its ability to respond promptly and accurately to a wide range of questions makes it particularly valuable for addressing individual knowledge gaps and reinforcing complex concepts. Educators could incorporate Qwen-2.5 into classroom activities, using it to simulate clinical scenarios or evaluate students’ decision-making skills. Furthermore, mobile apps powered by Qwen-2.5 could allow nursing students to access high-quality, interactive learning resources anytime and anywhere, thereby enhancing accessibility and flexibility in education. Beyond supporting student learning, Qwen-2.5 and other LLMs can enhance professional development for practicing nurses. For instance, these models could be integrated into continuing education programs, where they act as interactive resources to update practitioners on the latest evidence-based practices. By serving as a knowledge repository, LLMs can enable nurses to quickly access relevant guidelines, ensuring timely and informed clinical decisions.

The results also extend prior research by demonstrating how ensemble machine learning methods can enhance LLM performance in specialized tasks. By integrating the outputs of 7 LLMs using the XGBoost algorithm, we achieved an improved accuracy of 90.83%, surpassing the best-performing single model. This novel application of ensemble methods highlights a promising direction for developing personalized LLMs tailored to specific domains, such as health care education. Previous studies, including those by Li et al [] and Brin et al [], have emphasized the value of context-specific tuning, but our research provides concrete evidence of the effectiveness of combining multiple models to enhance accuracy in domain-specific applications.

Furthermore, our study situates Chinese LLMs within the broader global landscape of AI. While skepticism has persisted regarding whether Chinese LLMs can rival models developed by OpenAI or Google, our results demonstrate that Qwen-2.5 not only excels on the CNNLE assessment but also outperforms other LLMs. For example, Qwen-2.5 achieved a higher accuracy than GPT-4 (72.5%) on similar standardized tests, as reported in previous studies []. This performance underscores the competitive edge of Chinese-developed models, particularly in addressing language-specific and cultural nuances in health care education.

Our findings also reveal recent trends in nursing education, such as the increasing complexity of standardized examinations such as the CNNLE, which has transitioned from straightforward A1-type questions to more analytical A2-type clinical case scenarios. This shift reflects the growing need for nursing professionals to develop higher-order reasoning skills and apply clinical knowledge to real-world situations. Qwen-2.5’s ability to process nuanced clinical scenarios positions it as an effective tool for addressing these demands. By integrating models such as Qwen-2.5 into nursing curricula, educators can better prepare students for the complexities of modern health care through scenario-based learning and real-time feedback.

Our study also demonstrates the broader impact of China’s advancements in AI, particularly through open access LLMs such as Qwen-2.5 and ERNIE Bot-3.5, which provide practical solutions for addressing regional disparities in nursing education. These models are especially valuable in regions where access to global LLMs, such as OpenAI’s GPT models, is restricted, as they deliver high-quality, localized educational content. By promoting the standardization of nursing education across institutions, these tools help bridge resource gaps and improve the overall quality of health care training in China. Furthermore, their ability to enable personalized and flexible learning through mobile apps empowers students to access educational resources anytime and anywhere. This adaptability positions Chinese LLMs as critical tools for advancing specialized education while addressing both regional inequalities and broader challenges in health care training.

Limitations

First, our evaluation relied on MCQs to assess the knowledge of LLMs, which may not fully capture their ability to handle open-ended or complex clinical tasks. Future studies could incorporate open-ended questions, clinical simulations, or case-based assessments to evaluate LLMs' reasoning and decision-making capabilities more comprehensively. These methods would better reflect the unstructured and nuanced scenarios encountered in real-world clinical practice, providing a deeper understanding of how LLMs process complex clinical information.

Second, the performance of LLMs can vary based on factors such as prompt design, the number of questions asked, and the context of those questions, introducing variability into results. To address this, standardized evaluation protocols should be developed to ensure consistency across benchmarking studies. Furthermore, future research could focus on refining prompt engineering techniques and optimizing model fine-tuning to improve accuracy and reliability in diverse clinical applications. These refinements could support the development of LLMs that are better suited to handling complex scenarios, such as differential diagnoses or multistep decision-making.

Third, while Qwen-2.5 demonstrated the highest accuracy on the CNNLE dataset, its optimization for the Chinese language and MCQ format may limit its generalizability to other medical domains and contexts. Future studies should evaluate its applicability in multilingual and open-ended settings to assess its effectiveness in tasks beyond standardized testing formats and within various health care contexts. To enhance the suitability of LLMs for specialized health care tasks, such as diagnostic reasoning and treatment planning, future research could prioritize the development of domain-specific models. This could involve fine-tuning LLMs on datasets that include detailed case histories, diagnostic pathways, and clinical protocols. Such datasets would allow the models to learn context-specific patterns and reasoning processes, equipping them to provide more accurate and relevant recommendations in clinical settings. Furthermore, fine-tuned models could be used to assist in treatment planning by integrating data from clinical guidelines, patient histories, and risk assessment tools to offer tailored suggestions for patient care. Addressing biases in LLM training is also essential for ensuring equitable decision-making across diverse patient populations. Researchers should consider incorporating fairness-aware algorithms and curated datasets that reflect demographic diversity to mitigate potential biases. Such efforts could ensure that domain-specific LLMs provide consistent and unbiased recommendations, particularly in high-stakes environments such as emergency department triage workflows.

Finally, this study focused on general purpose LLMs, excluding models explicitly trained for medical tasks, such as Gemini or Claude. Preliminary findings suggest that fine-tuned medical models may achieve superior accuracy for specific applications. Future research should conduct comparative evaluations of general purpose and domain-specific LLMs to identify the optimal approach for different health care needs. These studies could also assess whether fine-tuned models are more effective in real-time clinical workflows, such as triage systems, or in supporting complex decision-making across various medical specialties.

Conclusions

This study is the first to evaluate the performance of 7 LLMs on the CNNLE and shows that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training.

SZ was responsible for the conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing of the original draft, reviewing, editing the manuscript, and visualization. WH and ZY managed data curation and visualization. JY also handled data curation and contributed to reviewing and editing the manuscript. FZ supervised the project, contributed to reviewing and editing the manuscript, and managed project administration. All authors have read and approved the final version of the manuscript and agreed to be accountable for all aspects of the work, ensuring that any questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

None declared.

Edited by A Castonguay; submitted 27.06.24; peer-reviewed by Y Hirano, M Besler; comments to author 26.07.24; revised version received 06.08.24; accepted 20.12.24; published 10.01.25.

©Shiben Zhu, Wanqin Hu, Zhi Yang, Jiani Yan, Fang Zhang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 10.01.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
