Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study


Introduction

Since ChatGPT’s public release in November 2022, large language models (LLMs) have attracted great interest in medical imaging applications []. Research indicates that ChatGPT holds promise across various aspects of the medical imaging workflow. Even without radiology-specific pretraining, LLMs can pass board examinations [], provide radiology decision support [], assist in differential diagnosis [-], and generate impressions from findings or structured reports [-]. These applications not only accelerate the imaging diagnosis process and alleviate the workload of physicians but may also improve diagnostic accuracy []. However, limitations exist: 1 study found that ChatGPT-3 produced erroneous answers for a third of daily clinical questions and that about 63% of the references it provided could not be found []. ChatGPT’s dangerous tendency to produce inaccurate responses is less frequent in GPT-4 but still limits usability in medical education and practice at present []. Tailoring LLMs to radiology may enhance reliability, as an appropriateness criteria context-aware chatbot outperformed generic chatbots and radiologists [].

The American College of Radiology Reporting and Data Systems (RADS) standardizes the communication of imaging findings. As of August 2023, 9 disease-specific systems had been endorsed by the American College of Radiology, spanning products from lexicons to report templates []. RADS reduces terminology variability, facilitates communication between radiologists and referring physicians, allows consistent evaluations, and conveys clinical significance to improve care. However, complexity and unfamiliarity limit adoption, and efforts to broaden the implementation of RADS are therefore warranted. We conducted this study to evaluate LLMs’ capabilities on a focused RADS assignment task for radiology reports.

A prompt is a directive or instruction given to an LLM to elicit a particular response. The technique of “prompt tuning” has emerged as a valuable approach to refine the performance of LLMs, particularly for specific domains or tasks []. By providing structured queries or exemplary responses, chatbot output can be tailored toward accurate and relevant answers. Such prompt-tuning strategies leverage LLMs’ knowledge while guiding appropriate delivery for particular challenges []. Given the complexity and specificity of RADS categorization, our investigation emphasizes the impact of different prompts to assess chatbot capabilities and the potential for performance enhancement through refined prompt tuning.

In this study, our primary objective was to meticulously evaluate the performance of 3 LLMs (GPT-3.5, GPT-4, and Claude-2) for RADS categorization using different prompt-tuning strategies. We aimed to test their accuracy and consistency in RADS categorization and shed light on the potential benefits and limitations of relying on chatbot-derived information for the categorization of specific RADS.


Methods

Ethical Considerations

As the study was based on radiological data that were artificially generated by radiologists and did not involve the participation of human subjects, the study was determined to be exempt from ethical review, in accordance with the regulations established by the institutional review board of Henan Provincial People’s Hospital.

Study Design

The workflow of the study is shown in . We conducted a cross-sectional analysis in September 2023 to evaluate the competency of 3 chatbots—GPT-3.5, GPT-4 (OpenAI, August 30, 2023 version) [], and Claude-2 (Anthropic) []—in assigning 3 RADS categorizations to radiology reports. Given that the chatbots’ knowledge cutoff was September 2021, we selected Liver Imaging Reporting & Data System (LI-RADS) version 2018 [], Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022 [], and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging (MRI) (developed in 2022) [] as the reference standards against which the responses of GPT-3.5, GPT-4, and Claude-2 were compared. A total of 30 radiology reports for either CT or MRI examinations were composed for this analysis, with 10 cases representing each of the 3 RADS reporting systems. The test reports were written by radiologists with more than 10 years’ experience, who adapted the wording styles of real-life cases to the respective RADS systems. For each RADS (ie, LI-, Lung-, and O-RADS), we attempted to reflect clinical complexity and diversity so that the reports covered typical cases encountered in practice; accordingly, 2-3 simple cases and 7-8 challenging cases were generated for each RADS. The challenging scenarios included comparison with prior examinations, the presence of multiple nodules, extensive categorization under different RADS systems, and updates from the most recent LI-RADS and Lung-RADS guidelines. The characteristics of the radiology reports for each RADS and the distribution of the number of reports across the 3 RADS are shown in . The objective was to evaluate the performance of chatbots on a highly structured radiology workflow task involving cancer risk categorization based on structured report inputs. The study design focused on a defined use case to illuminate the strengths and limitations of existing natural language-processing technology in this radiology subdomain.

Figure 1. Flowchart of the study design. CT: computed tomography; LI-RADS: Liver Imaging Reporting & Data System; Lung-RADS: Lung CT Screening Reporting & Data System; MRI: magnetic resonance imaging; O-RADS: Ovarian-Adnexal Reporting & Data System; RADS: Reporting and Data Systems.

Prompts

We collected and analyzed responses from GPT-3.5, GPT-4, and Claude-2 for each case. To mitigate bias, the radiological findings were presented individually via separate interactions, with the corresponding responses saved for analysis. Three prompt templates were designed to elicit each RADS categorization along with an explanatory rationale:

Prompt-0 was a zero-shot prompt, merely introducing the RADS assignment task, such as “Your task is to follow Lung-RADS version 2022 guideline to give Lung-RADS category of the radiological findings delimited by angle brackets.”

Prompt-1 was a few-shot prompt, furnishing an exemplar of RADS categorization including the reasoning, summarized impression, and final category. The following is an example:

Your task is to follow Lung-RADS version 2022 guideline to give Lung-RADS category of the radiological findings delimited by angle brackets.
"""
<…Radiological Findings…>
Answer:
Rationale:
Overall:
Summary:
Lung-RADS Category: X
"""

Prompt-2 explicitly instructed the chatbots to consult the PDF of the corresponding RADS guideline, compensating for their lack of radiology-specific pretraining. For Claude-2, the PDF could be ingested directly, whereas GPT-4 required the “Ask for PDF” plug-in to extract the pertinent information [,].
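To make the three prompt levels concrete, the sketch below assembles them in Python. Only the zero-shot wording quoted above is taken from the study; the few-shot skeleton follows the Answer/Rationale/Overall/Summary structure described for prompt-1, and the prompt-2 PDF instruction is an illustrative assumption rather than the study’s verbatim text.

```python
# Minimal sketch of the three prompt levels. Only the zero-shot wording is
# quoted from the study; the few-shot skeleton and the prompt-2 PDF
# instruction are illustrative assumptions.

ZERO_SHOT = (
    "Your task is to follow {guideline} to give {rads} category of the "
    "radiological findings delimited by angle brackets.\n"
    "<{findings}>"
)

FEW_SHOT = (
    "Your task is to follow {guideline} to give {rads} category of the "
    "radiological findings delimited by angle brackets.\n"
    '"""\n'
    "<...Radiological Findings...>\n"
    "Answer:\n"
    "Rationale:\n"
    "Overall:\n"
    "Summary:\n"
    "{rads} Category: X\n"
    '"""\n'
    "<{findings}>"
)

def build_prompt(level, guideline, rads, findings):
    """Assemble prompt-0, prompt-1, or prompt-2 for a single report.
    Prompt-2 reuses the few-shot text and additionally asks the model to
    consult the attached guideline PDF (ingested directly by Claude-2, or
    via a PDF plug-in for GPT-4)."""
    template = ZERO_SHOT if level == 0 else FEW_SHOT
    prompt = template.format(guideline=guideline, rads=rads, findings=findings)
    if level == 2:
        prompt = ("Consult the attached official {rads} guideline PDF when "
                  "assigning the category.\n".format(rads=rads) + prompt)
    return prompt

# Example: zero-shot prompt for a Lung-RADS report
print(build_prompt(0, "Lung-RADS version 2022 guideline", "Lung-RADS",
                   "Solid nodule measuring 12 mm in the right lower lobe..."))
```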

Each case was evaluated 6 times with each chatbot across the 3 prompt levels. The representative radiological reports and prompts are shown in . The links to all the prompts and guideline PDFs are shown in .

Evaluation of Chatbots

Two study authors (QW and HL) independently evaluated each chatbot response in a blinded manner, with any discrepancies resolved by a third senior radiologist (YW). The following were assessed for each response:

Patient-level RADS categorization: judged as correct, incorrect, or unsure. “Correct” denotes that the chatbot accurately identified the patient-level RADS category, irrespective of the rationale provided. “Unsure” denotes that the chatbot’s response failed to provide a decisive RADS category. For example, a response articulating that “a definitive Lung-RADS category cannot be assigned” would be categorized as “unsure.”

Overall rating: assessed as either correct or incorrect. A response was judged incorrect if any of the following errors (Es) were identified:

E1: a factual extraction error, denoting the chatbot’s inability to paraphrase the radiological findings accurately, consequently misinterpreting the information.

E2: hallucination, encompassing the fabrication of nonexistent RADS categories (E2a) and RADS criteria (E2b).

E3: a reasoning error, which includes the incapacity to logically interpret the imaging description (E3a) or the RADS category (E3b) accurately. The subtype errors for reasoning about the imaging description include the inability to reason about lesion signal (E3ai), lesion size (E3aii), and enhancement (E3aiii) accurately.

E4: an explanatory error, encompassing inaccurate elucidation of the RADS category meaning (E4a) and erroneous explanation of the recommended management and follow-up corresponding to the RADS category (E4b).

If a chatbot’s response manifested any of the aforementioned errors, it was labeled as incorrect, and the specific type of error was documented. To assess the consistency of the evaluations, a k-pass voting method was also applied: a case was deemed accurately categorized if it met the criteria in at least 4 of the 6 runs.
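A minimal sketch of these two summary statistics, assuming each case is stored with six per-run correctness flags (the data layout and names are illustrative, not taken from the study’s own tooling):

```python
# Minimal sketch of the two accuracy summaries used per chatbot-prompt pair;
# the data layout (a dict of six per-run correctness flags per case) is an
# illustrative assumption.
from statistics import mean

def average_accuracy(runs_correct):
    """Mean of the 6 per-run accuracies across the cases ("average" rows)."""
    n_runs = 6
    per_run = [mean(flags[r] for flags in runs_correct.values())
               for r in range(n_runs)]
    return mean(per_run)

def k_pass_accuracy(runs_correct, k=4):
    """Fraction of cases judged correct in at least k of the 6 runs."""
    return mean(sum(flags) >= k for flags in runs_correct.values())

# Toy example with 2 cases and 6 runs each
toy = {
    "case_01": [True, True, True, True, True, False],    # 5/6 correct -> passes
    "case_02": [True, False, False, True, False, False],  # 2/6 correct -> fails
}
print(average_accuracy(toy))  # ~0.58
print(k_pass_accuracy(toy))   # 0.5
```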

Statistical Analyses

The accuracy of the patient-level RADS categorization and the overall rating for each chatbot was compared using the chi-square test. Agreement across the 6 repeated runs was assessed using the Fleiss κ, with agreement strength interpreted as follows: <0, poor; 0-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, substantial; and 0.81-1, almost perfect. Statistical significance was defined as 2-sided P<.05. All analyses were performed using R statistical software (version 4.1.2; The R Foundation).
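The study’s analyses were run in R; for illustration, an equivalent sketch in Python using scipy and statsmodels is shown below, with placeholder counts and ratings rather than the study’s data.

```python
# Equivalent sketch in Python (the study itself used R 4.1.2); all counts and
# ratings below are fabricated placeholders for illustration only.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Chi-square test comparing correct vs incorrect overall ratings between two
# prompt levels for one chatbot, laid out as a 2x2 contingency table.
table = np.array([[24, 156],   # prompt-0: correct, incorrect (placeholder)
                  [102, 78]])  # prompt-2: correct, incorrect (placeholder)
chi2, p_value, dof, expected = chi2_contingency(table)
odds_ratio = (table[1, 0] / table[1, 1]) / (table[0, 0] / table[0, 1])
print(f"chi-square={chi2:.2f}, P={p_value:.4f}, OR={odds_ratio:.2f}")

# Fleiss kappa for agreement across the 6 repeated runs: one row per case,
# one column per run, categorical ratings (eg, 1=correct, 0=incorrect).
ratings = np.random.default_rng(0).integers(0, 2, size=(30, 6))  # placeholder
counts, _ = aggregate_raters(ratings)  # cases x categories count matrix
print(f"Fleiss kappa={fleiss_kappa(counts, method='fleiss'):.2f}")
```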


Results

Performance of Chatbots

The performance of the chatbots is shown in and and , with links to case-level details provided in . For the overall rating (, average row and A), Claude-2 with prompt-2 demonstrated significantly higher average accuracy across the 30 cases than Claude-2 with prompt-0 (odds ratio [OR] 8.16; P<.001). GPT-4 with prompt-2 also showed improved average accuracy compared with GPT-4 with prompt-0, but the difference was not statistically significant (OR 3.19; P=.13). When using the k-pass voting method (, k-pass voting row), Claude-2 with prompt-2 had significantly higher accuracy than Claude-2 with prompt-0 (OR 8.65; P=.002). Similarly, GPT-4 with prompt-2 was significantly more accurate than GPT-4 with prompt-0 (OR 11.98; P=.01). For the exact assignment of the patient-level RADS categorization (, average row and B), Claude-2 with prompt-2 showed significantly higher average accuracy than Claude-2 with prompt-0 (P=.04).

Figure 2. Bar graphs show the comparison of chatbot performance across 6 runs regarding (A) overall rating and (B) patient-level Reporting and Data Systems categorization.

Table 1. Correct overall ratings of different chatbots and prompts.

Chatbots and prompts | Prompt-0, n (%; 95% CI) | Prompt-1, n (%; 95% CI) | Prompt-2, n (%; 95% CI)
GPT-3.5
Run 1 | 3 (10; 3-28) | 9 (30; 15-50) | N/Aa
Run 2 | 3 (10; 3-28) | 9 (30; 15-50) | N/A
Run 3 | 4 (13; 4-32) | 7 (23; 11-43) | N/A
Run 4 | 4 (13; 4-32) | 5 (17; 6-35) | N/A
Run 5 | 3 (10; 3-28) | 6 (20; 8-39) | N/A
Run 6 | 3 (10; 3-28) | 4 (13; 4-32) | N/A
Averageb | 3 (10; 3-28) | 7 (23; 11-43) | N/A
K-pass votingc | 1 (3; 0-19) | 2 (7; 1-24) | N/A
GPT-4
Run 1 | 4 (13; 4-32) | 11 (37; 21-56) | 12 (40; 23-59)
Run 2 | 4 (13; 4-32) | 7 (23; 11-43) | 8 (27; 13-46)
Run 3 | 4 (13; 4-32) | 9 (30; 15-50) | 9 (30; 15-50)
Run 4 | 2 (7; 1-24) | 9 (30; 15-50) | 13 (43; 26-62)
Run 5 | 5 (17; 6-35) | 11 (37; 21-56) | 8 (27; 13-46)
Run 6 | 6 (20; 8-39) | 9 (30; 15-50) | 8 (27; 13-46)
Averageb | 4 (13; 4-32) | 9 (30; 15-50) | 10 (33; 18-53)
K-pass votingc | 1 (3; 0-19) | 6 (20; 8-39) | 9 (30; 15-50)d
Claude-2
Run 1 | 4 (13; 4-32) | 10 (33; 18-53) | 19 (63; 44-79)
Run 2 | 5 (17; 6-35) | 8 (27; 13-46) | 16 (53; 35-71)
Run 3 | 5 (17; 6-35) | 7 (23; 11-43) | 15 (50; 33-67)
Run 4 | 5 (17; 6-35) | 6 (20; 8-39) | 17 (57; 38-74)
Run 5 | 3 (10; 3-28) | 7 (23; 11-43) | 18 (60; 41-77)
Run 6 | 3 (10; 3-28) | 7 (23; 11-43) | 14 (47; 29-65)
Averageb | 4 (13; 4-32) | 8 (27; 13-46) | 17 (57; 38-74)d
K-pass votingc | 3 (10; 3-28) | 7 (23; 11-43) | 15 (50; 33-67)d

aN/A: not applicable.

bAccuracy by the average method.

cAccuracy by k-pass voting (≥4/6 runs correct).

dSignificant difference between prompt-0 and prompt-2.

Table 2. The number of correct, incorrect, and unsure responses for patient-level Reporting and Data Systems categorization across different chatbots and prompts. Values are correct/incorrect/unsure patient-level Reporting and Data Systems categories (n/n/n).

Chatbots and prompts | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Averagea | K-pass votingb
GPT-3.5
Prompt-0 | 7/23/0 | 7/23/0 | 7/23/0 | 9/21/0 | 8/21/1 | 8/20/2 | 8/22/0 | 7/23/0
Prompt-1 | 13/15/2 | 11/19/0 | 8/21/1 | 8/21/1 | 11/19/0 | 8/22/0 | 10/20/0 | 7/23/0
GPT-4
Prompt-0 | 10/20/0 | 8/19/3 | 9/20/1 | 8/22/0 | 16/14/0 | 13/15/2 | 11/18/1 | 8/22/0
Prompt-1 | 15/14/1 | 10/18/2 | 11/18/1 | 14/15/1 | 15/14/1 | 12/18/0 | 13/16/1 | 11/19/0
Prompt-2 | 13/16/1 | 11/18/1 | 12/18/0 | 14/16/0 | 9/21/0 | 11/16/3 | 12/18/0 | 11/19/0
Claude-2
Prompt-0 | 13/17/0 | 12/18/0 | 12/18/0 | 15/15/0 | 10/20/0 | 9/21/0 | 12/18/0 | 13/17/0
Prompt-1 | 11/19/0 | 14/16/0 | 11/19/0 | 11/19/0 | 13/17/0 | 12/18/0 | 12/18/1 | 11/19/0
Prompt-2 | 21/9/0 | 21/9/0 | 20/10/0 | 22/8/0 | 21/9/0 | 21/8/1 | 21/9/0c | 21/9/0

aAccuracy by the average method.

bAccuracy by k-pass voting (≥4/6 runs correct).

cSignificant difference between prompt-0 and prompt-2.

Consistency of Chatbots

As shown in , among the 30 cases evaluated in 6 runs, Claude-2 with prompt-2 showed substantial agreement (κ=0.65 for overall rating; κ=0.74 for RADS categorization). GPT-4 with prompt-2 demonstrated moderate agreement (κ=0.46 for overall rating; κ=0.41 for RADS categorization). With prompt-1, GPT-4 presented fair to moderate agreement (κ=0.38 for overall rating; κ=0.42 for RADS categorization), Claude-2 showed substantial agreement (κ=0.63 for overall rating; κ=0.61 for RADS categorization), and GPT-3.5 exhibited slight to fair agreement. With prompt-0, Claude-2 showed moderate agreement (κ=0.49) for overall rating and substantial agreement (κ=0.65) for RADS categorization, GPT-4 manifested slight agreement (κ=0.19) for overall rating and fair agreement (κ=0.33) for RADS categorization, and GPT-3.5 showed fair agreement (κ=0.28) for overall rating and moderate agreement (κ=0.57) for RADS categorization.

Table 3. The consistency of different chatbots and prompts among 6 runs.

Chatbot | Prompt-0, Fleiss κ (95% CI) | Prompt-1, Fleiss κ (95% CI) | Prompt-2, Fleiss κ (95% CI) | All, Fleiss κ (95% CI)
Patient-level RADSa categorization
GPT-3.5 | 0.57 (0.48-0.65) | 0.24 (0.15-0.32) | N/Ab | 0.39 (0.33-0.46)
GPT-4 | 0.33 (0.25-0.42) | 0.42 (0.34-0.5) | 0.41 (0.33-0.5) | 0.39 (0.34-0.44)
Claude-2 | 0.65 (0.56-0.74) | 0.61 (0.52-0.7) | 0.74 (0.65-0.83) | 0.69 (0.64-0.74)
Overall rating
GPT-3.5 | 0.28 (0.19-0.37) | 0.14 (0.05-0.23) | N/A | 0.21 (0.14-0.27)
GPT-4 | 0.19 (0.1-0.28) | 0.38 (0.29-0.47) | 0.46 (0.37-0.55) | 0.39 (0.34-0.45)
Claude-2 | 0.49 (0.4-0.58) | 0.63 (0.53-0.72) | 0.65 (0.56-0.75) | 0.66 (0.61-0.72)

aRADS: Reporting and Data Systems.

bN/A: not applicable.

Subgroup Analysis

Because the knowledge base of these chatbots was frozen as of September 2021, predating the latest RADS guideline updates, we compared responses across the different RADS criteria. The total number of accurate responses across the 6 runs was computed for all prompts. Both GPT-4 and Claude-2 demonstrated superior performance for LI-RADS CT/MRI version 2018 compared with Lung-RADS version 2022 and O-RADS MRI (all P<.05; ). delineates the performance of the various chatbots across different prompts and RADS criteria. For the overall rating (A), Claude-2 exhibited a progressive improvement in accuracy from prompt-0 to prompt-1 to prompt-2, with 20.0% (12/60), 36.7% (22/60), and 75.0% (45/60) for LI-RADS; 11.7% (7/60), 18.3% (11/60), and 48.3% (29/60) for Lung-RADS; and 10.0% (6/60), 20.0% (12/60), and 41.7% (25/60) for O-RADS, respectively. Notably, with prompt-2, Claude-2 achieved its highest overall rating accuracy of 75% on the older LI-RADS version 2018. Conversely, GPT-4 improved with prompt-1 and prompt-2 over prompt-0, but prompt-2 did not exceed prompt-1. For the RADS categorization (B), prompt-1 and prompt-2 outperformed prompt-0 for LI-RADS, irrespective of chatbot. However, for Lung-RADS and O-RADS, prompt-0 sometimes superseded prompt-1.

Table 4. The performance of chatbots within different RADS criteriaa.

Chatbots and RADSb | Year of development | RADS categorization (correct/incorrect/unsure), n/n/n | P value | Overall rating (correct/incorrect), n/n | P value
GPT-3.5
LI-RADSc CTd/MRIe | 2018 | 32/86/2 | Reference | 22/98 | Reference
Lung-RADSf | 2022 | 38/78/4 | .83 | 14/106 | .15
O-RADSg MRI | 2022 | 35/84/1 | .46 | 24/96 | .87
GPT-4
LI-RADS CT/MRI | 2018 | 104/74/2 | Reference | 78/102 | Reference
Lung-RADS | 2022 | 40/128/12 | <.001 | 21/159 | <.001
O-RADS MRI | 2022 | 67/110/3 | <.001 | 40/140 | <.001
Claude-2
LI-RADS CT/MRI | 2018 | 93/86/1 | Reference | 79/101 | Reference
Lung-RADS | 2022 | 63/117/0 | .001 | 47/133 | <.001
O-RADS MRI | 2022 | 113/67/0 | .04 | 43/137 | <.001

aData are aggregate numbers across 6 runs.

bRADS: Reporting and Data Systems.

cLI-RADS: Liver Imaging Reporting and Data System.

dCT: computed tomography.

eMRI: magnetic resonance imaging.

fLung-RADS: Lung CT Screening Reporting and Data System.

gO-RADS: Ovarian-Adnexal Reporting and Data System.

Figure 3. The performance of chatbots and prompts within different Reporting and Data Systems criteria. (A) Overall rating and (B) patient-level RADS categorization. LI-RADS: Liver Imaging Reporting and Data System; Lung-RADS: Lung CT (computed tomography) Screening Reporting and Data System; O-RADS: Ovarian-Adnexal Reporting and Data System.

Analysis of Error Types

A total of 1440 chatbot responses were analyzed for error types, with details provided in . The bar plot illustrating the distribution of errors across the 3 chatbots is shown in . A typical example of a factual extraction error (E1) occurred in response to the seventh Lung-RADS question: the statement “The 3mm solid nodule in the lateral basal segmental bronchus is subsegmental” is inaccurate, as the lateral basal segmental bronchus represents one of the 18 defined lung segments and not a subsegment [].

Figure 4. The number of error types for different chatbots. E1: Factual extraction error denotes the chatbots’ inability to paraphrase the radiological findings accurately, consequently misinterpreting the information. E2: Hallucination, encompassing the fabrication of nonexistent Reporting and Data Systems (RADS) categories (E2a) and RADS criteria (E2b). E3: Reasoning error, which includes the incapacity to logically interpret the imaging description (E3a) and the RADS category accurately (E3b). The subtype errors for reasoning imaging description include the inability to reason lesion signal (E3ai), lesion size (E3aii), and enhancement (E3aiii) accurately. E4: Explanatory error, encompassing inaccurate elucidation of RADS category meaning (E4a) and erroneous explanation of the recommended management and follow-up corresponding to the RADS category (E4b).

Hallucination of inappropriate RADS categories (E2a) occurred more frequently with prompt-0 across all 3 chatbots. However, this error rate decreased to zero for Claude-2 when using prompt-2, a trend not seen with GPT-3.5 or GPT-4. A recurrent E2a error in LI-RADS was the obsolete category LR-5V from the 2014 version, which has been superseded by LR-TIV in subsequent editions [,]. Furthermore, hallucination of invalid RADS criteria (E2b) was more prevalent than E2a. For instance, the statement in the response to the second LI-RADS question that “T2 marked hyperintensity is a feature commonly associated with hepatocellular carcinoma (HCC)” is inaccurate, as marked T2 hyperintensity is characteristic of hemangioma rather than HCC. Despite initially higher E2b rates, Claude-2 demonstrated a substantial reduction with prompt-2 (from 105 to 38 instances), exceeding the decrease seen with GPT-4 (from 71 to 57 instances).

Regarding reasoning errors, incorrect RADS category reasoning (E3b) was the most frequent error but decreased for all chatbots with prompt-1 and prompt-2 versus prompt-0. Claude-2 reduced these errors by almost half with prompt-2, while the decrease for GPT-4 was less pronounced. Lesion signal interpretation errors (E3ai) included misinterpreting hypointensity on diffusion-weighted imaging as “restricted diffusion” rather than facilitated diffusion. Lesion size reasoning errors (E3aii) occurred in 34 of 1440 responses, predominantly by Claude-2 (25/34, 73.5%), especially in systems such as Lung-RADS and LI-RADS where size is critical for categorization. Examples included attributing a 12-mm pulmonary nodule to the ≥6-mm but <8-mm range and assigning a hepatic lesion measuring 2.3 cm × 1.5 cm to the 10- to 19-mm category. Enhancement reasoning errors (E3aiii) were exclusive to Claude-2 and occurred in O-RADS, where enhancement significantly impacts categorization; misclassifying images obtained at 40 seconds postcontrast as early or delayed enhancement exemplifies this error.

Explanatory errors (E4), including incorrect RADS category definitions (E4a) and inappropriate management recommendations (E4b), also declined substantially with prompt-1 and prompt-2. For instance, in the response to the first Lung-RADS question, the statement “The 4X designation indicates infectious/inflammatory etiology is suspected” is incorrect; Lung-RADS 4X denotes category 3 or 4 nodules with additional features or imaging findings that increase the suspicion of lung cancer [].


Discussion

Principal Findings

In this study, we evaluated the performance of 3 chatbots—GPT-3.5, GPT-4, and Claude-2—in categorizing radiological findings according to RADS criteria. Using 3 levels of prompts providing increasing structure, examples, and domain knowledge, the chatbots’ accuracies and consistencies were quantified across 30 cases. The best performance was achieved by Claude-2 when provided with few-shot prompting and the RADS criteria PDFs. Interestingly, the chatbots tended to categorize better for the relatively older LI-RADS version 2018 criteria in contrast to the more recent Lung-RADS version 2022 and O-RADS guidelines published after the chatbots’ training cutoff.

The incorporation of RADS, which standardizes reporting in radiology, has been a significant advancement, although the multiplicity and complexity of these systems impose a steep learning curve for radiologists []. Even for subspecialized radiologists at tertiary hospitals, mastering the numerous RADS guidelines poses challenges, requiring familiarity with the lexicons, regular application in daily practice, and ongoing learning to remain current with new versions. While previous studies have shown that LLMs could assist radiologists in various tasks [-,,], their performance at RADS categorization from imaging findings had not been tested. We therefore evaluated LLMs on a focused RADS categorization task using test cases.

Without prompt engineering (prompt-0), all chatbots performed poorly. However, accuracy improved for all chatbots when provided an exemplar prompt demonstrating the desired response structure (prompt-1). This underscores the value of prompt tuning for aligning LLMs to specific domains such as radiology. Further enriching prompt-1 with the RADS guideline PDFs as a relevant knowledge source (prompt-2) considerably enhanced Claude-2’s accuracy, a feat not mirrored by GPT-4. This discrepancy could stem from ChatGPT’s reliance on an external plug-in to access documents, whereas Claude-2’s architecture accommodates the direct assimilation of expansive texts, benefiting from its larger context window and superior long document–processing capabilities.

Notably, we discerned performance disparities across RADS criteria. When queried on older established guidelines such as LI-RADS version 2018 [], the chatbots demonstrated greater accuracy than for more recent schemes such as Lung-RADS version 2022 and O-RADS [,,]. Specifically, GPT-4 and Claude-2 had significantly higher total correct ratings for LI-RADS than for Lung-RADS and O-RADS (all P<.05). This could be attributed to their extensive exposure to the voluminous data related to the mature LI-RADS during their pretraining phase. With prompt-2, Claude-2 achieved 75% (45/60) overall rating accuracy for LI-RADS categorization. The poorer performance on newer RADS criteria highlights the need for strategies to continually align LLMs with the most up-to-date knowledge.

A deep dive into the error-type analysis revealed informative trends. Incorrect RADS category reasoning (E3b) constituted the most frequent error across chatbots and decreased with prompt tuning. Targeted prompting also reduced critical errors such as hallucinations of RADS criteria (E2b) and categories (E2a), likely by constraining output to valid responses. During pretraining, GPT-like LLMs predict the next word over unlabeled data sets, risking the learning of fallacious relationships between RADS features. For instance, Lung-RADS version 2022 lacks categories 5 and 6 [], although some other RADS, such as the Breast Imaging Reporting and Data System, include them []. With prompt-0, chatbots erroneously hallucinated Lung-RADS categories 5 and 6. Explanatory errors (E4), including inaccurate definition of the assigned RADS category (E4a) and inappropriate management recommendations (E4b), also declined substantially with prompt tuning. For instance, when queried on the novel O-RADS criteria with prompt-0, chatbots hallucinated follow-up recommendations from other RADS criteria and responded that “O-RADS category 3 refers to an indeterminate adnexal mass and warrants short-interval follow-up.” Targeted prompting appears to mitigate such critical errors as hallucination and incorrect reasoning; careful prompt engineering is therefore essential to properly shape LLM knowledge for radiology tasks.

Limitations

There are several limitations in this study. First, only LI-RADS CT/MRI and O-RADS MRI were included, excluding the LI-RADS ultrasound (US) and O-RADS US guidelines, which are often applied in an independent US department [,]. Second, the chatbots’ performance was heavily dependent on prompt quality. We tested only 3 types of prompts, and further studies of prompt strategies are warranted to investigate the impact of more exhaustive engineering on chatbot accuracy. Third, GPT-4-turbo was released on November 6, 2023, representing the latest GPT-4 model, with improvements in instruction following, reproducible outputs, and more []. Furthermore, its training data extend to April 2023, compared with September 2021 for the base GPT-4 model tested here. We are uncertain about the newest GPT-4-turbo model’s performance on the RADS categorization task; evaluating GPT-4-turbo represents an important direction for future work. Fourth, our study focused on 3 of the 9 RADS [], with a limited 10 cases for each RADS. Although our choice ensured a blend of old and new guidelines and aimed to cover as many RADS scores as possible, extending the evaluation to all RADS guidelines and incorporating more radiology reports from real clinical scenarios could offer deeper insights into potential limitations. Nonetheless, this initial study highlights critical considerations of prompt design and knowledge calibration required for safely applying LLMs in radiology. Fifth, comparing the performance of LLMs with that of radiologists of varying expertise levels would be valuable for discerning their strengths and weaknesses in real-world applications; this comparative analysis will be undertaken in our forthcoming studies.

Conclusions

When equipped with structured prompts and guideline PDFs, Claude-2 demonstrates potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria. Our study highlights the potential of LLMs in streamlining radiological categorizations while also pinpointing the enhancements necessary for their dependable application in clinical practice for RADS categorization tasks.

This study has received funding from the National Natural Science Foundation of China (82371934 and 82001783) and Joint Fund of Henan Province Science and Technology R&D Program (225200810062). The authors thank Chuanjian Lv, MD; Zejun Wen, MM; and Jianghua Lou, MM, for their help in drafting the radiology reports with regard to Lung CT Screening Reporting and Data System, Liver Imaging Reporting and Data System, and Ovarian-Adnexal Reporting and Data System, respectively.

QW (Henan Provincial People’s Hospital & People’s Hospital of Zhengzhou University), QW (Beijing United Imaging Research Institute of Intelligent Imaging), HL, Y Wang, YB, Y Wu, XY, and MW contributed to study design. QW (Henan Provincial People’s Hospital & People’s Hospital of Zhengzhou University) and QW (Beijing United Imaging Research Institute of Intelligent Imaging) contributed to the statistical analysis. All authors contributed to the acquisition, analysis, or interpretation of the data; the drafting of the manuscript; and critical revision of the manuscript.

QW and PD are senior engineers of Beijing United Imaging Research Institute of Intelligent Imaging and United Imaging Intelligence (Beijing) Co, Ltd. JX and DS are senior specialists of Shanghai United Imaging Intelligence Co, Ltd. The companies have no role in designing and performing the surveillance and analyzing and interpreting the data. All other authors report no conflicts of interest relevant to this article.

Edited by C Lovis; submitted 25.12.23; peer-reviewed by Z Liu, D Bu, TAR Sure, S Nuthakki, L Zhu; comments to author 14.01.24; revised version received 02.02.24; accepted 25.05.24; published 17.07.24.

©Qingxia Wu, Qingxia Wu, Huali Li, Yan Wang, Yan Bai, Yaping Wu, Xuan Yu, Xiaodong Li, Pei Dong, Jon Xue, Dinggang Shen, Meiyun Wang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 17.07.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
