Evaluating the performance of OpenAI’s GPT-3.5 Turbo in extracting information from scientific articles on diabetic retinopathy

It is critical for policy and clinical guidelines to be based on the best available evidence [5]. Systematic reviews are often regarded as the gold standard for evidence synthesis [8], appraising the latest evidence transparently and objectively by employing established standards [2]. Consequently, poorly conducted systematic reviews can potentially misinform future clinical guidelines and policies [8]. Misinformed clinical guidelines could equip practitioners with inaccurate scientific information and clinical advice, compromising the quality of care [9]. Additionally, misinformed clinical guidelines could potentially encourage ineffective, harmful, or wasteful interventions [9].

As the number of primary studies continues to increase, the current practice of manual information extraction by researchers for the synthesis of systematic reviews is neither sustainable nor efficient. Depending on the experience of the researcher and the number of studies selected, conducting a systematic review can take up to 2 years [7]. Given the surge of artificial intelligence (AI) in recent years, AI has the potential to be a powerful tool for speeding up information extraction from individual scientific articles. In particular, large language models (LLMs), a form of generative AI exemplified by the generative pre-trained transformer (GPT), can process and generate text and may be adopted to expedite the systematic review process. Layered on top of an LLM, retrieval-augmented generation (RAG) adds an information retrieval component that can improve the accuracy of information extraction from articles, delivering more precise and contextually relevant responses [3].
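
To illustrate the kind of pipeline such an approach implies, the sketch below pairs simple keyword-based retrieval over PDF text with a GPT-3.5 Turbo prompt via the OpenAI Python client. It is a minimal, hypothetical example, assuming the `openai` (v1) and `pypdf` packages; the file name, question, and chunking parameters are assumptions, and this is not the pipeline evaluated in this study.

```python
# Minimal retrieval-augmented extraction sketch (illustrative only; not the
# study's pipeline). Assumes the `openai` v1 and `pypdf` packages and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()


def load_chunks(pdf_path: str, chunk_size: int = 1500) -> list[str]:
    """Extract text from every PDF page and split it into fixed-size chunks."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def retrieve(chunks: list[str], query: str, k: int = 3) -> list[str]:
    """Rank chunks by keyword overlap with the query (the retrieval step)."""
    terms = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))
    return ranked[:k]


def extract(pdf_path: str, question: str) -> str:
    """Augment the prompt with retrieved context and query GPT-3.5 Turbo."""
    context = "\n---\n".join(retrieve(load_chunks(pdf_path), question))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided excerpts from the article."},
            {"role": "user",
             "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


# Hypothetical usage: extract a study characteristic from a DR article.
# print(extract("dr_study.pdf", "What was the sample size of the study?"))
```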

Systematic reviews require high methodological accuracy, which may be difficult for AI to attain [6]. One such challenge is hallucination, in which the model produces information that sounds plausible but is either factually incorrect or unrelated to the given context [10]. Evaluating the performance of AI tools adopted for this process is therefore critical. In recent years, considerable research has evaluated the performance of ChatGPT, a publicly available LLM, in medical research [1]. The accuracy of ChatGPT in answering medical queries across domains such as cancer, liver diseases, and COVID-19 vaccination has been assessed [1], with reported accuracy ranging from 18.3% to 100% [1]. However, the performance of OpenAI’s GPT-3.5 Turbo in extracting information from scientific articles stored in PDF format has not been evaluated.

This study aimed to compare OpenAI’s GPT-3.5 Turbo with conventional human extraction methods in terms of the concordance of extracted information and the time taken to retrieve relevant information from scientific articles on diabetic retinopathy (DR).
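
To make the concordance outcome concrete, the hypothetical sketch below scores agreement as the proportion of extraction fields on which the model's answer matches the human reviewer's. The field names and values are illustrative assumptions and do not reflect the study data.

```python
# Hypothetical field-level concordance scoring; field names and values are
# illustrative only.
gpt_extraction = {"sample_size": "1200", "follow_up_years": "5", "dr_grading": "ETDRS"}
human_extraction = {"sample_size": "1200", "follow_up_years": "4", "dr_grading": "ETDRS"}

# Count fields where the model's answer matches the human-extracted value.
matches = sum(
    gpt_extraction.get(field, "").strip().lower() == value.strip().lower()
    for field, value in human_extraction.items()
)
concordance = matches / len(human_extraction)
print(f"Concordance: {concordance:.0%}")  # 67% in this toy example
```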
