Task-Specific Transformer-Based Language Models in Health Care: Scoping Review


Introduction

Background

Transformer models have revolutionized natural language processing (NLP) with their state-of-the-art performance in applications such as conversation, translation, text classification, and text generation. A transformer model is a type of deep learning model designed to process and generate sequences of data, such as text. Its key innovation is the self-attention mechanism, which weighs the importance of different words in the input sequence regardless of their position, allowing the model to attend to all parts of the sequence simultaneously rather than processing it in a fixed order. This mechanism captures complex patterns and relationships within the context more effectively than previous models, which is particularly useful for understanding and generating natural language. These models hold significant promise for the health care sector, addressing clinical challenges and unlocking new opportunities in medical informatics (eg, disease prediction, clinical decision support, and patient interaction).
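The mechanism can be illustrated with a minimal, self-contained NumPy sketch of single-head scaled dot-product self-attention (dimensions, weights, and function names here are illustrative, not drawn from any specific model):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of every token to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights               # context-mixed representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Each row of the attention matrix shows how strongly one token attends to every other token, in any position, which is the property the paragraph above describes.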

Since the introduction of the transformer model by Google [] in 2017, it has become the foundation for various pretrained language models (PLMs). PLMs are transformer models that are initially trained on a large text corpus before being fine-tuned for specific tasks. This pretraining allows the models to leverage vast amounts of unstructured data to improve their performance in various NLP tasks. Two of the most widely used PLM architectures in medical research are Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT). GPT is designed to generate coherent text based on a given input, making it useful for tasks like dialogue generation []. BERT, on the other hand, is designed to understand the context of words in a sentence from both directions, making it highly effective for tasks like question answering and text classification []. These models continue to advance the state of the art in NLP with their impressive performance.
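The architectural contrast between the two families largely comes down to the attention mask. The following hedged sketch (illustrative names and sizes) shows that a BERT-style encoder attends in both directions, while a GPT-style decoder applies a causal mask that hides future tokens:

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax attention weights; a causal mask blocks attention to future tokens."""
    if causal:
        # GPT-style: token i may only attend to positions 0..i (left context)
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    # BERT-style (causal=False): every token attends in both directions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                       # uniform similarities for illustration
bert_w = attention_weights(scores)              # each token attends to all 4 positions
gpt_w = attention_weights(scores, causal=True)  # token 0 attends only to itself
```

This is why BERT-style models excel at context understanding (both directions visible) while GPT-style models are natural text generators (left-to-right prediction).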

Despite the success of transformer-based language models in many domains, there is a significant gap in comprehensive reviews specifically focused on their application in the health care domain. In health care, transformer-based language models have been used for crucial tasks such as disease prediction, decision-making, and image analysis []. The abundance of free-text sources, including social media, electronic medical records (EMRs), physician-patient conversations, and online encyclopedias, also poses substantial challenges for language models. The application of NLP in health care is not without controversy, particularly concerning data privacy, ethical implications, and the integration of artificial intelligence (AI) systems into clinical practices. Debates continue about the extent to which AI can replace human judgment, the transparency of AI decision-making processes, and the potential biases in AI models trained on unbalanced datasets. By addressing these concerns, our paper contributes to the timely and critical discourse on the responsible deployment of transformer-based language models in health care, emphasizing the need for transparency, fairness, and ethical considerations in AI development.

Objective

The objective of this paper is to provide a comprehensive scoping review of task-specific transformer-based language models in health care. By focusing on models pretrained on medical corpora, we aim to address the gap in existing literature where detailed surveys specifically tailored to health care applications are lacking. We seek to highlight the strengths, limitations, and potential of these models, offering valuable insights for future research and practical applications in medical informatics.

Related Work

While many review studies have examined NLP in the medical field [-], transformer-based language models [-], and health-related domains [-], comprehensive and up-to-date surveys of transformer-based language models in health care are lacking, leaving a gap in understanding their full potential and limitations. Pandey et al [] introduced RedBERT, a model focusing on topic discovery and deep sentiment classification of COVID-19 online discussions, demonstrating the application of NLP in understanding public health concerns. Iroju and Olaleke [] conducted a systematic review of NLP applications, identifying key areas where NLP can enhance clinical decision-making and patient care. Similarly, Locke et al [] provided a comprehensive overview of NLP in medicine, emphasizing the potential of NLP technologies in transforming medical practice. Adyashreem et al [] surveyed various NLP techniques in the biomedical field, shedding light on how these techniques can be applied to biomedical text for improved information extraction and analysis. Wang et al [] reviewed the application of NLP in clinical medicine, highlighting the advancements and challenges in integrating NLP with clinical workflows.

Khanbhai et al [] applied NLP and machine learning techniques to patient experience feedback, providing insights into patient satisfaction and areas for improvement in health care services. Casey et al [] focused on NLP applications in radiology reports, identifying how NLP can streamline the interpretation and reporting of radiological findings. Zhou et al [] discussed the broader applications of NLP for smart health care, envisioning a future where NLP-driven systems enhance patient care and operational efficiency.

In the realm of transformer-based language models, Zhang et al [] surveyed their applications in bioinformatics, highlighting how these models have advanced the analysis of biological data. Yang [] and Lin et al [] explored the progress and applications of transformer models in Korean and general NLP tasks, respectively, highlighting their growing importance and versatility. Chitty-Venkata et al [] reviewed neural architecture search for transformers, underscoring the potential of these models in optimizing NLP tasks. Gillioz et al [] provided an overview of transformer-based models for various NLP tasks, illustrating their adaptability and efficiency. Han et al [] focused on multimodal pretrained models, emphasizing their capability to handle diverse data types, including text, image, and audio. Greco et al [] and Albalawi et al [] discussed transformer models’ applications in mental health and Arabic social media, respectively, highlighting their potential in understanding and addressing specific health-related issues. Kalyan et al [] and Shamshad et al [] provided comprehensive surveys on biomedical PLMs and their applications in medical imaging, respectively, showcasing the transformative impact of transformers in these fields.

Our review categorizes these models into 6 key tasks: dialogue generation, question answering, summarization, text classification, sentiment analysis, and named entity recognition (NER). Ultimately, advancements in transformer-based language models hold the promise of significantly transforming health care delivery and improving patient outcomes. By enabling more accurate disease prediction, enhancing clinical decision support, and facilitating better patient-provider communication, these models can lead to more efficient, effective, and personalized health care. Our review underscores the broader implications of these technologies, advocating for continued research and development to harness their full potential in revolutionizing medical informatics and patient care.


Methods

Information Source and Search Strategy

We followed the Cochrane scoping review protocol to conduct the review and map the available literature in an efficient and systematic manner. This method involves defining the research question, identifying relevant studies, selecting studies based on predefined criteria, charting data, and summarizing the results to clarify key concepts and identify research gaps [].

Our research team (mainly HNC and TJJ) conducted a comprehensive literature search to identify studies in the field that met the inclusion and exclusion criteria. The screening and selection of papers were conducted by 2 independent reviewers (HNC and TJJ). Initially, titles and abstracts were screened to identify relevant studies. Full texts of potentially eligible studies were then reviewed to ensure they met the inclusion criteria. Disagreements between reviewers were resolved through discussion and consensus, with a third reviewer (YHK) consulted if necessary. Our literature search was conducted across several scientific databases, including Google Scholar and PubMed, which were selected for their comprehensive coverage of relevant journals and peer-reviewed studies in the medical and academic fields. We covered publications from January 01, 2017, to September 30, 2024, and used specific combinations of keywords and Boolean operators, such as “transformer-based AND language models AND medical domain,” “health care AND language models,” “NLP AND medicine AND survey,” and “GPT AND BERT AND health care.” Data extraction involved summarizing key findings, model names, and training datasets. The extracted data were cross-verified by both reviewers to ensure accuracy and consistency. Any discrepancies were resolved through discussion.
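For illustration, keyword combinations of the kind listed above can be generated programmatically. The sketch below is hypothetical (the review's actual queries were composed manually); it simply shows how term groups combine under the AND operator:

```python
from itertools import product

def boolean_queries(term_groups):
    """Combine one term from each group with AND, quoting multi-word terms."""
    quote = lambda t: f'"{t}"' if " " in t else t
    return [" AND ".join(quote(t) for t in combo) for combo in product(*term_groups)]

# Hypothetical keyword groups echoing the search strings used in the review
groups = [["transformer-based", "GPT", "BERT"],
          ["language models"],
          ["health care", "medicine"]]
queries = boolean_queries(groups)
# first query: transformer-based AND "language models" AND "health care"
```

Enumerating combinations this way helps ensure that no keyword pairing in a search strategy is accidentally skipped.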

We included studies that involved transformer-derived models applied to medical tasks, were published in peer-reviewed journals, and were written in English. The exclusion criteria involved studies focusing solely on non-text data (eg, audio, image, and video) or those not meeting the inclusion requirements. The selection of tasks (dialogue generation, question answering, summarization, text classification, sentiment analysis, and NER) was based on their critical role in advancing health care applications of transformer models. The specific process is illustrated in Figure 1, with details of each stage of filtering from the initial identification of articles to the final selection. The inclusion criteria were rigorously applied at each step, beginning with the screening of titles and abstracts, followed by a full-text review, and culminating in the inclusion of studies that met all predefined criteria. This methodical approach allowed us to compile a comprehensive and focused set of articles for our scoping review, ensuring that our findings are both robust and reliable.

These tasks cover a wide range of functionalities essential for improving clinical workflows, enhancing patient interactions, and facilitating efficient information retrieval and analysis, making them vital for the advancement of transformer-based language models in the medical domain. Languages and model types were chosen to represent a diverse range of medical contexts and applications.

Figure 1. Article filtering process with inclusion criteria.

In this section, we examine studies that have used language models in health care applications. Based on the literature review, Table 1 provides a comprehensive list of transformer-based models applied in the medical domain, comparing each task based on the authors, model name, training dataset, PLM model, key metric, score, and purpose or findings of the study. These English-written PLMs in the health care domain were categorized into 6 distinct tasks, namely dialogue generation, question answering, summarization, text classification, sentiment analysis, and NER. The articles within each task are listed in no particular order. Figure 2 presents the evolution timeline of transformer-based language models, providing an overview of significant models that have been developed for use in medicine. It illustrates key milestones and the deployment criteria used to guide the inclusion of studies in our review. This historical context provides a foundation for understanding the methodological choices made in our scoping review, highlights the emergence of models over time and their increasing significance in health care applications, and allows us to anticipate future advancements by tracking the development of these models.

Table 1. Summary of the applications of pretrained language models subdivided into tasks. Each row lists: author (year) | model name | training dataset | PLM^a model | key metric | score (%) | key findings.

Conversation

Varshney et al [] (2023) | Medical Entity Prediction (MEP) | UMLS | BERT | Accuracy | 85 | Integrated triples from knowledge graphs to enhance medical predictions using a large pretrained model.
Yuan et al [] (2022) | BioBART | PubMed | BART | ROUGE-2 | 65 | Adapted and improved biomedical context understanding through advanced generative techniques.
Zhao et al [] (2022) | MedPIR | MedDG, MedDialog | BERT, GPT | F1 | 82 | Used a knowledge-aware dialogue graph encoder (KDGE) and recall-enhanced generator (REG) to improve clinical responses.
Chen et al [] (2023) | OPAL | Wikipedia, WOZ, CamRest676 | BART | BLEU | 21.5 | Tailored for task-oriented medical dialogues by incorporating domain-specific ontologies.
Liang et al [] (2021) | MKA-BERT-GPT | MedDG, MedDialog-CN | BERT, GPT | Relevance improvement | 15 | First scalable model to integrate a medical knowledge graph into a large pretrained model, enhancing biomedical understanding.
Compton et al [] (2021) | MEDCOD | KB, doctor edits | GPT-3 | Emotive accuracy | 90 | Generated diverse, emotive, and empathetic sentences for health care interactions.
Li et al [] (2023) | ChatDoctor | 5000 doctor-patient conversations | LLaMA | Precision, recall, F1 | 83.7, 84.5, 84.1 | Fine-tuned LLaMA model using tailored doctor-patient dialogues for medical NLP^b tasks.
Tang et al [] (2023) | -w terms+AL | MedDialog | BART | Annotation accuracy | 87 | Automated large-scale medical conversation text annotation with terminology extraction.
Zeng et al [] (2020) | Transformer-DST | MultiWOZ | BERT | DST accuracy | 54.6 | Proposed a transformer-based framework using a flat encoder-decoder architecture for dialogue state tracking in medical contexts.
Suri et al [] (2021) | MeDiaBERT | MeDiaQA | BERT | Accuracy | 64.3 | Employed a hierarchical approach to medical dialogue analysis, including multiple-choice question answering.
Phan et al [] (2021) | SciFive | PubMed | T5 | Accuracy | 86.6 | A medical T5 text-to-text model effective for various clinical downstream tasks.
Wu et al [] (2023) | PMC-LLaMA | PubMed, 30K medical books | LLaMA | Accuracy | 64.43 | Transitioned a general-purpose model to a high-performing medical language model via comprehensive fine-tuning, achieving state-of-the-art performance in medical question answering.
Zhang et al [] (2023) | HuatuoGPT | Huatuo26M | GPT | BLEU, ROUGE, distinct | 25.6, 27.76, 93 | Chinese health care LLM tailored for the Chinese medical domain, providing state-of-the-art results in medical consultation tasks.

Question answering

Lee et al [] (2019) | BioBERT | PubMed, EHR^c, clinical notes, patents | BERT | MRR improvement | 12.24 | First domain-specific BERT-based model for biomedical text mining, outperforming standard BERT in medical tasks.
Luo et al [] (2023) | BioGPT | PubMed | GPT | Accuracy | 78.2 | Pretrained on a 15M PubMed corpus, this model outperforms GPT-2 in biomedical text generation.
Shin et al [] (2020) | BioMegatron | Wikipedia, news, OpenWebText | Megatron-LM | Bias | 40 | Enhanced the representation of biomedical entities across a large corpus for better entity understanding.
Rasmy et al [] (2020) | Med-BERT | Cerner Health Facts, Truven | BERT | AUC^d improvement | 20 | First proof-of-concept BERT model for integrating electronic health records.
Yasunaga et al [] (2022) | LinkBERT | Wikipedia | BERT | Improvement | 5 | Effective in multi-hop reasoning and few-shot question answering by linking documents.
Michalopoulos et al [] (2020) | UmlsBERT | MIMIC-III | BERT | F1 | 86 | Learned the association of clinical terms within the UMLS Metathesaurus.
Zhang et al [] (2021) | SMedBERT | ChineseBLUE | BERT | Accuracy | 78 | Introduced a mention-neighbor hybrid attention model for heterogeneous medical entity information.
Yang et al [] (2022) | ExKidneyBERT | EMR^e | BERT | Accuracy | 95.8 | A specialized model focused on renal transplant-pathology integration.
Mitchell et al [] (2021) | CaBERTnet | Pathology reports | BERT | Accuracy | 85 | An automatic system for extracting tumor sites and histology information.
Trieu et al [] (2021) | BioVAE | PubMed | SciBERT, GPT-2 | VAE | 72.9 | First large-scale pretrained language model using the OPTIMUS framework in the biomedical domain.
Khare et al [] (2021) | MMBERT | ROCO | BERT | Accuracy | 72 | Proposed masked language modeling for radiology text representations.
Yasunaga et al [] (2022) | BioLinkBERT | Wikipedia, Book Corpus | BERT | BLURB | 84 | A novel linking method for predicting document relations in pretraining models.
Nguyen et al [] (2022) | SPBERTQA | ViHealthQA | SBERT | Mean average precision | 69.5 | A 2-step question answering system addressing linguistic disparities with BM25 and Sentence-BERT.
Luo et al [] (2023) | BioMEDGPT | PubMed | GPT | Accuracy | 76.1 | First multimodal GPT capable of aligning biological modalities with human language for medical text analysis.
Toma et al [] (2023) | Clinical Camel | PubMed, USMLE, MedMCQA | LLaMA-2 | Five-shot accuracy | 74.3, 54.3, 47.0 | A model that outperforms GPT-3.5 by using efficient fine-tuning techniques.
Han et al [] (2023) | MedAlpaca | Medical flash cards, Wikidoc | Alpaca | Accuracy | 21.1-24.1 | Highlighted privacy protection in medical artificial intelligence and demonstrated significant performance enhancements in medical certification exams through fine-tuning.
Singhal et al [] (2023) | MedPaLM-2 | MedMCQA, MedQA, PubMedQA, MMLU | PaLM | Accuracy | 67.6 | Instruction prompt tuning undergoes rigorous human evaluation to assess harm avoidance, comprehension, and factual accuracy.
Chen et al [] (2023) | MEDITRON | PubMed | LLaMA-2 | Accuracy | 79.8 | Achieved 6% improvement over the best public baseline and 3% gain over fine-tuned Llama-2 models.

Summarization

Yan et al [] (2022) | RadBERT | Open-I chest radiograph reports | BERT | Accuracy, F1 | 97.5, 95 | Adapted a bidirectional encoder representation for radiology text.
Du et al [] (2020) | BioBERTSum | PubMed | BERT | ROUGE-L | 68 | Introduced the first transformer-based model for extractive summarization in the biomedical domain.
Li et al [] (2022) | Clinical-Longformer & Clinical-Big Bird | MIMIC-III | Longformer, Big Bird | F1 | 97 | Reduced memory usage through sparse attention in a long-sequence transformer.
Moro et al [] (2022) | DAMEN | MS2 | BERT, BART | Accuracy | 75 | Developed a multi-document summarization method using token probability.
Chen et al [] (2020) | AlphaBERT | EHR (NTUH-iMD) | BERT | Accuracy | 69.3 | Designed a diagnoses summarization model based on character-level tokens.
Alsentzer et al [] (2019) | Bio+Clinical BERT | MIMIC-III | BERT | F1 improvement | 11 | Released the first BERT-based model specifically for clinical text.
Cai et al [] (2021) | ChestXRayBERT | MIMIC | BERT | Accuracy | 73 | Automatically generates abstractive summarization of radiology reports.
Yalunin et al [] (2022) | LF2BERT | UMLS, EHR | BERT | ROUGE-1 F1, ROUGE-2 F1, ROUGE-L F1 | 67, 56.4, 64.5 | Developed a neural abstractive model for summarizing long medical texts.
Balde et al [] (2024) | MEDVOC | PubMed, BioASQ, EBM | GPT | ROUGE | 51.49, 47.54, 19.51 | Efficiently reduced fine-tuning time and improved vocabulary adaptation for medical texts.

Text classification

Yang et al [] (2022) | GatorTron | Clinical notes, UF Health clinical corpus, MIMIC-III, PubMed, Wikipedia | GPT | Pearson correlation | 89 | Outperformed previous biomedical and clinical domain models.
Gu et al [] (2021) | PubMedBERT | PubMed | BERT | BLURB | 81.2 | Established a leaderboard for biomedical NLP, with robustness against noisy and incomplete biomedical text.
Huang et al [] (2020) | ClinicalBERT | EHR (clinical notes) | BERT | Accuracy, precision, recall, AUROC^f, AUPRC^g | 72.7, 37.6, 54.2, 74.2, 42.0 | Introduced “catastrophic forgetting prevention” and generated visualized interpretable embeddings.
Gupta et al [] (2022) | MatSciBERT | Wikipedia, clinical database, Book Corpus | BERT | F1 | 81.5 | Effective transformer model for scientific text analysis.
Fang et al [] (2023) | Bioformer | PubMed, PMC | BERT | Model size, speed | 60% smaller, 2-3× faster | Reduced model size by 60% for biomedical text mining.
Gururangan et al [] (2020) | BioMed-RoBERTa | CHEMPROT, PubMed | RoBERTa | F1 | 83.4 | Proposed domain- and task-adaptive pretraining with a data selection strategy.
Liao et al [] (2023) | Mask-BERT | PubMed, NICTA-PIBOSO, symptoms | BERT | Accuracy, F1, PR-AUC^h | 91.8, 89.6, 93.1 | Improved a BERT-based model for multiple tasks with masked input text.
He et al [] (2022) | KG-MTT-BERT | EHR | BERT | Accuracy | 82 | Developed a model for multi-type medical tests using a knowledge graph.
Yang et al [] (2023) | TransformEHR | EHR | BERT | AUROC, AUPRC | 81.95, 78.64 | Set a new standard in clinical disease prediction using longitudinal EHRs.
Pedersen et al [] (2023) | MeDa-BERT | EMR | BERT | Accuracy | 86.7-97.1 | Tailored embeddings for Danish medical text processing.
Hong et al [] (2023) | SCHOLARBERT | Public resources | BERT | F1 | 85.49 | Leveraged a public resource-driven dataset for scientific NLP.
Abu Tareq Rony et al [] (2024) | MediGPT | Illness dataset | GPT | Accuracy, F1 | 90.0, 88.7 | Improved medical text classification tasks, showing performance gains of up to 22.3% compared with traditional methods.

Sentiment analysis

Ji et al [] (2021) | MentalBERT/MentalRoBERTa | Reddit | BERT, RoBERTa | F1, recall | 81.76, 81.82 | A pretrained masked model designed for mental health detection.
Taghizadeh et al [] (2021) | SINA-BERT | Self-gathered collection of texts from online sources | BERT | Precision, recall, macro F1, accuracy | 94.91, 94.63, 94.77, 96.14 | Developed a pretrained language model for the Persian medical domain.
AlBadani et al [] (2022) | SGTN | SemEval, SST2, IMDB, Yelp | BERT | Accuracy | 80 | Proposed the first sentiment analysis model using a transformer-based graph algorithm.
Pandey et al [] (2021) | RedBERT | Reddit | BERT | Accuracy | 86.05 | Introduced a sentiment classification method from web-scraped data.
Palani et al [] (2021) | T-BERT | Twitter | BERT | Accuracy | 90.81 | Designed a sentiment classification method for microblogging platforms.
Mao et al [] (2022) | AKI-BERT | MIMIC-III | BERT | AUC, precision, recall/sensitivity, F1, specificity, negative predictive value | 74.7, 35.6, 61.9, 45.2, 76.8, 90.7 | Created a BERT model for predicting acute kidney injury.
Chaudhary et al [] (2020) | TopicBERT | Ohsumed | BERT | Cost optimization | 70 | Improved computational efficiency by combining topic and language models for fine-tuning.
Qudar et al [] (2020) | TweetBERT | NCBI, BC5CDR, BIOSSES, MedNLI, ChemProt, GAD, JNLPBA | BERT | F1 | 87.1 | Achieved state-of-the-art performance on biomedical datasets using Twitter data for pretraining.
Wouts et al [] (2021) | BelabBERT | DBRD | RoBERTa | Accuracy | 95.9 | Developed a Dutch language model for psychiatric disease classification.

Named entity recognition

Li et al [] (2020) | BEHRT | EHR | BERT | Accuracy | 81 | Interpretable model for multi-heterogeneous medical concepts.
Shang et al [] (2019) | G-BERT | EHR | BERT | Jaccard, PR-AUC, F1 | 45.7, 69.6, 61.5 | The first pretraining method for medication recommendation in the medical domain.
Lentzen et al [] (2022) | BioGottBERT | Wikipedia, drug leaflets from AMIce, LIVIVO | RoBERTa, GottBERT | Accuracy | 78 | Introduced the first transformer model for German medical texts.
Davari et al [] (2020) | TIMBERT | PubMed | BERT | Precision, recall, F1 | 90.5, 91.2, 90.9 | Developed a BERT-based model for automated toponym identification.
Peng et al [] (2019) | BlueBERT | PubMed, MIMIC-III | BERT | Masked token score | 77.3 | Demonstrated strong generalization ability across biomedical texts and cross-lingual tasks.
Miolo et al [] (2021) | ELECTRAMed | NCBI | BERT | Precision, recall, F1 | 85.9, 89.3, 87.5 | The first ELECTRA-based model for the biomedical domain.
Khan et al [] (2020) | MT-BioNER | BC2GM, BC5CDR, NCBI-Disease | BERT | Precision, recall, F1 | 88.4, 90.52, 89.5 | A multi-task transformer model for slot tagging in the biomedical domain.
Naseem et al [] (2020) | BioALBERT | PubMed, PMC | BERT | Precision, recall, F1 | 97.4, 94.4, 95.9 | Trained on large biomedical corpora using ALBERT for biomedical text mining.
Yang et al [] (2021) | BIBC | Textbooks, research papers, clinical guidelines | BERT | Accuracy | 78 | Designed a new architecture for processing long text inputs in diabetes literature.
Martin et al [] (2020) | CamemBERT | Wikipedia | RoBERTa | Accuracy | 85.7 | Developed the first monolingual RoBERTa model for French medical text.
Kraljevic et al [] (2021) | MedGPT | EHR | GPT | Precision | 64 | Efficiently handled noise in EHR data using NER and MedCAT.
Li et al [] (2019) | EhrBERT | EHR | BERT | F1 | 93.8 | Proposed an entity normalization technique for 1.5 million EHR notes.
Gwon et al [] (2024) | HeartBERT | EMR | BERT | Accuracy | 74 | Emphasized the importance of department-specific language models, with a focus on cardiology.
Mannion et al [] (2023) | UMLS-KGI-BERT | UMLS | BERT | Precision | 85.05 | Introduced a graph-based learning method with masked-language pretraining for clinical text extraction.
Schneider et al [] (2023) | CardioBERTpt | EHR | BERT | F1-score | 83 | Specialized in extracting Portuguese cardiology terms, demonstrating that data volume and representation improve NER performance.
Saleh et al [] (2024) | TocBERT | MIMIC-III | BERT | F1 | 84.6 | Outperformed a rule-based solution in differentiating titles and subtitles for a discharge summary dataset.

^a PLM: pretrained language model.

^b NLP: natural language processing.

^c EHR: electronic health record.

^d AUC: area under the curve.

^e EMR: electronic medical record.

^f AUROC: area under the receiver operating characteristic curve.

^g AUPRC: area under the precision-recall curve.

^h PR-AUC: precision-recall area under the curve.

Figure 2. Timeline of significant transformer-based models in health care. EHR: electronic health record; NER: named entity recognition; NLP: natural language processing.
Results

Selected Studies

A total of 75 models were identified through our comprehensive review. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flowchart is presented in . The PRISMA checklist is presented in . These papers encompass various research areas related to transformer-based models and their applications in the medical domain. The selection of these papers was based on predefined inclusion criteria, ensuring the relevance of each study to the scope of our review.

Figure 3. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram for the review process.

Applications of Language Models in Health Care: Task-Specific

Dialogue Generation

Dialogue generation produces responses to a given conversation. GPT models, including DialoGPT and DialogBERT, can effectively generate human-like dialogues based on large corpora and contextualized representations of text [,-]. In the medical domain, dialogue generation focuses on developing conversations related to medical information []. Chatbots in health care can be classified into 6 types: screening and diagnosis, treatment, monitoring, support, workflow efficiency, and health promotion. These roles include aiding patient consultation, acting as a physician’s decision support system, collaborating in interdisciplinary research, and providing care instructions and medical education [,].

The key models are MEP, BioBART, MedPIR, MEDCOD, Transformer-DST, MeDiaBERT, ChatDoctor, and SciFive.

Research efforts on conversation generation in medicine have also incorporated knowledge graphs. MKA-BERT-GPT was the first scalable work to integrate a medical knowledge graph mechanism into a large pretrained model. Meanwhile, MedPIR proposed a recall-enhanced generator framework that uses a knowledge-aware dialogue graph encoder to strengthen the relationship between the user input and the response via past conversation information [,], achieving an F1 score of 82%; the related BART-based OPAL model reached a bilingual evaluation understudy (BLEU) score of 21.5. Varshney et al [] proposed the Masked Entity Dialogue (MED) model, which can be trained on smaller corpora; by incorporating conversation history into knowledge-graph triples, it addresses the locality problem of entity embeddings and achieves a 10% improvement in entity prediction accuracy, enabling automatic prediction of medical entities.

On the other hand, MEDCOD [] used the GPT pretrained model to integrate emotive and empathetic aspects into its output sentences, imitating human physician–like communication to better engage patients. The Transformer-DST [] model addresses dialogue state tracking, jointly optimizing state operation prediction and value generation with high accuracy by having the model consider the whole dialogue and the previous state. Moreover, MeDiaBERT [], which nests one transformer encoder within another in a hierarchical manner, achieves 63.8% accuracy on multiple-choice medical queries.

BioBART [], a BART-based model, used patient descriptions and conversation histories as input to autoregressively generate replies to user inputs. The model outperformed BART by 1.71 on ROUGE-2 with a BLEU score of 4.45, and pretraining on PubMed abstracts further supported its performance. OPAL [] and the -w terms+AL model also used BART for pretraining. OPAL’s method involves 2 phases: pretraining on large-scale contextual texts with structured information extracted using an information-extraction tool, and fine-tuning the pretrained model on task-oriented dialogues. The results showed a significant performance boost, compensating for scarce annotated data with large volumes of structured dialogue data. Recently, the -w terms+AL model proposed a framework for improving dialogue generation by incorporating domain-specific terminology through an automatic terminology annotation framework using a self-attention mechanism [].

While other models are based on BERT, GPT, or BART, ChatDoctor [], SciFive [], and PMC-LLaMA [] use LLaMA or T5 PLMs. To improve accuracy and provide informed advice in medical consultations, ChatDoctor used Meta’s open-source LLaMA [], which was fine-tuned using real-world patient-physician conversations and autonomous knowledge retrieval capabilities, achieving 91.25% accuracy. SciFive, a Text-To-Text Transfer Transformer–based model, was pretrained on large biomedical corpora, indicating its significant potential for learning large and extended outputs. The SciFive model was trained using a maximum likelihood objective with “teacher forcing” [] for multi-task learning by leveraging task-specific tokens in the input sequence. Both models outperformed previous baseline methods.
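The maximum likelihood objective with “teacher forcing” used by SciFive can be sketched as follows: during training, the decoder receives the gold sequence shifted right by one position and is scored with cross-entropy on predicting each next token. The toy NumPy version below (random embedding and output weights standing in for a real decoder) is for illustration only:

```python
import numpy as np

def teacher_forced_loss(token_ids, embed, W_out):
    """One maximum-likelihood step: the model sees gold tokens shifted right
    and is scored on predicting each next token (teacher forcing)."""
    inputs, targets = token_ids[:-1], token_ids[1:]   # shift-by-one alignment
    logits = embed[inputs] @ W_out                    # toy "decoder": embedding + linear layer
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()  # cross-entropy

rng = np.random.default_rng(1)
vocab, dim = 10, 4
embed, W_out = rng.normal(size=(vocab, dim)), rng.normal(size=(dim, vocab))
loss = teacher_forced_loss(np.array([2, 5, 1, 7]), embed, W_out)
```

The key point is that the model always conditions on the ground-truth prefix rather than on its own (possibly wrong) earlier predictions, which stabilizes training.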

More recently, the HuatuoGPT model [], specifically tailored for the Chinese medical domain, provided state-of-the-art results in medical consultation tasks.

Question Answering

The question-answering task involves answering questions posed by users based on the texts in documents. It aims to generate an accurate response that directly answers the question input, contributing to clinical decision-making, medical education, and patient communication. Enabling physicians and researchers to quickly obtain valuable answers from electronic health records (EHRs) and the medical literature effectively reduces the time and effort required by manual review. While the dialogue generation and question-answering tasks both involve providing answers, the former focuses on generating responses within a conversation, whereas the latter focuses on developing specific answers to user questions.
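In extractive question answering, a common formulation used by BERT-style QA models scores each document token as a potential answer-span start or end and returns the highest-scoring span. A minimal sketch of that span-selection step, with made-up logits, is:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=15):
    """Pick the (start, end) token span maximizing start+end scores, with start <= end."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(start_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start = np.array([0.1, 2.0, 0.3, 0.2])   # model favors token 1 as span start
end   = np.array([0.0, 0.5, 3.0, 0.1])   # and token 2 as span end
span = best_span(start, end)              # → (1, 2)
```

In a full system, the logits would come from the model’s start/end prediction heads over the concatenated question and document tokens.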

The key models are BioBERT, BioGPT, BioMegatron, Med-BERT, UmlsBERT, SMedBERT, and BioVAE.

BERT-based language models have become increasingly popular in biomedical text mining as they can understand the context and generate accurate predictions. BioBERT [], the first domain-specific BERT-derived transformer language model for biomedical text mining applications, achieved 89% accuracy on the MedQA dataset and outperformed BERT in medical text applications. BioMegatron [], based on Megatron-LM [], was also evaluated on a question-answering task; Shin et al [] found that overall performance depends less on model size than on the domain and task specificity of the language model. Med-BERT [], another BERT-inspired model, improved prediction accuracy by 20% in disease prediction studies by pretraining on EHR datasets.

More recently, researchers have built BERT-based models for specific domains and tasks []. UmlsBERT [] built semantic embeddings linking words to concepts in the UMLS Metathesaurus and proposed masked modeling with a multi-label loss function. SMedBERT [] presented a similar approach to knowledge-based semantic representation but structured the neighboring entities to learn heterogeneous information. UmlsBERT and SMedBERT enhanced performance, with F1 scores of 84% and 86%, respectively. Similarly, LinkBERT and BioLinkBERT [] incorporated ontological knowledge to better capture links between entities in the corpus. LinkBERT used a multi-task learning framework, training on several related tasks simultaneously to extract entity relations more effectively. ExKidneyBERT, CaBERTnet, and MMBERT extracted more precise answers from individual departmental reports [,,].
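The multi-label masked-modeling idea can be sketched as follows: instead of a one-hot target for the masked word, every UMLS-linked synonym is marked as a valid prediction. The vocabulary and synonym table below are invented for illustration and do not come from the UmlsBERT release.

```python
# Hedged sketch of multi-label masked-token targets in the spirit of
# UmlsBERT; the vocabulary and synonym dictionary are invented.
VOCAB = ["[MASK]", "kidney", "renal", "heart", "cardiac"]
SYNONYMS = {"kidney": {"kidney", "renal"}, "heart": {"heart", "cardiac"}}

def multilabel_target(masked_word):
    """Build a multi-hot target vector: every concept-linked synonym
    of the masked word counts as a correct prediction, not just the
    original surface form."""
    valid = SYNONYMS.get(masked_word, {masked_word})
    return [1.0 if w in valid else 0.0 for w in VOCAB]

print(multilabel_target("kidney"))  # → [0.0, 1.0, 1.0, 0.0, 0.0]
```

Training against such targets (eg, with a binary cross-entropy loss per vocabulary entry) rewards the model for predicting any clinically equivalent term.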

On the other hand, BioVAE [] used the OPTIMUS framework pretrained with SciBERT [] and GPT-2 [,] and outperformed the baseline models on biomedical text mining. To address linguistic disparity, SPBERTQA [] proposed a 2-stage multilingual language model pretrained on SBERT [], answering user questions using a multiple negatives ranking loss with multilingual BERT.

However, BERT-style models from previous studies are better suited to understanding context than to generating text. To this end, BioMedLM, a GPT-architecture model built mainly for biomedical question answering [], achieved 50% accuracy on recent question-answering benchmarks when summarizing patients’ questions, even in real-world settings with limited data. BioGPT [] applied a 2-step fine-tuning method to remove noise in the data and improved results by 6.0% over BioLinkBERT on medical question-answering tasks.

Recent studies have introduced significant advancements, such as BioMEDGPT [], the first multimodal GPT for aligning biological data with human language, achieving 76.1% accuracy. Clinical Camel [], using LLaMA-2, demonstrated superior performance with 5-shot accuracy ranging from 47.0% to 74.3%, outperforming GPT-3.5. MedAlpaca [] focused on privacy and medical certifications, attaining 21.1%-24.1% accuracy. MedPaLM-2 [] reached 67.6% accuracy through instruction prompt tuning, and MEDITRON [] achieved 79.8% accuracy, marking a 6% improvement upon existing models and setting a new benchmark.

Summarization

For many years, the medical field has struggled to gain efficient and rapid access to its fast-growing volume of textual data. The key to a timely and efficient clinical workflow is automatic summarization of clinical text. Summarization is an important NLP technique in health care, condensing medical contexts into concise summaries. It can be applied to medical records, literature, clinical trial reports, and other medical texts, giving clinical providers quick access to relevant information without the need to skim through lengthy documents. Overall, summarization can aid clinicians in decision-making through effective and prompt communication during physician-patient meetings, as well as support knowledge discovery for medical research [].

The key models are BioBERTSum, AlphaBERT, ClinicalBertSum, ChestXRayBERT, RadBERT, LF2BERT, and DAMEN.

To alleviate the difficulty of learning sentence- and document-level features in biomedical literature summarization, Du et al [] proposed the first PLM for medical extractive summarization, BioBERTSum. BioBERTSum captures domain-aware token- and sentence-level context using a sentence position embedding mechanism that inserts structural information into the vector representation. It achieved a ROUGE-L score of 0.68, outperforming standard BERT models. AlphaBERT [] proposed an extractive diagnostic summarization model using character-level tokens to reduce the model size and achieved a ROUGE-L score of 0.693, easing the burden on emergency department physicians of reading patients’ complex discharge notes.
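The ROUGE-L scores reported for these models measure the longest common subsequence (LCS) of words between a generated and a reference summary. A minimal sketch of the F1 variant, on whitespace tokens (real evaluations typically apply stemming and more careful tokenization):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

print(round(rouge_l_f1("patient stable after treatment",
                       "patient stable after the treatment"), 3))  # → 0.889
```

Because the LCS preserves word order without requiring contiguity, ROUGE-L rewards summaries that keep the reference's sentence structure even when some words are dropped.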

To better use clinical notes, ClinicalBertSum [] used the ClinicalBERT, SciBERT, and BertSum models during the fine-tuning and summarization process to automatically extract summaries from clinical abstracts. Similarly, ChestXRayBERT used BERT to perform an automatic abstractive summarization on radiology reports [], with ROUGE-1 scores of 0.70 and 0.73, respectively. RadBERT [], which was fine-tuned for radiology report summarization, achieved 10% fewer annotated sentences during the training, demonstrating the benefit of domain-specific pretraining to increase the overall performance.

LF2BERT [] applied a Longformer neural network and BERT in an encoder-decoder framework to process longer input sequences and outperformed human summarization, according to doctors’ evaluations. DAMEN [] used BERT together with BART to discriminate important topic-related sentences, outperforming previous methods for summarizing multiple medical articles via a token probability distribution method. The proposed probabilistic method selects only the relevant chunks of information and then assigns probabilities to the tokens within each chunk, rather than operating at the sentence level, to effectively reduce redundancy. Moreover, to overcome the long-sequence issue, Li et al [] comparably proposed Clinical-Longformer and Clinical-Big Bird, pretrained on the Longformer [] and Big Bird [] models, respectively. Both proposed models use sparse attention mechanisms that scale linearly with sequence length, mitigating memory consumption and capturing longer-range dependencies.
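The sparse attention used by Longformer-style models can be illustrated with a sliding-window mask: each token attends only to its local neighborhood, so the number of attended pairs grows linearly rather than quadratically with sequence length. The window size and sequence length below are arbitrary illustrative values.

```python
def sliding_window_mask(seq_len, window):
    """Longformer-style local attention mask: token i may attend to
    token j only when |i - j| <= window, giving O(seq_len * window)
    attended pairs instead of O(seq_len ** 2)."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(6, 1)
print(sum(sum(row) for row in mask))  # → 16 attended pairs, vs 36 for full attention
```

Production models additionally grant a few tokens (eg, [CLS]) global attention so that long-range information can still propagate; that refinement is omitted here for brevity.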
