Medical Text Simplification Using Reinforcement Learning (TESLEA): Deep Learning–Based Text Simplification Approach


Introduction

Background

Research from the field of biomedicine contains essential information about new clinical trials on topics related to new drugs and treatments for a variety of diseases. Although this information is publicly available, it often has complex medical terminology, making it difficult for the general public to understand. One way to address this problem is by converting the complex medical text into a simpler language that can be understood by a wider audience. Although manual text simplification (TS) is one way to address the problem, it cannot be scaled to the rapidly expanding body of biomedical literature. Therefore, there is a need for the development of natural language processing approaches that can automatically perform TS.

Related Studies

TS Approaches

Initial research in the field of TS focused on lexical simplification (LS) [,]. An LS system typically involves replacing complex words with simpler alternatives using lexical databases, such as the Paraphrase Database [] or WordNet [], or using language models, such as bidirectional encoder representations from transformers (BERT) []. Recent research defines TS as a sequence-to-sequence (seq2seq) task and has approached it by leveraging model architectures from other seq2seq tasks such as machine translation and text summarization [-]. Nisioi et al [] proposed a neural seq2seq model based on long short-term memory (LSTM) networks for automatic TS. The model was trained on complex-simple sentence pairs, and human evaluations showed that the system-generated outputs preserved meaning and were grammatically correct []. Afzal et al [] incorporated LSTMs to create a quality-aware text summarization system for medical data. Zhang and Lapata [] developed an LSTM-based neural encoder-decoder TS model and trained it using reinforcement learning (RL) to directly optimize SARI [] scores along with a few other rewards. SARI is a widely used metric for automatic evaluation of TS.

With the recent progress in natural language processing research, LSTM-based models were outperformed by transformer []-based language models [-]. Transformers follow an encoder-decoder structure with both the encoder and decoder made up of L identical layers. Each layer consists of 2 sublayers, one being a feed-forward layer and the other a multihead attention layer. Transformer-based language models, such as BART [], generative pretraining transformer (GPT) [], and text-to-text-transfer-transformer [], have achieved strong performance on natural language generation tasks such as text summarization and machine translation.

Building on the success of transformer-based language models, Martin et al [] recently introduced multilingual unsupervised sentence simplification (MUSS) [], a BART []-based language model that achieved state-of-the-art performance on TS benchmarks by training on paraphrases mined from the CCNet [] corpus. Zhao et al [] proposed a semisupervised approach that incorporated a back-translation architecture along with denoising autoencoders for automatic TS. Unsupervised TS is also an active area of research but has been primarily limited to LS. However, in a recent study, Surya et al [] proposed an unsupervised approach to perform TS at both the lexical and syntactic levels. In general, research in the field of TS has focused mostly on sentence-level simplification. However, Sun et al [] proposed a document-level data set (D-Wikipedia) and baseline models to perform document-level simplification. Similarly, Devaraj et al [] proposed a BART []-based model that was trained using unlikelihood loss for paragraph-level medical TS. Although their training regime penalizes terms considered "jargon" and increases readability, the generated text has lower quality and diversity []. Thus, the scarcity of document- and paragraph-level simplification research makes this an important work in the advancement of the field.

TS Data Sets

The majority of TS research uses data extracted from Wikipedia and news articles [,,]. These data sets are paired sentence-level data sets (ie, for each complex sentence, there is a corresponding simple sentence). TS systems have heavily relied on sentence-level data sets, extracted from regular and simple English Wikipedia, such as WikiLarge [], because they are publicly available. It was later shown by Xu [] that there are issues with data quality for the data sets extracted from Wikipedia. They proposed the Newsela corpus, which was created by educators who rewrote news articles for different school-grade levels. Automatic sentence alignment methods [] were used on the Newsela corpus to create a sentence-level TS data set. Despite the advancements in research on sentence-level simplification, there is a need for TS systems that can simplify text at a paragraph level.

Recent work has focused on the construction of document-level simplification data sets [,,]. Sun et al [] constructed a document-level data set, called D-Wikipedia, by aligning the English Wikipedia and Simple English Wikipedia, spanning 143,546 article pairs. Although there are many data sets available for sentence-level TS, data sets for domain-specific paragraph-level TS are lacking. In the field of medical TS, Van den Bercken et al [] constructed a sentence-level simplification data set using sentence alignment methods. Recently, Devaraj et al [] proposed the first paragraph-level medical simplification data set, containing 4459 simple-complex pairs of text, and this is the data set used for the analysis and baseline training in this study. A snippet of a complex paragraph and its simplified version from the data set proposed by Devaraj et al [] is shown in Figure 1. The data set is open sourced and publicly available [].

Figure 1. Complex medical paragraph and the corresponding simple medical paragraph from the data set.

TS Evaluation

The evaluation of TS usually falls into 2 categories: automatic evaluations and manual (ie, human) evaluations. Because of the subjective nature of TS, it has been suggested that the best approach is to perform manual evaluations, based on criteria such as fluency, meaning preservation, and simplicity []. Automatic evaluation metrics most commonly used include readability indices such as Flesch-Kincaid Reading Ease [], Flesch-Kincaid Grade Level (FKGL) [], Automated Readability Index (ARI), Coleman-Liau index, and metrics for natural language generation tasks such as SARI [] and BLEU [].

Readability indices are used to assign a grade level to text signifying its simplicity. All the readability indices are calculated using some combination of word weighting, syllable, letter, or word counts, and are shown to measure some level of simplicity. Automatic evaluation metrics, such as BLEU [] and SARI [], are widely used in TS research, with SARI [] having specifically been developed for TS tasks. SARI is computed by comparing the generated simplifications with both the source and target references. It computes an average of F1-score for 3 n-gram overlap operations: additions, keeps, and deletions. Both BLEU [] and SARI [] are n-gram–based metrics, which may fail to capture the semantics of the generated text.

Objective

The aim of this study is to develop an automatic TS approach that is capable of simplifying medical text data at a paragraph level, with the goal of providing greater accessibility of biomedical research. This paper uses RL-based training to directly optimize 2 properties of simplified text: relevance and simplicity. Relevance is defined as simplified text that retains salient and semantic information from the original article. Simplicity is defined as simplified text that is easy to understand and lexically simple. These 2 properties are optimized using TS-specific rewards, resulting in a system that outperforms previous baselines on Flesch-Kincaid scores. Extensive human evaluations are conducted with the help of domain experts to judge the quality of the generated text.

The remainder of the paper is organized as follows: The "Methods" section provides details on the data set, the training procedure, and the proposed model, and describes how automatic and human evaluations were conducted to analyze the outputs generated by the proposed model (TESLEA). The "Results" section provides a brief description of the baseline models and the results obtained by conducting automatic and manual evaluation of the generated text. Finally, in the "Discussion" section, we highlight the limitations and future work, and draw conclusions.


Methods

Model Objective

Given a complex medical paragraph, the goal of this work is to generate a simplified paragraph that is concise and captures the salient information expressed in the complex text. To accomplish this, an RL-based simplification model is proposed, which optimizes multiple rewards during training, and is tuned using a paragraph-level medical TS data set.

Data Set

The Cochrane Database of Systematic Reviews is a health care database with information on a wide range of clinical topics. Each review includes a plain language summary (PLS) written by the authors, who follow guidelines to structure the summaries. PLSs are supposed to be clear, understandable, and accessible, especially for a general audience not familiar with the field of medicine. PLSs are highly heterogeneous in nature and are not paired (ie, for every complex sentence there may not be a corresponding simpler version). However, Devaraj et al [] used the Cochrane Database of Systematic Reviews data to produce a paired data set, which has 4459 pairs of complex-simple text, with each text containing fewer than 1024 tokens so that it can be fed into the BART [] model for the purpose of TS. This pioneering data set developed by Devaraj et al [] is used in this study for training the models and is publicly available [].

TESLEA: TS Using RL

Model and Rewards

The TS solution proposed for the task of simplifying complex medical text uses an RL-based simplification model, which optimizes multiple rewards (the relevance reward, the Flesch-Kincaid grade reward, and the lexical simplicity reward) to achieve a more complete and concise simplification. The following subsections introduce the computation of these rewards, along with the training procedure.

Relevance Reward

Relevance reward measures how well the semantics of the target text is captured in its simplified version. This is calculated by computing the cosine similarity between the target text embedding (ET) and the generated text embedding (EG). BioSentVec [], a text embedding model trained on medical documents, is used to generate the text embeddings. The steps to calculate the relevance score are depicted in Algorithm 1.

The RelevanceReward function takes 3 arguments as input, namely, target text (T), generated text (G), and the embedding model (M). The function ComputeEmbedding takes the input text and embedding model (M) as input and generates the relevant text embedding. Finally, cosine similarity between generated text embedding (EG) and target text embedding (ET) is calculated to get the reward (Algorithm 1, line 4).
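To make the computation concrete, the following is a minimal Python sketch of Algorithm 1. It assumes that BioSentVec is loaded through the sent2vec package and that the model file name corresponds to the publicly distributed binary; both are assumptions about tooling rather than details taken from the paper.

```python
import numpy as np
import sent2vec  # assumed tooling: BioSentVec is distributed as a sent2vec binary

# Hypothetical path to the pretrained BioSentVec model file
embedding_model = sent2vec.Sent2vecModel()
embedding_model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

def relevance_reward(target_text: str, generated_text: str) -> float:
    """Algorithm 1: cosine similarity between target and generated text embeddings."""
    e_t = embedding_model.embed_sentence(target_text)[0]     # ET
    e_g = embedding_model.embed_sentence(generated_text)[0]  # EG
    denom = np.linalg.norm(e_t) * np.linalg.norm(e_g)
    return float(np.dot(e_t, e_g) / denom) if denom > 0 else 0.0
```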

Flesch-Kincaid Grade Reward

FKGL refers to the grade level that must be attained to comprehend the presented information. A higher FKGL score indicates that the text is more complex, and a lower score indicates that the text is simpler. The FKGL for a text (S) is calculated using equation 1 []:

FKGL(S) = 0.39 × (total words/total sentences) + 11.8 × (total syllables/total words) – 15.59 (1)

The FKGL reward (RFlesch) is designed to reduce the complexity of generated text and is calculated as presented in Algorithm 2.

In Algorithm 2, the function FleschKincaidReward takes 2 arguments as inputs, namely, generated text (G) and target text (T). The FKGLScore function calculates the FKGL for the given text. Once the FKGL for T and G is calculated, the Flesch-Kincaid reward (RFlesch) is calculated as the relative difference between r(T) and r(G) (Algorithm 2, line 4), where r(T) and r(G) denote the FKGL of the target and generated text.
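A minimal sketch of Algorithm 2 is shown below, using the textstat package to compute the FKGL. The exact normalization of the relative difference is an assumption, as only the high-level description is given above.

```python
import textstat

def fkgl_reward(generated_text: str, target_text: str) -> float:
    """Algorithm 2: reward is the relative difference between r(T) and r(G),
    so generating text below the target grade level yields a positive reward."""
    r_t = textstat.flesch_kincaid_grade(target_text)     # r(T)
    r_g = textstat.flesch_kincaid_grade(generated_text)  # r(G)
    if r_t == 0:
        return 0.0
    return (r_t - r_g) / abs(r_t)
```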

Lexical Simplicity Reward

Lexical simplicity is used to measure whether the words in the generated text (G) are simpler than the words in the source text (S). Laban et al [] proposed a lexical simplicity reward that uses the correlation between word difficulty and word frequency []. As word frequency follows Zipf's law, Laban et al [] used it to design the reward function, which involves calculating the Zipf frequency of newly inserted words, that is, Z(G – S), and deleted words, that is, Z(S – G). The lexical simplicity reward is defined in the same way as proposed by Laban et al [] and is described in Algorithm 3. The analysis of the data set proposed by Devaraj et al [] revealed that 87% of simple-complex pairs have a value of ΔZ(S, G) ≈ 0.4, where ΔZ(S, G) = Z(G – S) – Z(S – G) is the difference between the Zipf frequency of inserted words and that of deleted words, with the value of the lexical reward (Rlexical) scaled between 0 and 1.

In Algorithm 3, LexicalSimplicityReward requires the source text (S) and the generated text (G) as inputs. The functions ZIPFInserted [] and ZIPFDeleted [] calculate the Zipf frequency of newly inserted words and deleted words, respectively. Finally, the lexical reward (Rlexical) is calculated and normalized, as described in line 5.
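A sketch of Algorithm 3 follows, using the wordfreq package for Zipf frequencies. Whitespace tokenization and the scaling constant of 0.4 (taken from the data set analysis above) are simplifying assumptions about the exact normalization.

```python
from wordfreq import zipf_frequency

def mean_zipf(words) -> float:
    """Average Zipf frequency of a collection of words (empty collection gives 0)."""
    words = list(words)
    return sum(zipf_frequency(w.lower(), "en") for w in words) / len(words) if words else 0.0

def lexical_simplicity_reward(source_text: str, generated_text: str,
                              delta_max: float = 0.4) -> float:
    """Algorithm 3: Zipf frequency of inserted words minus that of deleted words,
    clipped and scaled to [0, 1]."""
    src_words, gen_words = set(source_text.split()), set(generated_text.split())
    z_inserted = mean_zipf(gen_words - src_words)  # Z(G - S)
    z_deleted = mean_zipf(src_words - gen_words)   # Z(S - G)
    delta_z = z_inserted - z_deleted
    return max(0.0, min(1.0, delta_z / delta_max))
```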

Training Procedure and Baseline Model

Pretrained BART

The baseline language model used in this study for performing simplification was BART [], a transformer-based encoder-decoder model that was pretrained using a denoising objective function. The decoder part of the model is autoregressive in nature, making it well suited for sentence-generation tasks. Furthermore, the BART model achieves strong performance on natural language generation tasks such as summarization, as demonstrated on the XSum [] and CNN/Daily Mail [] data sets. In this study, a version of BART fine-tuned on the XSum [] data set is used.

Language Model Fine-tuning

Transformer-based language models are pretrained on a large corpus of text and later fine-tuned on a downstream task by minimizing the maximum likelihood loss (Lml) function []. Consider a paired data set C, where each instance consists of a source sequence containing n tokens x = (x1, ..., xn) and a target sequence containing m tokens y = (y1, ..., ym). The Lml function is then given in equation 2, with the computation described in Algorithm 4:

Lml(θ) = –Σt=1…m log pθ(yt | y<t, x) (2)

where θ represents the model parameters and y<t denotes the tokens preceding position t [].
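In practice, equation 2 corresponds to the standard cross-entropy loss returned by seq2seq implementations. The sketch below assumes the Hugging Face transformers library and the XSum-fine-tuned BART checkpoint mentioned earlier; beyond that, the details are not taken from the paper.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-xsum")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-xsum")

def mle_loss(source: str, target: str) -> torch.Tensor:
    """Equation 2: negative log-likelihood of the target tokens given the source,
    averaged over target positions (teacher forcing)."""
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=1024)["input_ids"]
    return model(**inputs, labels=labels).loss
```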

However, the results obtained by minimizing Lml are not always optimal. There are 2 main reasons for the degradation of results. The first is "exposure bias" [], which arises because the model is conditioned on gold-standard tokens at each step during training but must condition on its own predictions at test time, resulting in an accumulation of errors during prediction. The second is "representation collapse" [], which is a degradation of the pretrained language model representations during fine-tuning. Ranzato et al [] avoided the problem of exposure bias by directly optimizing the specific discrete metric instead of minimizing Lml, with the help of an RL-based algorithm called REINFORCE []. A variant of REINFORCE [] called Self-Critical Sequence Training [] was used in this study to directly optimize rewards specifically designed for TS; more information is provided in the following subsection.

Self-critical Sequence Training

TS can be formulated as an RL problem, where the "agent" (language model) interacts with the environment to take an "action" (next word prediction) based on a learned "policy" (pθ) defined by model parameters θ, while observing some rewards (R). In this work, BART [] was used as the language model, and the REINFORCE [] algorithm was used to learn an optimal policy that maximizes rewards. Specifically, REINFORCE was used with a baseline to stabilize the training procedure, using an objective function (Lpg) with a baseline reward b (equation 3):

Lpg = –(1/n) Σi=1…n (r(ys) – b) log pθ(yis | y1s, ..., yi–1s, x) (3)

where pθ(yis|...) denotes the probability of the ith word conditioned on the sequence previously sampled by the model; r(ys) denotes the reward computed for a sentence generated using sampling; x denotes the source sentence; and n is the length of the generated sentence. Rewards are computed as a weighted sum of the relevance reward (Rcosine), RFlesch, and the lexical simplicity reward (Rlexical), as given in equation 4:

R = α × Rcosine + β × RFlesch + d × Rlexical (4)

where α, β, and d are the weights associated with the rewards, respectively.
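Using the reward functions sketched earlier, the weighted combination of equation 4 can be written as follows; the default weight values are placeholders, not the tuned values used in this study.

```python
def compute_rewards(source: str, generated: str, target: str,
                    alpha: float = 1.0, beta: float = 1.0, d: float = 1.0) -> float:
    """Equation 4: weighted sum of relevance, FKGL, and lexical simplicity rewards."""
    r_cosine = relevance_reward(target, generated)            # semantic relevance
    r_flesch = fkgl_reward(generated, target)                 # readability gain
    r_lexical = lexical_simplicity_reward(source, generated)  # lexical simplicity
    return alpha * r_cosine + beta * r_flesch + d * r_lexical
```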

To approximate the baseline reward, Self-Critical Sequence Training [] was used. The baseline was calculated by computing the reward r(y*) for a sentence generated by the current model using greedy decoding; its computation is described in Algorithm 5. The loss function is defined in equation 5:

Lpg = –(1/n) Σi=1…n (r(ys) – r(y*)) log pθ(yis | y1s, ..., yi–1s, x) (5)

where y* denotes the sentence generated using greedy decoding. More details on greedy decoding are described in (see also [,,,,,-]).
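A minimal sketch of the self-critical loss of equation 5 is given below; log_probs_sampled is assumed to hold the per-token log probabilities of the sampled sequence collected during decoding.

```python
import torch

def scst_loss(log_probs_sampled: torch.Tensor,
              reward_sampled: float, reward_greedy: float) -> torch.Tensor:
    """Equation 5: policy-gradient loss with the greedy reward r(y*) as the baseline.

    log_probs_sampled has shape (sequence_length,) and contains
    log p_theta(y_i^s | y_<i^s, x) for the sampled sequence y^s."""
    advantage = reward_sampled - reward_greedy
    # Sampled sequences whose reward beats the greedy baseline are reinforced;
    # sequences that score below the baseline are suppressed.
    return -advantage * log_probs_sampled.mean()
```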

Figure 2. The ComputeRewards function calculates a weighted sum of three rewards: the FKGL reward, the lexical simplicity reward, and the relevance reward.

Intuitively, minimizing the loss described in equation 5 increases the likelihood of the sampled sequence (ys) when its reward, r(ys), is greater than the baseline reward, that is, when the sample returns a higher reward than r(y*). Samples that obtain a lower reward are suppressed. The model is trained using a combination of Lml and the policy gradient loss, similar to []. The overall loss is given as follows:

L = γLpg + (1 – γ)Lml (6)

where γ is a scaling factor that can be tuned.

Summary of the Training Process

Overall, the training procedure follows a 2-step approach. As the pretrained BART [] was not trained on medical domain–related text, it was first fine-tuned on the document-level paired data set [] by minimizing Lml (maximum likelihood estimation [MLE]; equation 2). In the second part, the fine-tuned BART model was trained further using RL. The RL procedure of TESLEA involves 2 steps: (1) the RL step and (2) the MLE optimization step, which are both shown in Figure 3 and further described in Algorithm 6. The given simple-complex text pairs are converted to tokens as required by the BART model. In the MLE step, these tokens are used to compute logits from the model, and then the MLE loss is computed. In the RL step, the model generates simplified text using 2 decoding strategies: (1) greedy decoding and (2) multinomial sampling. Rewards are computed as weighted sums for the sentences generated by both decoding strategies. These rewards are then used to calculate the loss for the RL step. Finally, a weighted sum of the losses is computed, which is used to estimate the gradients and update the model parameters. All the hyperparameter settings used are included in (see also [,,,,,-]).
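The sketch below combines the pieces from the earlier snippets into one RL-phase update, mirroring Algorithm 6 at a high level. The helpers greedy_decode and sample_with_log_probs are hypothetical wrappers around the model's generation routine, and gamma is a placeholder value rather than the tuned setting.

```python
def teslea_training_step(source: str, target: str, optimizer, gamma: float = 0.9):
    """One combined MLE + RL update (a sketch; assumes the functions defined above)."""
    # MLE step: teacher-forced cross-entropy on the gold simplification (equation 2)
    l_ml = mle_loss(source, target)

    # RL step: decode with greedy search and with multinomial sampling
    greedy_text = greedy_decode(model, tokenizer, source)                      # y*
    sampled_text, log_probs = sample_with_log_probs(model, tokenizer, source)  # y^s

    r_greedy = compute_rewards(source, greedy_text, target)
    r_sampled = compute_rewards(source, sampled_text, target)
    l_pg = scst_loss(log_probs, r_sampled, r_greedy)  # equation 5

    # Equation 6: weighted combination of the policy-gradient and MLE losses
    loss = gamma * l_pg + (1 - gamma) * l_ml
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```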

Figure 3. Reinforcement learning–based training procedure for TESLEA. MLE: maximum likelihood estimation; RL: reinforcement learning.

Automatic Metrics

Two readability indices were used to perform automatic evaluations of the generated text, namely, the FKGL and the Automated Readability Index (ARI). The SARI score is a standard metric for TS. The F1 versions of the ROUGE-1 and ROUGE-2 [] scores were also reported. Readers can find more details about these metrics in . To measure the quality of the generated text, the criteria proposed by Yuan et al [] were used, which are mentioned in the "Automatic Evaluation Metrics" section in . The criteria proposed by Yuan et al [] can be automatically computed using a language model–based metric called "BARTScore." Further details on how to use BARTScore to measure the quality of the generated text are also mentioned in .
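For reference, the readability and overlap metrics can be computed with common open-source packages, as in the sketch below (textstat for FKGL and ARI, rouge_score for ROUGE). SARI and BARTScore require dedicated implementations (eg, the easse and BARTScore packages), whose interfaces are not reproduced here.

```python
import textstat
from rouge_score import rouge_scorer

def automatic_metrics(generated: str, reference: str) -> dict:
    """Readability (FKGL, ARI) and ROUGE-1/2 F1 for one generated paragraph."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    rouge = scorer.score(reference, generated)  # score(target, prediction)
    return {
        "fkgl": textstat.flesch_kincaid_grade(generated),
        "ari": textstat.automated_readability_index(generated),
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rouge2_f": rouge["rouge2"].fmeasure,
    }
```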

Human Evaluations

In this study, 3 domain experts judged the quality of the generated text based on the factors mentioned in the previous section. The evaluators rated the text on a Likert scale from 1 to 5. First, simplified test data were generated using TESLEA, and then 51 generated paragraphs were randomly selected, creating 3 subsets containing 17 paragraphs each. Every evaluator was presented with 2 subsets, that is, a total of 34 complex-simple TESLEA-generated paragraph pairs. The evaluations were conducted via Google Forms, and the human annotators were asked to rate the quality of simplification for informativeness (INFO), fluency (FLU), coherence (COH), factuality (FAC), and adequacy (ADE). All the collected data were stored in CSV files for statistical analysis.

Figure 4. A sample question seen by the human annotator.
Results

Overview

This section consists of 3 subsections, namely, (1) Baseline Models, (2) Automatic Evaluations, and (3) Human Evaluations. The first section highlights the baseline models used for comparison and analysis. The second section discusses the results obtained by performing automatic evaluations of the model. The third and final section discusses the results obtained from human assessments and analyzes the relationship between human annotations and automatic metrics.

Baseline Models

TESLEA is compared with other strong baseline models and their details are discussed below:

BART-Fine-tuned: BART-Fine-tuned is a BART-large model fine-tuned using Lml on the data set proposed by Devaraj et al []. Studies have shown that large pretrained models often perform competitively when fine-tuned for downstream tasks, thus making this a strong competitor.

BART-UL: Devaraj et al [] also proposed BART-UL for paragraph-level medical TS. It is the first model to perform paragraph-level medical TS and has achieved strong results on automated metrics. BART-UL was trained using an unlikelihood objective function that penalizes the model for generating technical words (ie, complex words). Further details on the training procedure of BART-UL are described in .

MUSS: MUSS [] is a BART-based language model that was trained by mining paraphrases from the CCNet corpus []. MUSS was trained on a data set consisting of 1 million paraphrases, helping it achieve a strong SARI score. Although MUSS is trained on a sentence-level data set, it still serves as a strong baseline for comparison. Further details on the training procedure for MUSS are discussed in .

Keep it Simple (KIS): Laban et al [] proposed an unsupervised approach for paragraph-level TS. KIS is trained using RL and uses the GPT-2 model as a backbone. KIS has shown strong performance on SARI scores, beating many supervised and unsupervised TS approaches. Additional details on the training procedure for KIS are described in .

PEGASUS models: PEGASUS is a transformer-based encoder-decoder model that has achieved state-of-the-art results on many text-summarization data sets. It was specifically designed for the task of text summarization. In our analysis, we used 2 variants of PEGASUS models, namely, (1) PEGASUS-large, the large variant of the PEGASUS model, and (2) PEGASUS-pubmed-large, the large variant of the PEGASUS model that was pretrained on a PubMed data set. Both PEGASUS models were fine-tuned using Lml on the data set proposed by Devaraj et al []. For more information regarding the PEGASUS model, readers are referred to [].

The models described above are the only ones available for medical TS as of June 2022.

Results of Automatic Metrics

The metrics used for automatic evaluation are FKGL, ARI, ROUGE-1, ROUGE-2, SARI, and BARTScore. The mean readability index scores (ie, FKGL and ARI) obtained by the various models are reported in Table 1. ROUGE-1, ROUGE-2, and SARI scores are reported in Table 2, and BARTScore is reported in Table 3.

Table 1. Flesch-Kincaid Grade Level and Automated Readability Index for the generated text.a

Text | Flesch-Kincaid Grade Level | Automated Readability Index
Baseline
Technical abstracts | 14.42 | 15.58
Gold-standard references | 13.11 | 15.08
Model generated
BART-Fine-tuned | 13.45 | 15.32
BART-UL | 11.97 | 13.73b
TESLEA | 11.84b | 13.82
MUSSc | 14.29 | 17.29
Keep it Simple | 14.15 | 17.05
PEGASUS-large | 14.53 | 17.55
PEGASUS-pubmed-large | 16.35 | 19.8

aTESLEA significantly reduces FKGL and ARI scores when compared with plain language summaries.

bBest score.

cMUSS: multilingual unsupervised sentence simplification.

Table 2. ROUGE-1, ROUGE-2, and SARI scores for the generated text.a

Model | ROUGE-1 | ROUGE-2 | SARI
BART-Fine-tuned | 0.40 | 0.11 | 0.39
BART-UL | 0.38 | 0.14 | 0.40b
TESLEA | 0.39 | 0.11 | 0.40b
MUSSc | 0.23 | 0.03 | 0.34
Keep it Simple | 0.23 | 0.03 | 0.32
PEGASUS-large | 0.44b | 0.18b | 0.40b
PEGASUS-pubmed-large | 0.42 | 0.16 | 0.40b

aTESLEA achieves similar performance to other models. Higher scores of ROUGE-1, ROUGE-2, and SARI are desirable.

bBest performance.

cMUSS: multilingual unsupervised sentence simplification.

Table 3. Faithfulness Score and F-score for the text generated by the models.a

Model | Faithfulness Score | F-score
BART-Fine-tuned | 0.137 | 0.078
BART-UL | 0.242 | 0.061
TESLEA | 0.366b | 0.097b
MUSSc | 0.031 | 0.029
Keep it Simple | 0.030 | 0.028
PEGASUS-large | 0.197 | 0.073
PEGASUS-pubmed-large | 0.29 | 0.063

aHigher scores of Faithfulness and F-score are desirable.

bHighest score.

cMUSS: multilingual unsupervised sentence simplification.

Readability Indices, ROUGE, and SARI Scores

The readability index scores reported in Table 1 suggest that the FKGL scores obtained by TESLEA are better (ie, lower) than the FKGL scores obtained by comparing technical abstracts (ie, the complex medical paragraphs available in the data set) with the gold-standard references (ie, the simple medical paragraphs corresponding to the complex medical paragraphs). Moreover, TESLEA achieves the lowest FKGL score (11.84) among all the compared models, indicating a significant improvement in TS. The results suggest that (1) BART-based transformer models are capable of performing simplification at the paragraph level such that the outputs are at a reduced reading level (FKGL) compared with the technical abstracts, the gold-standard references, and the baseline models, and (2) the proposed method of optimizing TS-specific rewards allows the generation of text with greater readability than even the gold-standard references, as indicated by the FKGL scores in Table 1. The reduction in FKGL scores can be explained by the fact that FKGL was part of a reward (RFlesch) that was directly optimized.

In addition, we report the SARI [] and ROUGE [] scores, as shown in Table 2. SARI is a standard automatic metric used in sentence-level TS tasks, and ROUGE is a standard metric in text summarization tasks. The results show that TESLEA matches the performance of the baseline models on both ROUGE and SARI. Although there are no clear patterns in the ROUGE and SARI scores, there are differences in the quality of text generated by these models, which are explained in the "Text Quality Measure" subsection.

Text Quality Measure

There has been significant progress in designing automatic metrics that are able to capture linguistic quality of the text generated by language models. One such metric that is able to measure the quality of generated text is BARTScore []. BARTScore has shown strong correlation with human assessments on various tasks ranging from machine translation to text summarization. BARTScore has 4 different metrics (ie, Faithfulness Score, Precision, Recall, F-score), which can be used to measure different qualities of generated text. Further details on how to use BARTScore are mentioned in .
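Conceptually, BARTScore scores a hypothesis by the average log-likelihood that a BART model assigns to it given a conditioning text. The sketch below illustrates this idea with the Hugging Face transformers library and the CNN/Daily Mail checkpoint; it is an approximation of the published metric, not the official implementation.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

bs_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bs_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def bartscore(conditioning_text: str, hypothesis: str) -> float:
    """Average log-likelihood of the hypothesis given the conditioning text
    (higher is better). Faithfulness conditions on the source; the precision and
    recall variants swap the roles of reference and hypothesis."""
    with torch.no_grad():
        enc = bs_tokenizer(conditioning_text, return_tensors="pt",
                           truncation=True, max_length=1024)
        labels = bs_tokenizer(hypothesis, return_tensors="pt",
                              truncation=True, max_length=1024)["input_ids"]
        loss = bs_model(**enc, labels=labels).loss  # mean token-level cross-entropy
        return -loss.item()
```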

According to the analysis conducted by Yuan et al [], the Faithfulness Score measures 3 aspects of generated text (COH, FLU, and FAC), and the F-score measures 2 aspects (INFO and ADE). In our analysis, we use these 2 variants of BARTScore to measure COH, FLU, FAC, INFO, and ADE. TESLEA achieves the highest values of Faithfulness Score (0.366) and F-score (0.097) (Table 3), indicating that the rewards designed for the purpose of TS not only help the model generate simplified text but also, to some degree, preserve the quality of the generated text. The F-scores of all the models are relatively poor (ie, scores closer to 1 are desirable). One reason for the low F-scores could be the introduction of misinformation or hallucinations in the generated text, a common problem for language models, which could be addressed by adopting training strategies that focus on INFO via rewards or objective functions.

For the qualitative analysis, we randomly selected 50 samples from the test data and calculated the average number of tokens based on the BART model vocabulary. As a readability measure, we calculated the FKGL scores of the generated texts and noted any textual inconsistencies such as misinformation. The analysis revealed that the text generated by most models was significantly shorter than the gold-standard references (Table 4). Furthermore, TESLEA- and BART-UL–generated texts were significantly shorter than those of the other baseline models, and TESLEA had the lowest FKGL score among all the models, as depicted in Table 4.

From a qualitative point of view, the outputs generated by most baseline models involve significant duplication of text from the original complex medical paragraph. The outputs generated by the KIS model were incomplete and appear "noisy" in nature. One possible reason for this noise is unstable training due to the lack of a large corpus of domain-specific data. BART-UL–generated paragraphs are simplified, as indicated by the FKGL and ARI scores, but they are extractive in nature (ie, the model learns to select simplified sentences from the original medical paragraph and combines them to form a simplification). PEGASUS-pubmed-large–generated paragraphs are also extractive in nature and similar to BART-UL–generated paragraphs, but they were observed to be grammatically inconsistent. In contrast to the baseline models, the text generated by TESLEA was concise, semantically relevant, and simple, without involving any medical domain–related complex vocabulary. Figure 5 shows an example of text generated by all the models, with blue text indicating copied text.

In addition to the duplicated text, the models also induced misinformation in the generated text. The most common form of induced misinformation observed was "The evidence is current up to [date]," as shown in Figure 6. This error arises from the structure of the data: the PLSs contain such statements, but they are absent from the original technical text, so the model learns to add them to the generated text even though they are not factually supported. Thus, considerable attention should be paid to including FAC measures in the training regime of these models. For a more complete assessment of the quality of simplification, human evaluation of the text generated by TESLEA was conducted using domain experts.

Table 4. Average number of tokens and average Flesch-Kincaid Grade Level scores for selected samples.

Model | Number of tokens | Flesch-Kincaid Grade Level
Technical abstracts | 498.11 | 14.37
Gold-standard references | 269.74 | 12.77
TESLEA | 131.37 | 12.34
BART-UL | 145.08 | 12.66
Keep it Simple | 187.59 | 13.78
Multilingual unsupervised sentence simplification | 193.07 | 13.86
PEGASUS-large | 272.04 | 13.93
PEGASUS-pubmed-large | 150.00 | 15.09

Figure 5. Comparison of text generated by all the models. The highlighted blue text indicates copying. CI: confidence interval; FEV: forced expiratory volume; N: population size; PEV: peak expiratory flow; RR: respiratory rate.

Figure 6. Example of misinformation found in generated text. CIDSL: Cornelia de Lange syndrome; IVIg: intravenous immune globulin; MS: multiple sclerosis; PE: plasma exchange.

Human Evaluations

For this research, 3 domain experts assessed the quality of the generated text based on the factors of INFO, FLU, COH, FAC, and ADE, as proposed by Yuan et al [], which are discussed in . To measure interrater reliability, the percentage agreement between the annotators was calculated, and the results are shown in Table 5. The average percentage agreement is above 70% for FLU, COH, FAC, and ADE, indicating that the annotators largely agree in their evaluations.

The average Likert score for each factor, as reported by each rater, is shown in Table 6. From these data, the raters judged COH and FLU to be of the highest quality, with ADE, FAC, and INFO also rated reasonably high.

To further assess whether the results obtained by automated metrics truly signify an improvement in the quality of the text generated by TESLEA, the Spearman rank correlation coefficient was calculated between the human ratings and the automatic metrics for all 51 generated paragraphs, with the results shown in Table 7. BARTScore has the highest correlation with human ratings for FLU, FAC, COH, and ADE compared with the other metrics. A few text samples along with their human annotations and automated metric scores are shown in Figure 7.
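The correlation analysis can be reproduced with scipy; the sketch below uses made-up ratings for 5 paragraphs purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical example: mean fluency ratings and BARTScore values for 5 paragraphs
human_fluency = [4.0, 3.5, 5.0, 4.5, 3.0]
bart_scores = [-1.9, -2.4, -1.5, -1.7, -2.8]

rho, p_value = spearmanr(human_fluency, bart_scores)
print(f"Spearman rank correlation = {rho:.2f} (P = {p_value:.3f})")
```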

Table 5. Average percentage interrater agreement.

Interrater agreement | Informativeness, % | Fluency, % | Factuality, % | Coherence, % | Adequacy, %
A1a and A2b | 82.35 | 82.35 | 82.35 | 70.59 | 82.35
A1 and A3c | 70.59 | 58.82 | 70.59 | 70.59 | 70.59
A3 and A2 | 52.94 | 70.59 | 74.51 | 74.51 | 64.71
Average (% agreement) | 68.63 | 70.59 | 74.51 | 74.51 | 72.55

aA1: annotator 1.

bA2: annotator 2.

cA3: annotator 3.

Table 6. Average Likert score by each rater for informativeness, fluency, factuality, coherence, and adequacy.

Rater | Informativeness | Fluency | Factuality | Coherence | Adequacy
A1 | 3.82 | 4.12 | 3.91 | 3.97 | 3.76
A2 | 3.50 | 4.97 | 3.59 | 4.82 | 3.68
A3 | 4.06 | 3.94 | 3.85 | 3.94 | 3.85
Average Likert score | 3.79 | 4.34 | 3.78 | 4.24 | 3.76

Table 7. Spearman rank correlation coefficient between automatic metrics and human ratings for the text generated by TESLEA.

Metric | Informativeness | Fluency | Factuality | Coherence | Adequacy
ROUGE-1 | 0.18a | –0.04 | –0.01 | –0.05 | 0.06
ROUGE-2 | 0.08 | –0.01 | –0.05 | –0.04 | 0.05
SARI | 0.09 | –0.66 | –0.13 | –0.01 | 0.01
BARTScore | 0.08 | 0.32a | 0.38a | 0.22a | 0.07a

aBest result.

Figure 7. Samples of complex, simple (gold-standard), and generated medical paragraphs along with automated metrics and human annotations.
Discussion

Principal Findings

The most up-to-date research in biomedicine is often inaccessible to the general public because of domain-specific medical terminology. One way to address this problem is to create a system that converts complex medical information into a simpler form, thus making it accessible to everyone. In this study, a TS approach was developed that can automatically simplify complex medical paragraphs while maintaining the quality of the generated text. The proposed approach trains the transformer-based BART model with RL to optimize TS-specific rewards, which help the model generate simpler text while maintaining quality. As a result, the trained model generates simplified text that reduces the complexity of the original text by 2 grade levels, as measured by the FKGL []. From the results obtained, it can be concluded that TESLEA is effective in generating simpler text compared with the technical abstracts, the gold-standard references (ie, the simple medical paragraphs corresponding to the complex medical paragraphs), and the baseline models. Although previous work [] developed baseline models for this task, to the best of our knowledge, this is the first time RL has been applied to medical TS. Moreover, previous studies did not analyze the quality of the generated text, which this study measures via the factors of FLU, FAC, COH, ADE, and INFO. Manual evaluations of TESLEA-generated text were conducted with the help of domain experts using these factors, and further analysis was conducted to determine which automatic metrics agree with the manual annotations using the Spearman rank correlation coefficient. The analysis revealed that BARTScore [] correlates best with the human annotations for text generated by TESLEA, indicating that TESLEA learns to generate semantically relevant and fluent text that conveys the essential information of the complex medical paragraph. These results suggest that (1) TESLEA can perform TS of medical paragraphs such that the outputs are simple and maintain quality, and (2) the rewards optimized by TESLEA help the model capture syntactic and semantic information, increasing the FLU and COH of the outputs, as reflected in both BARTScore and human evaluations.

Limitations and Future Work

Although this research is a significant contribution to the literature on medical TS, the proposed approach has a few limitations, and addressing them could result in even better outputs. TESLEA can generate simpler versions of the text, but in some instances, it induces misinformation, reducing the FAC and INFO of the generated text. Therefore, there is a need to design rewards that account for the FAC and INFO of the generated text. We also plan to conduct extensive human evaluations on a larger scale for the text generated by various models (eg, KIS, BART-UL) using domain experts (ie, physicians and medical students).

Transformer-based language models are sensitive to the pretraining regime, so a possible next step is to pretrain a language model on domain-specific raw data sets such as PubMed [], which will help develop domain-specific vocabulary for the model. Including these strategies may help in increasing the simplicity of the generated text.

Conclusion

Interest in and need for TS in the medical domain are growing as the quantity of data continuously increases. Automated systems, such as the one proposed in this paper, can dramatically increase the accessibility of information for the general public. This work not only provides a technical solution for automated TS, but also lays out and addresses the challenges of evaluating the outputs of such systems, which can be highly subjective. It is the authors' sincere hope that this work allows other researchers to build on and improve the quality of similar efforts.

The authors thank the research team at DaTALab, Lakehead University, for their support. The authors also thank Compute Canada for providing the computational resources without which this research would not have been possible. This research is funded by NSERC Discovery (RGPIN-2017-05377) held by Dr. Vijay Mago. The authors thank Mr. Aditya Singhal (MSc student at Lakehead University) for providing his feedback on the manuscript.

None declared.

Edited by T Hao; submitted 18.03.22; peer-reviewed by T Zhang, S Kim, H Suominen; comments to author 27.06.22; revised version received 08.08.22; accepted 12.10.22; published 18.11.22

©Atharva Phatak, David W Savage, Robert Ohle, Jonathan Smith, Vijay Mago. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 18.11.2022.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
