An annotated corpus of clinical trial publications supporting schema-based relational information extraction

Table 2 shows the number of annotations for individual entities and major complex entities in each disease corpus and in the joint corpus Glaucoma-T2DM. The number of annotations is almost balanced across the two disease corpora, except for Endpoint and Result instances, which are more numerous in the T2DM corpus than in the glaucoma corpus. This may be because T2DM studies typically include more endpoints than glaucoma studies, yielding a higher number of reported outcomes. It should be noted that the endpoints and their outcomes can refer to both primary and secondary outcomes, as well as adverse events.

Table 2 Number of annotations of single entities and main complex entities

To form the final corpus comprising both entity types, the complex entities (i.e., slot-filling templates) were exported to n-triple RDF format and the individual entities to CoNLL-style files. Figure 7 presents an example consisting of the different files corresponding to the annotations in Figs. 6 and 8, which are part of the resulting corpus.
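The two export formats can be illustrated with a minimal sketch. This is not the authors' exporter; the helper names, URIs, and labels below are hypothetical, chosen only to show the shape of a CoNLL-style line and an N-Triples statement:

```python
def to_conll(tokens, tags):
    """One 'token<TAB>tag' pair per line, as in a CoNLL-style entity file."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags))

def to_ntriple(subj, pred, obj):
    """Serialize one RDF statement between resources as an N-Triples line."""
    return f"<{subj}> <{pred}> <{obj}> ."

# Hypothetical annotation of a short phrase and one slot-template triple
print(to_conll(["timolol", "reduced", "IOP"], ["B-Drug", "O", "B-Endpoint"]))
print(to_ntriple("http://example.org/ClinicalTrial1",
                 "http://example.org/hasArm",
                 "http://example.org/Arm1"))
```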

Fig. 7

Example of annotations exported into CoNLL and RDF formats. The CoNLL-style file of entity annotations is at the top and the n-triple file of slot-template annotations is at the bottom

Fig. 8

Example of the annotation of the complex entity ClinicalTrial with SANTO (abstract PMID 29110647). The red arrows indicate the single entities (see Fig. 6) that fill the corresponding slots. The slots hasArm, hasPopulation and hasDiffBetweenGroups contain references to other complex entities

Inter-annotator agreement

Inter-annotator agreement (IAA) helps to assess the reliability of annotations produced independently by different annotators over a corpus. A high IAA indicates that the annotation task is well defined and that the annotations are consistent across annotators. Therefore, the annotations could be reproduced at other times and in similar contexts (e.g., other diseases). Thus, we calculated IAA for both single and complex entities as described in the remainder of this section.

Inter-annotator agreement on single entities

As our corpus contains fine-grained annotated entities, the IAA considers cases such as partial and exact annotation matches, as well as overlapping and embedded annotations, as depicted in Fig. 9.

Fig. 9

Example of annotation cases (excerpts from abstracts PubMed 8123096 and 23950156). a overlapping annotations of the same type, b embedded annotation of the same type (i.e., Drug), and c embedded annotations of different types

We rely on Cohen’s Kappa [19, 20] at the token level and for each annotation type that is accepted as a slot-filler. Cohen’s Kappa is calculated as follows:

κ = (P(A) - P(E)) / (1 - P(E))     (1)

Here, P(A) denotes the proportion of times that the two annotators agree, and P(E) is the proportion of times that it is expected they agree by chance. Kappa values lower than 0 indicate no agreement, 0-0.20 a slight, 0.21-0.40 a fair, 0.41-0.60 a moderate, 0.61-0.80 a substantial, and 0.81-1 an almost perfect agreement [21].
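A token-level sketch of this computation (illustrative only, not the authors' implementation; the label names are hypothetical):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' token-level label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # P(A): observed proportion of tokens on which the annotators agree
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(E): chance agreement, from the two annotators' label distributions
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_a - p_e) / (1 - p_e)

a = ["Drug", "Drug", "O", "O"]
b = ["Drug", "O", "Drug", "O"]
print(cohens_kappa(a, b))  # 0.0 -- agreement exactly at chance level
```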

Since there are more than 300 annotation categories, we grouped the annotations into their most general categories (i.e., ancestor classes). For example, DisorderOrSyndrome subsumes Glaucoma and AngleClosureGlaucoma. In addition to calculating Kappa for each annotation category, we also calculated the average Kappa for the whole corpus. As Table 3 shows, the average Kappa values for glaucoma and T2DM are 0.74 and 0.68, respectively, denoting substantial agreement. These results show that, although the clinical trials for glaucoma and T2DM differ in some of their characteristics, the IAA for both is substantial, suggesting that a good level of IAA can be reached in various disease contexts.
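The grouping into ancestor classes can be sketched as a walk up the class hierarchy. The fragment below is hypothetical except for the DisorderOrSyndrome example taken from the text; the real ontology has more than 300 classes:

```python
# Hypothetical fragment of the class hierarchy: child class -> parent class,
# with None marking a most-general (ancestor) class.
PARENT = {
    "Glaucoma": "DisorderOrSyndrome",
    "AngleClosureGlaucoma": "DisorderOrSyndrome",
    "DisorderOrSyndrome": None,
}

def ancestor(label):
    """Map an annotation category to its most general class before computing Kappa."""
    while PARENT.get(label) is not None:
        label = PARENT[label]
    return label

print(ancestor("AngleClosureGlaucoma"))  # DisorderOrSyndrome
```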

Table 3 Kappa values for the annotation of single entities. The hyphens indicate that no entities were annotated with the corresponding category

Further, it can be seen in Table 3 that there is a high IAA on the annotation of entities that are frequently reported in clinical trials and that are related to the comparison of treatments, such as EndPointDescription (g=0.79, d=0.77), Drug (g=0.83, d=0.91), DoseValue (g=0.96, d=0.62), ChangeValue (g=0.96, d=0.77), RelativeChangeValue (g=0.97, d=0.77), and ResultMeasuredValue (g=0.93, d=0.82) among others. The kappa values for these annotations are mostly higher for glaucoma (g) than for T2DM (d).

Causes of disagreement To analyze the causes of disagreement on the annotations, we classify the annotated entities as numeric or textual. Numeric entities are, for example, result values, p-values, etc., while textual entities are descriptions of preconditions, the objective of the study, etc. The disagreement in the annotation of textual entities was mainly due to the different text-span lengths chosen by each annotator. The disagreement on the annotations of numeric entities was lower than for textual entities; one of its most frequent causes was the exclusion or inclusion of a minus/plus sign in front of the annotated number. Another cause of disagreement was the annotation of homonymous entities: for example, some annotators annotated the symbol “%” as a unit of concentration, while others annotated it as a rate value.

Templates with low agreement for both diseases are, for example, ConflictInterest (g=0, d=0), MeasurementDevice (g=0.19, d=0), and ObservedResult (g=0.25, d=0.06), which correspond to textual entities that are infrequent in abstracts. Future work will reveal whether these slots are infrequent only in our data sample or infrequent in general. The importance of annotating these entities will have to be determined by how crucial they are to support the use case of automatically aggregating evidence from multiple clinical trials.

Inter-annotator agreement on the annotation of complex entities

Measuring agreement on complex entities requires that the annotators agree on 1.) the number of instances of a given complex class and 2.) the semantic structure of these instances according to the underlying ontology. In terms of slot-filling templates, this means that the annotators should agree on the number of instances of complex templates, the relationships between them, and their slot-filler entities.

Since the calculation of the IAA on complex entities implies checking several elements, we selected a sample of 20 abstracts (i.e., around 10% of the total number of abstracts in the corpus), comprising 10 abstracts on glaucoma and 10 on T2DM, that were slot-annotated by two different annotators. In this way, we could analyze the obtained IAA in more detail. The F1 score was used to evaluate the IAA on complex entities, considering that an F1 score of 1 represents perfect agreement. The F1 score is the harmonic mean of precision and recall as defined in Eq. (2), where tp are true positives, fp false positives, and fn false negatives:

F1 = 2·tp / (2·tp + fp + fn)     (2)
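Equation (2) in code form, as a direct sketch from raw counts:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 8 matching slot fillers, 2 spurious, 2 missing (illustrative numbers)
print(f1_score(8, 2, 2))
```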

First, F1 is computed to compare the annotated entities assigned to single-entity slots of complex entities labeled by two different annotators.

If there is more than one instance of the same complex-entity type in each annotation set, F1 is calculated for the different combinations of instances to estimate the best pairwise alignment, i.e., the pair with the highest F1. Then, recall and precision are updated for the slots of the compared instances according to this alignment. Note that only single-entity slot-fillers are considered for computing the best alignment.

Figure 10 depicts an example of the statistics (i.e., tp/fp/fn) for computing F1 to compare the complex-entity annotations of two annotators, where one of them is the gold-standard. If the single entities assigned to the slots of the pair of complex entities being compared match in type and value, then this counts as a true positive (tp). Otherwise, as a false positive (fp). When an annotation that exists in the gold-standard is missing, this counts as a false negative (fn).

Fig. 10

Example of the statistics for F1 to determine the IAA on complex entities between two annotators. I: the entity is an Individual. Otherwise, the entity is a literal value

The F1 scores for slots which have complex entities as slot-fillers are calculated using the previously computed best alignments. A pair of complex-entity slot-fillers is considered a tp if these slot-fillers have been aligned. For example, in the case presented in Fig. 10, Annotator1 identifies two instances of the complex entity Medication, called M1 and M2, and Annotator2 identifies only one Medication, M3. Thus, there are two pairs: (M1, M3) and (M2, M3). F1 is calculated for each pair and the one with the highest F1 is selected. Assuming that (M2, M3) is the best alignment, the F1 score for (M2, M3) is used to update the number of tp, fp, and fn. Because M1 does not have a corresponding peer, all its slot-filler entities are counted as fp.
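A minimal sketch of this best-pairing step, under the simplifying assumption that an instance is a dict of slot name to filler value (not the authors' implementation; slot names and values are illustrative):

```python
from itertools import product

def pair_f1(gold, pred):
    """F1 over the single-entity slot fillers of two complex-entity instances."""
    tp = sum(1 for slot, value in pred.items() if gold.get(slot) == value)
    fp = len(pred) - tp
    fn = sum(1 for slot, value in gold.items() if pred.get(slot) != value)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_alignment(instances_a, instances_b):
    """Return (index_a, index_b, f1) for the pair with the highest F1."""
    return max(
        ((i, j, pair_f1(a, b))
         for (i, a), (j, b) in product(enumerate(instances_a),
                                       enumerate(instances_b))),
        key=lambda triple: triple[2],
    )

# M1 and M2 from one annotator, M3 from the other (cf. the Fig. 10 example)
m1 = {"hasDrug": "brimonidine"}
m2 = {"hasDrug": "timolol", "hasDoseValue": "0.5"}
m3 = {"hasDrug": "timolol", "hasDoseValue": "0.5"}
print(best_alignment([m1, m2], [m3]))  # (1, 0, 1.0) -- M2 aligns with M3
```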

The position of the annotations in the text is not evaluated, since a slot can be filled with any annotation that fulfills the allowed type for this slot, regardless of its position in the text. For example, suppose a given abstract contains two entities annotated as Insulin, one at position 5 and one at position 25. One annotator chooses the entity at position 5 to fill the hasDrug slot in an instance of Medication and another annotator chooses the entity at position 25. Both entities are appropriate for this slot.

Since the distribution of the number of slot-fillers with respect to the slot types is unbalanced, we used the micro-averaged F1 score to measure the overall agreement, which weights each prediction equally. The micro-average is calculated from the true positives (tp), false positives (fp), and false negatives (fn) of the individual slot types: the tp, fp, and fn over all slot types are summed up and inserted into Eq. (2). The overall agreement reached is 0.81, as shown in Table 4.
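The micro-averaging step can be sketched as follows (the per-slot counts below are illustrative, not the paper's data):

```python
def micro_f1(counts_per_slot):
    """counts_per_slot: {slot_type: (tp, fp, fn)}. Sum the counts over all
    slot types first, then apply Eq. (2) once to the totals."""
    tp = sum(c[0] for c in counts_per_slot.values())
    fp = sum(c[1] for c in counts_per_slot.values())
    fn = sum(c[2] for c in counts_per_slot.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: a frequent easy slot and a harder cross-referencing slot
counts = {"hasDrug": (8, 1, 1), "hasOutcome": (2, 3, 3)}
print(round(micro_f1(counts), 3))
```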

Table 4 F1 scores for the IAA on complex entities in the glaucoma-T2DM corpus of 20 abstracts. Slot-fillers that contain reference to other complex entities are in italics and single entity slot-fillers in normal font

In Table 4, we can also observe that the F1 scores obtained for most of the single-entity slot-fillers of the complex entities range between 0.78 and 1.00, denoting high agreement. In contrast, the lower F1 scores, ranging from 0.32 to 0.69, belong mostly to complex-entity slot-fillers such as hasOutcome, hasMedication, and hasEndpoint. This shows that the annotators disagree more when slot-fillers cross-reference other complex entities than when they are single entities.

One of the causes of high IAA may be the fact that the slot-filling annotation was done on a corpus that contains curated annotations of single entities. On the other hand, we observe the following causes of disagreement:

i) The annotators fail to fill some slots,

ii) The annotators conceptualize the complex entities differently from what is stated in the guidelines. For example, this is the case when the annotators consider treatments applied before randomization as part of the compared interventions rather than as part of the preconditions. For instance, in the following excerpt (abstract PMID 24237386), the drug “metformin” is part of the precondition, since a criterion of eligibility for the clinical trial is that the participants have previously received a metformin treatment. However, the annotator created an intervention whose drug is metformin.

“Aim: This randomized, double-blind, placebo-controlled parallel-group study assessed the effects of sodium glucose cotransporter 2 inhibition by dapagliflozin on insulin sensitivity and secretion in subjects with type 2 diabetes mellitus (T2DM), who had inadequate glycemic control with metformin…”

Another example is when two drugs that are part of a fixed drug combination are mistakenly considered separately in two interventions, instead of in a single intervention. For example, in:

“Fixed-combination brimonidine-timolol versus latanoprost in glaucoma and ocular hypertension: a 12-week, randomized, comparison study.”

“brimonidine-timolol” should be annotated as the fixed drug combination Brimo/TimFC that belongs to a single intervention. Nevertheless, sometimes the annotators created two interventions for the same arm, one intervention for brimonidine and another for timolol.

Baseline method for single entity recognition

We carried out the recognition of single entities both in abstracts and full-text articles in order to compare how a system trained on annotated abstracts performs on these two types of text.

We used a BERT-based approach. BERT (Bidirectional Encoder Representations from Transformers) [22] is a language representation model designed to pretrain deep bidirectional representations from unlabeled text. BERT has been pretrained on a vast amount of unannotated text that is, for example, available on the web. After pretraining, BERT can be fine-tuned on smaller datasets for specific NLP tasks or domains.

We used a BERT model pretrained on MEDLINE/PubMed abstracts (Footnote 7). We fine-tuned this model by adding two layers that predict the start and end positions of entities per entity type. If a token at position p_s in a given sentence is predicted to be the start token for an entity of type t, then the corresponding end token is given by the nearest token at position p_e ≥ p_s that is predicted as the end position for an entity of type t. If there is no corresponding end token for a predicted start token, the predicted start token is ignored. We trained the model with the Adam optimizer [23] for 30 epochs. We only considered entity types that occur at least 20 times in the respective training set. We report precision, recall, and F1 on the test sets with exact matching of entities: a predicted entity in a given sentence is considered correctly classified if there is an entity in the test set of the same type with the same start and end positions.
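The start/end pairing rule can be sketched as a simple decoder over predicted positions (a simplified illustration, not the actual implementation; positions and type names are hypothetical):

```python
def decode_spans(start_preds, end_preds):
    """start_preds / end_preds: lists of (token_position, entity_type) that
    the model predicted as entity starts / ends. Each start is paired with
    the nearest end position p_e >= p_s of the same type; starts without a
    matching end are ignored."""
    spans = []
    for p_s, etype in start_preds:
        ends = [p_e for p_e, t in end_preds if t == etype and p_e >= p_s]
        if ends:
            spans.append((p_s, min(ends), etype))
    return spans

print(decode_spans([(2, "Drug"), (9, "DoseValue")], [(3, "Drug")]))
# [(2, 3, 'Drug')] -- the DoseValue start has no matching end and is dropped
```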

Entity recognition on abstracts

Table 5 shows the results of the recognition of single entities. We can observe that the micro-average F1 scores are similar for glaucoma and T2DM. The entities Drug and EndPointDescription obtained very low scores for glaucoma, while for T2DM these entities reached high scores. In the case of the Drug entity in glaucoma, the low scores may be due to the common presence of fixed-combination drugs used as treatments for this disease. For example, in the fixed combination “brimonidine/timolol”, brimonidine and timolol are each annotated as single Drug entities, while brimonidine/timolol is also annotated as one Drug entity that spans the two. It seems that the baseline method is not able to recognize this type of overlapping entity.

Table 5 Results of the single entity prediction with EXACT match on abstracts. The hyphens indicate that the entities do not appear in the respective datasets

We can also observe that, in general, entities that are long textual descriptions (e.g., ConclusionComment, ObservedResults and ObjectiveDescription) tend to get low scores with exact match. They may get higher scores with partial matching.

Entity recognition coverage on full-text articles

In order to see the performance of the baseline system fine-tuned with our abstract corpus on the task of recognizing entities with exact matches in full-text articles, we created a new test dataset. This dataset is composed of full articles that are freely available and correspond to some of the abstracts included in the test datasets for the previous experiment. The new test dataset is composed of 20 full-text articles, of which 13 articles are on T2DM and 7 on glaucoma. The abstracts, figures, tables, and references were removed from these files.

We used the fine-tuned BERT model on the full-text article dataset. We calculated the exact match by checking how many of the predicted entities were also tagged in the corresponding curated annotated abstracts (here called ground truth set). The results of this coverage are shown in Table 6.

Table 6 Results of the entity prediction with EXACT match on full text articles. The hyphens indicate that the entities do not appear in the respective datasets

The low scores obtained for the meta-information (i.e., Publication: Author, Title, Journal, PublicationYear, PMID, and Country) of the clinical trials for both diseases were mainly due to the different formats of the free full-text articles. For example, in abstracts the format for an author name is [surname(s) name initial(s)], while in the full text it is [name(s) surname]. For instance, in the abstract PMID 27740719 the name of the first author is written as “Shankar RR”, while in the corresponding full text (PMCID: PMC5415484) it is “R Ravi Shankar”. Because the system compares exact matches, it considers these author names as mismatches.

Possible reasons why the system did not find relevant information, such as baseline data, result values, and the difference between groups, are: 1.) these data were included in figures or tables, which were removed; 2.) the baseline system could not adequately predict them, as it was pretrained and fine-tuned on abstracts, whose structure differs from that of full texts; 3.) our comparison was quite strict, as it required exact matches. With partial matches, higher scores may be obtained.

We also tried a simple partial match, where a predicted entity was considered correct if there was an entity in the ground truth set with at least one overlapping token. Then, this entity in the ground truth set could not be used for any other subsequent alignment. The results in Table 7 with partial match show that the average precision scores for glaucoma and T2DM are similar to the ones reached with exact match in Table 6, while the average recall for both diseases increased.
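This simple partial-match rule can be sketched as follows (an illustrative implementation under the assumption that spans are half-open (start, end, type) intervals; not the authors' code):

```python
def count_partial_matches(predicted, gold):
    """Count predictions that overlap a same-type gold span by at least one
    token. Each gold span may be consumed by only one prediction, mirroring
    the one-time alignment described in the text."""
    used = set()
    tp = 0
    for ps, pe, ptype in predicted:
        for i, (gs, ge, gtype) in enumerate(gold):
            # overlap of half-open intervals [ps, pe) and [gs, ge)
            if i not in used and gtype == ptype and ps < ge and gs < pe:
                used.add(i)
                tp += 1
                break
    return tp

print(count_partial_matches([(0, 3, "Drug"), (10, 12, "Drug")],
                            [(2, 5, "Drug")]))  # 1
```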

Table 7 Results of the entity prediction on full text articles with PARTIAL match. The hyphens indicate that the entities do not appear in the respective datasets

Notice that a more complex partial matching method that considers overlapping entities of the same type and embedded entities of the same and different types would give more precise results than the one used here.
