SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

In this section, details are provided regarding the EHR data used in our study and all aspects associated with the annotation schema, including the guideline definition, annotation tool development, annotation process itself, corpus reliability calculation, and segmentation of the created ground-truth. A broad perspective on our methodology is presented in Fig. 1.

Fig. 1

A broad view of SemClinBr corpus development. The diagram is an overview of the SemClinBr corpus development, which shows the selection of thousands of clinical notes from multiple hospitals and medical specialties. A multidisciplinary team developed the elements in orange, representing (i) the fine-grained annotation schema following the UMLS semantic types and (ii) the web-based annotation tool featuring the UMLS REST API. These resources supported the generation of the ground truth (i.e., gold standard), which was evaluated intrinsically (i.e., inter-annotation agreement) and extrinsically in two different NLP tasks (i.e., named entity recognition and negation detection)

Data preparation

The data were obtained from two different sources: (1) a corpus of 2,094,929 entries generated between 2013 and 2018 by a group of hospitals in Brazil, and (2) a corpus of 5,617 entries originating from a university hospital, covering the period between 2002 and 2007. In the first dataset, each entry was associated with structured data (e.g., gender, medical specialty, entry date) as well as unstructured data in free-text format representing sections of a clinical narrative (e.g., disease history, family history, and main complaint). Data were obtained from the records of approximately 553,952 patients. In addition to its multi-institutional character, the corpus covers various medical specialties (e.g., cardiology, nephrology, and endocrinology) and clinical narrative types (e.g., discharge summaries, nursing notes, admission notes, and ambulatory notes).

The second dataset contained only one document type (discharge summaries) and came exclusively from the Cardiology Service Center. Its configuration included structured data (i.e., gender, birth date, start date, end date, and ICD-10 code) and a single free-text field for the discharge summary. The texts from both datasets shared characteristics known to be typical of clinical narratives in general [49], such as uncertainty, redundancy (often due to copy and paste), heavy use of acronyms and medical jargon, misspellings, fragmented sentences, punctuation issues, and incorrect lower- and uppercasing. Some text examples are presented in Table 1. The first example is a discharge summary from the cardiology section, with the complete history of care written in regular, long sentences with no apparent format standardization. The second sample shows an ambulatory note describing a patient visit to the nephrology department, which includes concise sentences written in uppercase letters, a high frequency of acronyms, and a lack of punctuation. The third text is a nursing note describing the nursing team's monitoring of the patient. The de-identification process is described in the "Annotation tool" section.

Table 1 Samples of different types of clinical narratives from our corpus

Document selection

The original and primary focus of the intended semantic annotation was twofold: (i) to support the development of an NER algorithm to be used in a summarization method and (ii) to evaluate a semantic search algorithm, both targeting the cardiology and nephrology specialties. Thus, almost 500 clinical notes were randomly selected from these two medical specialties (including the complete longitudinal records of two patients). To compensate for the lack of corpora for pt-br, the scope of this study was broadened to support other biomedical natural language processing (bio-NLP) tasks and medical specialties. Documents from other medical areas were randomly selected to complete 1000 clinical narratives, assuming that the data were consistent and representative enough for the training of an ML model. Table 2 presents the number of documents per specialty. The average document length was approximately 148 tokens, and the average sentence length was approximately ten tokens.

Table 2 The medical specialties frequency table

Note that several documents were categorized as "Not defined" because this is one of the majority classes in the corpus we received. When these documents are analyzed, it becomes clear that these patients were either (a) under the care of multiple medical specialties (e.g., patients with multiple traumas in the intensive care unit) or (b) in the middle of a diagnostic investigation. Specialties with fewer than ten documents were grouped as "Others" (e.g., urology, oncology, gynecology, rheumatology, proctology). Regarding document types, the selected documents comprised 126 discharge summaries, 148 admission notes, 232 nursing notes, and 506 ambulatory notes.

Text organization

Table 3 presents the data available for each entry in the database (concerning the first, main data source). To obtain a single text file per entry, all free-text fields were concatenated into one text document to be annotated. In addition to the already known issues in clinical texts, our database presented other problems. The medical staff were supposed to distribute the patient's clinical note across the free-text fields: the EHR application provides one textbox for each field, and these fields serve as clinical narrative sections. However, most clinicians entered all of the text in the history-of-disease field only, leaving the other fields blank, which makes it difficult to search for specific information in the narrative (e.g., to look for family history). In addition, some texts were written entirely in uppercase letters, directly interfering with text processing steps such as finding abbreviations and identifying proper nouns.
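To illustrate the concatenation step described above, the sketch below merges the free-text fields of one entry into a single annotatable document. It is a minimal sketch: the field names are hypothetical and only mimic the section structure of Table 3, not the database's actual schema.

```python
# Minimal sketch of the concatenation step described above.
# Field names are hypothetical; the real database schema (Table 3) may differ.
from typing import Dict

FREE_TEXT_FIELDS = [
    "main_complaint",
    "disease_history",
    "family_history",
    "physical_examination",
]

def build_document(entry: Dict[str, str]) -> str:
    """Concatenate the non-empty free-text fields of one EHR entry."""
    parts = []
    for field in FREE_TEXT_FIELDS:
        text = (entry.get(field) or "").strip()
        if text:
            # Keep a section header so the origin of each block stays visible.
            parts.append(f"[{field.upper()}]\n{text}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    entry = {
        "main_complaint": "DOR TORACICA HA 2 DIAS",
        "disease_history": "Paciente hipertenso, nega DM.",
        "family_history": "",
    }
    print(build_document(entry))
```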

Table 3 Database entry data configuration

Annotation schema

In this section, we describe the entire annotation schema, including the conception and evolution of the annotation guidelines, the development of a tool to support and improve the annotation workflow, and an overview of the annotation process and its experimental setup. The steps of the annotation process took into account the lessons learned from similar annotation projects reviewed in the Methods section.

Annotation guidelines

To ensure gold standard quality, it is crucial to maintain the homogeneity of the annotation throughout the entire process. To guide the annotators and answer possible questions, a set of guidelines was defined that explains, in detail, how to annotate each type of concept, with examples illustrating what should and should not be annotated.

The first step was to define the information to annotate within the text. For the clinical concepts, we opted for UMLS semantic types (STYs) as annotation tags (e.g., "Body Location or Region," "Clinical Attribute," "Diagnostic Procedure," "Disease or Syndrome," "Finding," "Laboratory or Test Result," "Sign or Symptom," and "Therapeutic or Preventive Procedure"). Table 4 presents some of the most commonly used STYs with examples.

Table 4 Text samples containing the most used STYs

The first reason for choosing UMLS STYs was to ensure a fine-grained annotation, establishing the ground truth for the evaluation of a semantic search algorithm that labels entities using the STYs (more than a hundred types). The risk of agreement loss (a consequence of the reliability and granularity trade-off discussed earlier) was deemed acceptable, because the greater coverage of concepts in the corpus allows the gold standard to be used in a larger number of bio-NLP tasks. Even when a task requires low granularity, it is possible to export the existing annotations to their corresponding UMLS semantic groups (SGRs). The second reason for using the UMLS STYs was their link to the UMLS Metathesaurus, which can serve as an important guide to annotators, as they can search for a specific concept to confirm which STY they should annotate.
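The sketch below illustrates the kind of export mentioned above, collapsing fine-grained STY tags into coarser semantic groups. Only a small, hand-picked subset of the mapping is shown (the "Disorder" grouping follows the example given later in this section); it is not the full UMLS STY-to-SGR table.

```python
# Minimal sketch of exporting STY annotations to UMLS semantic groups (SGRs).
# Only a few illustrative mappings are shown; the complete STY-to-SGR table
# is distributed with the UMLS and is not reproduced here.
STY_TO_SGR = {
    "Sign or Symptom": "Disorder",
    "Finding": "Disorder",
    "Disease or Syndrome": "Disorder",
    "Diagnostic Procedure": "Procedures",
    "Therapeutic or Preventive Procedure": "Procedures",
    "Body Location or Region": "Anatomy",
}

def to_semantic_group(sty: str) -> str:
    """Map a semantic type to its semantic group, keeping unknown tags as-is."""
    return STY_TO_SGR.get(sty, sty)

# Example: a fine-grained annotation exported for a coarse-grained task.
annotation = {"span": "asma grave", "sty": "Disease or Syndrome"}
annotation["sgr"] = to_semantic_group(annotation["sty"])
print(annotation)  # {'span': 'asma grave', 'sty': 'Disease or Syndrome', 'sgr': 'Disorder'}
```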

The UMLS REST API allows the annotation tool to automatically suggest STYs for clinical concepts. Because the STYs do not cover two important bio-NLP tasks, two more types were added to our tag set: "Negation" and "Abbreviation." The first aims to identify negation cues associated with clinical concepts (already tested in the negation detection algorithm presented in a later section). The "Abbreviation" type was incorporated to help in the process of abbreviation disambiguation. It is important to emphasize that these two extra tags complement, rather than replace, the types adopted from the UMLS. For example, it is possible to mark the term "PA" (blood pressure) as both Clinical Attribute and Abbreviation.
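As an illustration of this multi-label tagging, a single span can carry several semantic types at once, and negation cues are tagged as separate spans. The record layout below is hypothetical and only mirrors the example in the text; it is not the tool's internal format.

```python
# Hypothetical annotation records showing a span tagged with multiple
# semantic types and a separate negation cue, as described above.
annotations = [
    {"id": "T1", "span": "PA", "start": 34, "end": 36,
     "types": ["Clinical Attribute", "Abbreviation"]},   # blood pressure
    {"id": "T2", "span": "nega", "start": 58, "end": 62,
     "types": ["Negation"]},                              # negation cue ("denies")
]
print(annotations[0]["types"])  # ['Clinical Attribute', 'Abbreviation']
```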

Sometimes, when extracting semantic meaning from clinical text, the semantic value of a concept alone is not sufficient to infer important events and situations. Hence, the annotation of relationships between clinical concepts was incorporated into the guidelines. The relation annotation schema was partially derived from the UMLS relationship hierarchy. Unlike the concept schema, a restricted set of tags was used instead of the 54 UMLS relationship types (RTYs), to simplify the relation annotation, which was not our main focus. Our tag set included only "associated_with" and "negation_of," the latter not being a UMLS RTY but added to complement the Negation STY. The five major non-hierarchical RTYs (i.e., conceptually_related_to, functionally_related_to, physically_related_to, spatially_related_to, and temporally_related_to), which connect concepts by their semantic values, are represented only by their parent RTY, "associated_with." Depending on the STYs of the connected concepts, it is possible to infer the sub-type of "associated_with" automatically. Once the concepts and relationships were defined, an annotation script was established whereby the annotator first labeled all of the concepts and then annotated the relations. This order was adopted because Campillos et al. [15] found that the agreement between annotators was higher when annotation was performed this way.
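The sketch below shows how such an automatic sub-type inference could look. The rules listed are hypothetical and purely illustrative; they do not reproduce the project's actual inference logic.

```python
# Hypothetical illustration of inferring an "associated_with" sub-type from
# the STYs of the two connected concepts. These rules are examples only and
# do not reproduce the project's actual inference logic.
SUBTYPE_RULES = {
    ("Sign or Symptom", "Body Location or Region"): "spatially_related_to",
    ("Disease or Syndrome", "Temporal Concept"): "temporally_related_to",
    ("Laboratory or Test Result", "Diagnostic Procedure"): "functionally_related_to",
}

def infer_subtype(sty_a: str, sty_b: str) -> str:
    """Return a sub-type of 'associated_with' for a concept pair, if any."""
    return (SUBTYPE_RULES.get((sty_a, sty_b))
            or SUBTYPE_RULES.get((sty_b, sty_a))
            or "associated_with")

print(infer_subtype("Body Location or Region", "Sign or Symptom"))
# -> spatially_related_to
```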

Deleger et al. [41] stated that the most challenging STY to annotate was "Finding," because it is a very broad type that can correspond to signs/symptoms (e.g., "fever"), diseases/disorders (e.g., "severe asthma"), laboratory or test results (e.g., "abnormal ECG"), and general observations (e.g., "in good health"). To avoid ambiguity, the definition of "Finding" was simplified: annotators always gave preference to the disease/disorder and laboratory-result STYs over "Finding." Only results of physical examination considered normal were marked as "Finding" (e.g., "flat abdomen" and "present diuresis"); abnormal ones were labeled as "Sign or Symptom." This can cause discrepancies between UMLS concepts and our annotations; however, it makes sense for our task. Using these definitions, the first draft of the guidelines was prepared and handed to the annotators.

Furthermore, training was provided to acquaint the participants with the annotation tool and expose them to some of the difficulties of the process. Then, an iterative process was started to refine the guidelines, check the consistency of annotation between annotators (more details on the inter-annotator agreement are provided in the Results section), and provide feedback on the annotators' work. When the agreement remained stable for three consecutive rounds (no significant reduction or improvement), it was assumed that there was no further room for guideline adaptation, and the final annotation process could be initiated. A flowchart of the process, similar to that of Roberts et al. [6] and others, is shown in Fig. 2.

Fig. 2

Revision and quality verification process of the annotation guidelines. The iterative process started with the first guideline draft; then, a small number of documents were double-annotated, and their inter-annotator agreement was calculated. If the agreement remained stable, the guidelines were considered good enough to proceed with gold standard production. Otherwise, the annotation differences were discussed, the guidelines were updated, and the process was restarted
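A minimal sketch of the stopping criterion used in this loop is shown below, assuming a running history of per-round agreement values; the tolerance used to decide "no significant change" is an assumption made for illustration.

```python
# Illustrative sketch of the stopping criterion for guideline refinement:
# stop when the inter-annotator agreement shows no significant change for
# three consecutive rounds. The tolerance value is an assumption.
def agreement_is_stable(iaa_history, rounds=3, tolerance=0.02):
    """True if the last `rounds` IAA values vary by less than `tolerance`."""
    if len(iaa_history) < rounds:
        return False
    window = iaa_history[-rounds:]
    return max(window) - min(window) < tolerance

print(agreement_is_stable([0.55, 0.63, 0.70, 0.71, 0.70]))  # -> True
```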

It is important to emphasize that even after reaching a stable agreement, the quality of the annotations continued to be evaluated and discussed among the annotation teams to avoid the continuous propagation of possible errors and disparities.

Annotation tool

The previously discussed issues and difficulties related to clinical annotation indicate the need for an annotation tool that can ease and accelerate the annotators' work. After analyzing Andrade et al.'s [50] review of annotation methods and tools applied to clinical texts, we decided to build our own tool. As a web-based application, it ensures that all annotators share the same annotation environment in real time and can work anywhere, at any time, without technical barriers. Furthermore, the project manager could better supervise and organize the work and assign the remaining tasks to team members without the need for in-person meetings, because the participants had very different and irregular schedules. Moreover, as UMLS semantic types were used in our schema, it was desirable to use the UMLS API and other local resources (e.g., clinical dictionaries) to make annotation suggestions to the user without pre-annotating the text. Finally, the tool was required to fit exactly into our annotation workflow, taking the raw data as input and producing the gold standard as output at the end of the process, dispensing with the need for external applications. The workflow of our tool consists of six main modules:

Importation: import data files into the system

Review: manually remove protected health information (PHI) that the anonymization algorithm failed to catch

Assignment: allocate text to annotators

Annotation: allow labeling of the clinical concepts within the text with one or multiple semantic types, supported by the Annotation Assistant feature

Adjudication: resolve double-annotation disparities in the creation of the gold standard

Exportation: export the gold standard in JSON or XML

The annotation assistant component was developed to prevent annotators from labeling all the text from scratch by providing them with suggestions for possible annotations based on (a) previously made annotations and (b) UMLS API exact-match and minor edit-distance lookup. The UMLS 2013AA version, which was adapted to pt-br by Oliveira et al. [51], was employed. Further details on the technical aspects, module functionalities, and experiments showing how the tool affects annotation time and performance are reported in [52].
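A minimal sketch of this suggestion logic is shown below, assuming a local lexicon of term-to-STY pairs built from previous annotations and UMLS lookups. The lexicon content, the similarity measure, and the threshold are illustrative assumptions, not the tool's actual implementation.

```python
# Minimal sketch of an annotation-suggestion lookup: exact match first, then a
# small edit-distance-style match against a local lexicon. The lexicon content
# and the similarity threshold are illustrative assumptions.
from difflib import SequenceMatcher

LEXICON = {
    "dispneia": ["Sign or Symptom"],
    "hipertensao arterial": ["Disease or Syndrome"],
    "ecocardiograma": ["Diagnostic Procedure"],
}

def suggest(term: str, max_ratio_gap: float = 0.15):
    """Return candidate STYs for a term: exact match, else a near match."""
    key = term.lower().strip()
    if key in LEXICON:                     # (a) exact match
        return LEXICON[key]
    best, best_ratio = None, 0.0
    for known in LEXICON:                  # (b) minor edit-distance lookup
        ratio = SequenceMatcher(None, key, known).ratio()
        if ratio > best_ratio:
            best, best_ratio = known, ratio
    if best is not None and best_ratio >= 1.0 - max_ratio_gap:
        return LEXICON[best]
    return []

print(suggest("Dispnéia"))  # minor spelling/diacritic variation still matches
```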

Annotation process

In addition to following the advice and recommendations found in previous sections, we adopted, similar to Roberts et al. [6], a well-known annotation methodology standard [53]. Furthermore, a guideline agreement step was added, all the text was double-annotated with differences resolved by a third experienced annotator (i.e., an adjudicator), and documents with low agreement were excluded from the gold standard. Pairing annotators to double-annotate a document prevents bias caused by possible mannerisms and recurrent errors of a single annotator. Moreover, it made it possible to check annotation quality by measuring the agreement between the two annotators.

It is almost impossible to achieve absolute truth in an intricate annotation effort such as this one. To approach a consistent ground truth as closely as possible, an adjudicator was responsible for resolving the differences between the annotators. It is worth mentioning that the adjudicator could remove annotations made by either annotator but could not create new annotations, to avoid basing the gold standard on the opinion of a single person. After the guideline maturation process, the final development stage of the gold standard was retrospectively divided into ground-truth phases 1 and 2. Annotators with different profiles and levels of expertise were recruited to provide different points of view during the guideline definition process and to determine whether there were differences in annotation performance between annotators with different profiles.

Ground-truth phase 1 involved a team of three people: (1) a physician with experience in documenting patient care and participation in a previous clinical text annotation project; (2) an experienced nurse; and (3) a medical student who already had ambulatory and EHR experience. The nurse and the medical student were responsible for the double annotation of the texts, and the physician was responsible for adjudication. When the process was almost 50% complete (with 496 documents annotated and adjudicated), more people were recruited to help finish the task (ground-truth phase 2). An extra team of six medical students, with the same background as the first one, was recruited (Fig. 3 illustrates the phases). A meeting was held to present the current guidelines document and to train the participants in using the annotation tool.

Fig. 3

Annotation process overview. The annotation process was divided into ground-truth phases 1 and 2, located above and below the dashed line, respectively. The elements in green represent the annotators, and the elements in orange represent the adjudicators

In phase 2, there were two adjudicators: the physician and the nurse. The nurse was added as an adjudicator because one extra adjudicator was needed during this phase and the nurse had more hospital experience than medical student 1. A homogeneous group of seven medical students was then formed to annotate the texts. The physician, the nurse, and medical student 1 supervised the first set of annotations of all students. The number of documents to be annotated was divided equally among the annotators and adjudicators, and the double annotators for each document were selected randomly, as were the adjudicators. It is worth noting that, in addition to those who worked directly with annotation and adjudication, a team of health informatics researchers supported the annotation project with other activities, including annotation tool development, guideline discussion, and annotation feedback.

Corpus reliability and segmentation

Taking advantage of the fact that the entire collection of documents was double-annotated, the IAA of all the data was calculated using the observed agreement metric (with no need for chance-correction calculations, as described in the Methods section). The following four variants of this metric were used:

Strict (full span and STY match)

Lenient (partial span and STY match)

Flexible (full span and SGR match)

Relaxed (partial span and SGR match)

For the strict version of the IAA, a match was counted when the two annotators labeled the same textual span with an equivalent semantic type; all other cases were considered non-matches. The lenient version of the IAA also considered partial matches, that is, annotations that overlap in their textual spans (with the same STY); these were counted as half-matches in the formula. The third version of the IAA, called flexible, transformed the annotated STY into its corresponding SGR (e.g., the "Sign or Symptom," "Finding," and "Disease or Syndrome" STYs were converted to the "Disorder" SGR) prior to comparing whether the SGRs were equal (the textual spans needed to be the same). Finally, the fourth version of the IAA, relaxed, considered partial textual spans (overlaps) and SGRs at the same time.
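Based on the description above, the observed agreement can be written as follows. This is a reconstruction, assuming that full matches count as 1, partial matches as 0.5, and that the partial-match term is zero for the strict and flexible variants:

$$
\mathrm{IAA}_{\text{observed}} \;=\; \frac{\text{full matches} \;+\; 0.5 \times \text{partial matches}}{\text{full matches} \;+\; \text{partial matches} \;+\; \text{non-matches}}
$$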

To isolate the concept agreement scores from the relationship scores, the relationship IAA values were computed by considering only those relationships in which both annotators had labeled the two connected concepts. Otherwise, whenever an annotator did not find one of the concepts involved, the relationship IAA would be directly penalized.

Boisen et al. [53] recommended that only documents with an acceptable level of agreement be included in the gold standard, and this recommendation was followed here. However, because of the scarcity of this type of data in pt-br bio-NLP research, and because the limited amount of annotated data is often a bottleneck in ML [54], no documents were discarded; instead, the corpus was segmented into two parts, gold and silver. This division was based on the IAA value of each annotated document, such that documents with an IAA greater than 0.67 belong to the gold standard and all others to the silver standard. The 0.67 threshold was selected because, according to Artstein and Poesio [44], it is a tolerable value, whereas the 0.8 threshold was considered too rigorous given the complexity of our task and the number of people involved in it. The task complexity is explained by the heterogeneity of the data, which were obtained from multiple institutions, medical specialties, and types of clinical narratives. The study closest to ours in data diversity is that of Patel et al. [34], except that their data were obtained from a single institution. Moreover, despite the large amount of data they used, there were other differences between their study and ours; for example, they used a coarse-grained annotation scheme, grouping the STYs, which made the labeling less prone to errors. Finally, we believe that a significant portion of the errors that cause disagreements stem from repeated mistakes by one of the two annotators in a pair; such errors could be easily corrected by the adjudicator, as the examples in the following sections reveal.
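A minimal sketch of this segmentation step is shown below, assuming each document's observed agreement has already been computed; the record layout and field names are illustrative, not the corpus' actual export schema.

```python
# Minimal sketch of the gold/silver segmentation, assuming a per-document
# observed-agreement score (IAA) has already been computed. Field names are
# illustrative, not the corpus' actual export schema.
GOLD_THRESHOLD = 0.67  # tolerable agreement level (Artstein & Poesio)

documents = [
    {"doc_id": "doc_001", "iaa": 0.82},
    {"doc_id": "doc_002", "iaa": 0.58},
    {"doc_id": "doc_003", "iaa": 0.71},
]

gold = [d for d in documents if d["iaa"] > GOLD_THRESHOLD]
silver = [d for d in documents if d["iaa"] <= GOLD_THRESHOLD]

print(len(gold), "gold documents;", len(silver), "silver documents")
```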
