MedLexSp – a medical lexicon for Spanish medical natural language processing

This section summarizes the methodology reported in [16], and explains the word-embedding-based method to collect new terms about the COVID-19 pandemic. Figure 2 depicts the approaches to create MedLexSp. Note that methods might be generalized across languages provided that similar resources are available.

Fig. 2

Methods applied to collect the MedLexSp lexicon

Base list

First, we used a list of medical terms developed by [44] (hereafter, the base list). This resource was collected from a corpus of Spanish medical texts (around 4 million tokens) by applying rules, part-of-speech tagging and medical affixes, comparing general and domain corpora, and statistical methods. The base list amounted to 38 354 tokens (base and variant forms). Not all the terms in the list were used to prepare MedLexSp. Because this lexicon was aimed at concept normalization, mainly using standard terminologies, we used the UMLS, and MedLexSp only includes terms mapped to Concept Unique Identifiers (CUIs). Approximately 47.61% of the entries in the original base list were mapped to CUIs, applying an exact-match criterion. For example, the CUI for neoplasia (‘neoplasm’, C0027651) was not assigned to neoplasia benigna (‘benign neoplasm’, C0086692), because these terms refer to different concepts. Once a stable list of terms was achieved, MedLexSp was enriched with several sources, as explained in the following sections.
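The exact-match criterion can be sketched as a simple dictionary lookup; the UMLS index below is a toy stand-in (two entries taken from the example above), not the real terminology file.

```python
# Sketch of the exact-match criterion used to map base-list terms to UMLS
# CUIs. The lookup table is a toy stand-in for the real UMLS index.

umls_index = {
    "neoplasia": "C0027651",          # 'neoplasm'
    "neoplasia benigna": "C0086692",  # 'benign neoplasm'
}

def map_to_cui(term, index):
    """Return a CUI only when the full term string matches exactly."""
    return index.get(term.lower().strip())

assert map_to_cui("neoplasia", umls_index) == "C0027651"
assert map_to_cui("Neoplasia benigna", umls_index) == "C0086692"
# A longer term never inherits the CUI of a shorter, partially matching one.
assert map_to_cui("neoplasia maligna", umls_index) is None
```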

Acronyms and abbreviations

We reused a dictionary collected by medical doctors [45], acronyms from Wikipedia, and the resources provided in the Biomedical Abbreviation Recognition and Resolution Challenge [46]. Acronyms and abbreviations were matched to UMLS CUIs semi-automatically and revised manually. This revision was essential because many are ambiguous: e.g. IM stands for insuficiencia mitral (‘mitral insufficiency’), infarto de miocardio (‘myocardial infarction’) or intramuscular (‘intramuscular’). Other items are invariant in English and Spanish (e.g. kg, ‘kilogram’), and their mapping was automatic. With these methods, the CUI of each acronym (e.g. EV, C0014383) was assigned to each full form (enterovirus), and vice versa. A complementary list of equivalences between acronyms/abbreviations and full forms (LR_abr.dsv) is also distributed.
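The propagation of CUIs between acronyms and full forms, with ambiguous items set aside for manual revision, could look roughly as follows; only the EV/enterovirus CUI comes from the text, and the IM expansions carry hypothetical placeholder codes.

```python
# Sketch of linking acronyms to the CUIs of their expansions. Only the
# EV/enterovirus pair uses the CUI cited in the text; the IM expansions
# use placeholder codes to illustrate how ambiguity is flagged.

full_forms = {
    "enterovirus": "C0014383",
    "insuficiencia mitral": "CUI_A",   # placeholder
    "infarto de miocardio": "CUI_B",   # placeholder
    "intramuscular": "CUI_C",          # placeholder
}
acronym_expansions = {
    "EV": ["enterovirus"],
    "IM": ["insuficiencia mitral", "infarto de miocardio", "intramuscular"],
}

def link_acronyms(expansions, cuis):
    """Assign each full form's CUI to its acronym; acronyms with several
    expansions are returned separately for manual revision."""
    linked, to_review = {}, {}
    for acro, forms in expansions.items():
        candidates = {f: cuis[f] for f in forms if f in cuis}
        if len(candidates) == 1:
            linked[acro] = next(iter(candidates.values()))
        else:
            to_review[acro] = candidates
    return linked, to_review

linked, to_review = link_acronyms(acronym_expansions, full_forms)
assert linked == {"EV": "C0014383"}
assert set(to_review) == {"IM"}
```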

Affixes and roots

We translated items from the Specialist Lexicon [36] (e.g. reno-, ‘kidney’), and reused a list from a previous work [47]. This list gathers suffixes recommended by the World Health Organization [48] to coin new drug terms: e.g. -cilina (‘-cillin’) is used for penicillin drugs. Morphological variants of affixes were created, including gender/number alternations (e.g. -scópico, -scópica and -scópicos, ‘-scopic’) or variants with tilde (-scopia and -scopía, ‘-scopy’). Then, a subset of items was mapped to UMLS CUIs, and variants were clustered for each base form and CUI. For example, the suffix -cilina was mapped to CUI C0030842 for ‘penicillins’, and all form variants (-cilina, -cilinas) were clustered. A complementary Lexical Record file (LR_affix.dsv) provides the equivalence between affixes/roots and their meanings.
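Generating gender/number alternations of adjectival suffixes can be sketched with simplified inflection rules; the rules below cover only the -o/-a alternation and plural -s and are an assumption for illustration, not the full rule set used in the paper.

```python
# Sketch of generating gender/number variants of Spanish adjectival
# suffixes. The inflection rules are deliberately simplified.

def suffix_variants(suffix):
    """Generate masculine/feminine and singular/plural variants of a
    suffix ending in -o (e.g. -scópico)."""
    variants = {suffix}
    if suffix.endswith("o"):
        stem = suffix[:-1]
        variants |= {stem + "a", stem + "os", stem + "as"}
    elif suffix.endswith("a"):
        variants.add(suffix + "s")
    return sorted(variants)

assert suffix_variants("-scópico") == [
    "-scópica", "-scópicas", "-scópico", "-scópicos"
]
assert suffix_variants("-cilina") == ["-cilina", "-cilinas"]
```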

Conjugated verbs

Medical events are commonly expressed with nouns (sangrado, ‘bleed’), but verbs may be used as well (sangrar, ‘to bleed’, C0019080). For this reason, state-of-the-art lexicons [14, 36, 49] gather verb terms, and we proceeded similarly in MedLexSp. From a list of medical verbs, we generated conjugated variants by using a Python script and the lexicon of a Spanish part-of-speech tagger [50]: e.g. sangrar (‘to bleed’) \(\rightarrow\) sangra (‘he/she/it bleeds’), sangrando (‘bleeding’), sangrado (‘bled’)... Then, the CUI of each noun term was assigned to the corresponding verb term.
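The CUI assignment step can be sketched as below; in the paper the conjugated forms come from a PoS tagger's lexicon [50], so the small inflection table here is a toy stand-in for that resource.

```python
# Sketch of propagating a noun term's CUI to the related verb lemma and
# its conjugated forms. The inflection table stands in for the lexicon
# of the PoS tagger used in the paper.

verb_forms = {
    "sangrar": ["sangra", "sangrando", "sangrado", "sangran", "sangró"],
}
noun_to_verb = {"sangrado": "sangrar"}   # deverbal noun <-> verb lemma
noun_cuis = {"sangrado": "C0019080"}     # CUI cited in the text

def assign_verb_cuis(noun_to_verb, noun_cuis, verb_forms):
    """Give each verb lemma and all its conjugated forms the CUI of the
    corresponding noun term."""
    entries = {}
    for noun, verb in noun_to_verb.items():
        cui = noun_cuis[noun]
        for form in [verb] + verb_forms.get(verb, []):
            entries[form] = {"lemma": verb, "pos": "V", "cui": cui}
    return entries

lexicon = assign_verb_cuis(noun_to_verb, noun_cuis, verb_forms)
assert lexicon["sangrando"]["cui"] == "C0019080"
assert lexicon["sangrar"]["pos"] == "V"
```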

Derivational variants

By using lists of morphological and semantic variants, we mapped noun terms to adjective variant forms: e.g. hígado, ‘liver’ \(\leftrightarrow\) hepático, ‘hepatic’ (C0023884). We also matched deverbal nouns and verbs (diálisis, ‘dialysis’ \(\leftrightarrow\) dializar, ‘to dialyze’, C4551529). Note that larger lists were collected, but only a subset (801 items) was mapped to UMLS CUIs. The full lists are also released as complementary lexical record (LR) files. The list of deverbal nouns (LR_n_v.dsv) amounts to 535 entries. The list of adjectives derived from nouns (LR_adj_n.dsv) gathers 2366 entries, including morphological variants (e.g. abdomen \(\leftrightarrow\) abdominal) and non-morphologically related pairs (e.g. oncológico, ‘oncological’ \(\leftrightarrow\) cáncer, ‘cancer’).

String distance metrics

We computed string distances [51] of \(\le\) 2 between terms with an available CUI and variant forms not attested in the thesauri. The selected candidates were revised manually to match CUIs to the new variant forms. This procedure was useful for character-level variants (e.g. viriasis \(\leftrightarrow\) viriosis, ‘viral infection’, C0042769), and for hyphenation and tokenization variants (betabloqueante \(\leftrightarrow\) beta-bloqueante \(\leftrightarrow\) beta bloqueante, ‘beta-blocker’, C0001645).
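A minimal sketch of this step, assuming plain Levenshtein distance as the metric (reference [51] may use other string metrics as well), retrieves corpus tokens within edit distance 2 of a term that already has a CUI:

```python
# Sketch of candidate-variant retrieval with edit distance <= 2.
# Candidates would then undergo manual revision, as in the paper.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

known = {"viriasis": "C0042769"}       # term with a CUI (from the text)
corpus_tokens = ["viriosis", "virus", "betabloqueante"]

candidates = [(tok, cui) for term, cui in known.items()
              for tok in corpus_tokens if 0 < levenshtein(term, tok) <= 2]
assert candidates == [("viriosis", "C0042769")]
```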

Syntactic variants of terms

We created variants of multi-word terms in available thesauri. Word order was swapped, and the UMLS Concept Unique Identifier of the original form was assigned to the new variants. The form variants were obtained automatically with a Python script, and then they were revised manually. With this method, for example, the CUI of virus respiratorio sincitial (‘respiratory syncytial virus’, C0035236) was matched to the variant form virus sincitial respiratorio (‘respiratory syncytial virus’).
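A sketch of the variant-generation step, under the assumption that all word-order permutations are produced as candidates and implausible ones are discarded in the manual revision:

```python
# Sketch of word-order variant generation for multi-word terms.
# All permutations are emitted as candidates; in the paper these were
# subsequently revised manually.
from itertools import permutations

def order_variants(term, cui, max_words=4):
    """Return (variant, cui) pairs for every word-order permutation of
    a multi-word term, excluding the original order."""
    words = term.split()
    if not 1 < len(words) <= max_words:
        return []
    return [(" ".join(p), cui)
            for p in permutations(words) if " ".join(p) != term]

variants = order_variants("virus respiratorio sincitial", "C0035236")
assert ("virus sincitial respiratorio", "C0035236") in variants
assert len(variants) == 5  # 3! - 1 permutations
```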

Terms from thesauri, dictionaries and knowledge bases

Health thesauri, knowledge bases, classifications and taxonomies were used to widen the coverage of terms. We collected variants of terms in the base list by means of UMLS CUIs mapped to alternative forms from the following resources:

1

The Anatomical Therapeutic Chemical (ATC) Classification [31]: this is a WHO standard to classify medical drugs according to their therapeutic and pharmacological properties. It comprises five levels, from the system or organ class (e.g. nervous system drugs) to the active ingredient (e.g. diazepam). By including data from the ATC, MedLexSp provides an exhaustive range of medical drug terms.

2

The Dictionary of Medical Terms (DTM) by the Spanish Royal Academy of Medicine [17]: this is the key contribution of this version of MedLexSp. This resource covers both technical words and consumer health terms. Note that the DTM also records frequently misspelled terms (e.g. *kinasa instead of cinasa, ‘kinase’), and MedLexSp also includes some of these misspelled forms. From 40 076 concept entries, we included 30 733 entries (76.7%) that were mapped to UMLS CUIs automatically or manually.

3

The International Classification of Diseases, 10th revision (ICD-10) [34]: the WHO maintains this standard terminology and classification system, which is available in 40 languages for clinical diagnosis and epidemiology. Terms are grouped in subdomains according to the system/organ class (e.g. respiratory system disorders), and the 10th revision is currently the most widely implemented. A subset of terms from the International Classification of Diseases for Oncology (ICD-O-3) was also collected. Terms from both classifications enable an extensive coverage of standard disease-related terms.

4

The International Classification of Primary Care (ICPC) [35]: this is a taxonomy of terms arranged into 17 chapters that group disorders by body system (e.g. digestive, circulatory or neurological conditions, among others). This resource ensures a wide coverage of terms related to primary care.

5

The Spanish version of the Diagnostic and Statistical Manual of Mental Disorders, 5th ed. (DSM-5) [32]: terms were mapped from the English codes in the UMLS using CUIs. This subset of terms in the lexicon covers an adequate range of mental disorders and psychiatric conditions.

6

The Medical Dictionary for Regulatory Activities (MedDRA) [29]: this classification and coding system is aimed at pharmacovigilance. The domain of MedDRA includes signs and symptoms, disorders and diagnostics, tests, labs and procedures, and social or medical history. It is available in 14 languages, and the Spanish translation was used. Thus, MedLexSp includes terms for most adverse events of pharmaceutical drugs. The subset of terms from MedDRA cannot be distributed publicly owing to use restrictions.

7

The Medical Subject Headings (MeSH) [30]: the National Library of Medicine (NLM) maintains and updates this thesaurus with the purpose of indexing and classifying the biomedical literature. The thesaurus is available in several languages; BIREME is responsible for the Spanish translation. Term classes range from anatomy and diseases to chemicals and drugs or analytical, diagnostic and therapeutic techniques, among others. This guarantees a wide coverage of medical subdomains using a terminological standard. MeSH terms were incorporated by means of a license agreement with BIREME.

8

The National Cancer Institute (NCI) Dictionary [52]: this is a comprehensive glossary of cancer-related terms (cancer types, therapeutic and diagnostic procedures, or chemotherapeutic drugs). There is a consumer-oriented version available online, so both technical and lay terms were included.

9

OrphaData [53]: the Orphanet Rare Diseases Ontology (ORDO) is a controlled vocabulary and ontology for rare diseases, and a list of rare disorders mapped to reference terminologies. An XML file is available in several languages, including Spanish, and these data were processed to extract lists of terms and codes. We provide a companion script to extract the data (it could also be used for other languages: e.g. English, French, Italian or Portuguese). This resource provides an extensive coverage of rare diseases.

10

The Spanish Drug Effect database (SDEdb) [54]: this resource gathers terms related to adverse effects obtained from drug package leaflets, medical websites and social media. This database provides both new drug-related terms and lay variants of technical words (e.g. deprimido, ‘depressed’, is more frequently used in consumer social media than depresión, ‘depression’).

11

The Nomenclator [55]: this is a rich database of drug brand names, generic compounds and international non-proprietary medication names prescribed in Spain. Data are available in several file formats, including an XML file.

12

The Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) [27]: a comprehensive nomenclature and ontology covering medical findings, procedures, body structures, pharmaceutical products and qualifiers. It was initially developed by the College of American Pathologists and is currently supported by the International Health Terminology Standards Development Organisation (IHTSDO). It is one of the largest resources and the main terminology for clinical coding worldwide. Because this resource has use restrictions, the subset of terms from SNOMED-CT is not shared.

13

The Online Mendelian Inheritance in Man (OMIM) [56]: Johns Hopkins University maintains this large catalog of genes and genetic diseases. We mapped OMIM data from English terms in the UMLS (using CUIs) and from codes in OrphaData. Since OMIM combines genetic data and descriptions of genetic disorders, including OMIM terms enriches MedLexSp with both types of information.

14

The WHO Adverse Reaction Terminology (WHO-ART) [28]: this dictionary was compiled for pharmacovigilance and is available in several languages. We used the Spanish translation in MedLexSp to include more than 2800 terms related to adverse events.

Terms from domain corpora

First, we extracted terms from 306 Summaries of Product Characteristics (SPCs) in the EasyDLP corpus [57], and from the Spanish versions of MedlinePlus [18] (for consumer health terms of disorders and lab tests). Using these corpora, most drug names and pharmacological substances are represented in MedLexSp.

Second, we used a domain corpus (+4M tokens) [58] to compute frequencies of the terms from MeSH and SNOMED-CT. Because these thesauri are very large, this strategy was applied to select a subset of terms that could be revised in a reasonable time. Namely, a total of 48 188 term entries from SNOMED-CT were revised, and 20 649 term entries from MeSH.
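The frequency-based filter can be sketched as below for mono-word terms; multi-word terms would additionally require n-gram counting, and the toy corpus and term list here are assumptions for illustration.

```python
# Sketch of the frequency-based filter: keep only thesaurus terms
# attested in a domain corpus, so the candidate subset stays small
# enough for manual revision.
from collections import Counter

corpus_tokens = ("paciente con neumonía tratada con antibiótico "
                 "paciente con neumonía grave").split()
counts = Counter(corpus_tokens)

thesaurus_terms = ["neumonía", "glomerulonefritis", "antibiótico"]
attested = [t for t in thesaurus_terms if counts[t] > 0]
assert attested == ["neumonía", "antibiótico"]
```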

Third, we added missing entities that were annotated in recent medical corpora; some of these resources have been used in competitions or shared tasks:

1

The Pharmacological Substances, Compounds and proteins Named Entity Recognition (PharmaCoNER) corpus [21]: this dataset gathers 1000 texts annotated with drug entities and proteins, which were normalized to SNOMED-CT [27] codes. Adding these entities to MedLexSp ensures a large coverage of terms related to pharmacological and biochemical substances.

2

The Clinical Case Coding in Spanish (CODIESP) corpus [19]: 1000 clinical cases published in the scientific literature that were employed in a shared task for coding disorders using the International Classification of Diseases, 10th revision (ICD-10). By incorporating terms from this dataset, most disorders and conditions considered in the ICD-10 classification were added to MedLexSp.

3

The CANcer TExt Mining Shared Task (CANTEMIST) corpus [20]: 3000 annotated clinical cases about cancer used in a shared task for named entity recognition, normalization and coding of tumor morphology using codes of the International Classification of Diseases for Oncology (ICD-O). With this dataset, MedLexSp provides a large typology and coverage of oncological diseases.

4

The Chilean Waiting List Corpus [22]: a collection of medical referrals annotated with semantic entities such as disorders, findings, drugs and procedures. The first version of the corpus was used (900 referrals).

5

The Clinical Trials for Evidence-based Medicine in Spanish (CT-EBM-SP) corpus [23]: this is a collection of 1200 texts related to clinical trial studies published in journals from the SciELO repository [59] and clinical trial announcements from EudraCT [60]. This dataset was employed as a use case, in which MedLexSp was applied to pre-annotate the data with UMLS semantic groups from the health domain, before manual revision (see Use cases section). The CT-EBM-SP resource is normalized to UMLS CUIs, so the inclusion of variant terms into the lexicon was easier. With this corpus, terms related to experimental drugs, interventions and clinical trial methods are represented in MedLexSp.

For the selected terms, we added UMLS CUIs, semantic types and groups, and PoS and morphological data (see Acquiring morphological data of terms section).

Combining a similarity measure and word embeddings

To incorporate new terms related to the COVID-19 pandemic, we tested a complementary approach to state-of-the-art rule-based techniques [61]. We employed a method similar to that applied for terminology expansion using patient blogs and electronic health records [62,63,64]. The experiment was based on: 1) A set of 20 seed words related to the COVID-19 pandemic; 2) An unsupervised approach combining a word embedding model and a similarity metric (the cosine value) to obtain semantically similar new words; 3) A collection of texts (+6M tokens) about the COVID-19 pandemic; and 4) A word embedding model trained on a collection of texts related to the pandemic topic. With this method, the coverage of MedLexSp was expanded with terms not available in the lexicon, but evidenced in a corpus.

As seed words, we used the following terms related to COVID-19: arbidol, camrelizumab, COVID-19, coronavirus, confinamiento (‘lockdown’), cuarentena (‘quarantine’), colchicina (‘colchicine’), danoprevir, EPI (‘Individual Protection Equipment’), EPP (‘Personal Protective Equipment’), hidroxicloroquina (‘hydroxychloroquine’), favipiravir, FFP2, leronlimab, N95, opaganib, remdesivir, SARS-CoV-2, umifenovir, and Wuhan. These terms were selected from COVID-19 glossaries available online [65], or appeared frequently in news media or scientific publications.

The unsupervised approach used the nearest neighbors algorithm by computing semantic similarity values. This similarity was measured by obtaining the word vectors of each seed term and token in several word embedding models, and calculating the cosine similarity (CS) value between vectors:

$$\begin{aligned} similarity = \cos (\vec{A},\vec{B}) = \frac{\vec{A} \cdot \vec{B}}{\Vert \vec{A} \Vert \cdot \Vert \vec{B} \Vert } = \frac{\sum _{i=1}^{n} A_i \cdot B_i}{\sqrt{\sum _{i=1}^{n} (A_i)^2} \cdot \sqrt{\sum _{i=1}^{n} (B_i)^2}} \end{aligned}$$

where \(\vec{A}\) is the vector of the seed term and \(\vec{B}\) is the vector of a word in an embedding model. A cosine similarity of 1 indicates that the token and the term have identical vectors, whereas a value of 0 means that the vectors are completely dissimilar, and so are their meanings. The 50 candidate words with the highest CS values were retrieved for each term. The following is an example for the seed term remdesivir (only showing the first 10 nearest neighbors):

remdesevir     0.8997

veklury     0.7677

veklury®     0.7516

antiviral     0.72

acalabrutinib     0.7145

oseltamivir     0.7143

baricitinib     0.6989

darunavir     0.6949

tofacitinib     0.693

fármaco     0.6855 (‘medical drug’)

The example shows that the nearest neighbors are spelling variants (remdesevir), the brand name of the drug (veklury®), the name of the drug class (antiviral) or other antiviral agents (oseltamivir, darunavir). Note that a depth of 10 nearest neighbors was also tested, but the coverage of new terms was not satisfactory, since most of the 10 nearest neighbors obtained were misspellings or tokenization errors. The procedure involved looking up each out-of-vocabulary nearest neighbor (i.e. tokens not recorded in MedLexSp) by means of a Python script, and checking manually whether the candidate new words were registered in the UMLS.
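The cosine similarity defined above and the out-of-vocabulary filter can be sketched together; the toy 3-dimensional vectors below stand in for the 100-dimensional fastText embeddings, and the values are illustrative only.

```python
# Sketch of cosine-similarity nearest-neighbor retrieval plus OOV
# filtering. Toy 3-d vectors stand in for the real fastText embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

vectors = {
    "remdesivir": [0.9, 0.1, 0.2],
    "remdesevir": [0.88, 0.12, 0.21],  # spelling variant: similar vector
    "veklury":    [0.7, 0.3, 0.3],
    "paciente":   [0.1, 0.9, 0.4],
}
lexicon = {"remdesivir", "paciente"}   # terms already in MedLexSp

def nearest_oov(seed, vectors, lexicon, k=50):
    """Rank the other tokens by cosine similarity to the seed and keep
    the out-of-vocabulary ones as candidate new terms."""
    ranked = sorted(((cosine(vectors[seed], v), w)
                     for w, v in vectors.items() if w != seed),
                    reverse=True)
    return [w for _, w in ranked[:k] if w not in lexicon]

candidates = nearest_oov("remdesivir", vectors, lexicon)
assert candidates[0] == "remdesevir"
assert "paciente" not in candidates   # in-vocabulary items are dropped
```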

We tested word embedding models with different hyperparameters and configurations, and selected those that yielded the best results in terms of recall. First, we tested already-available word embedding models, namely the Spanish Biomedical and Clinical Word Embeddings in fastText [66]. These were trained on a large corpus exceeding 900M tokens, covering resources such as Wikipedia, the SciELO text corpus, texts from EMEA and the Spanish Register of Clinical Trials (REEC), and also a small proportion of COVID-19 clinical cases. We applied different pretrained model variants of 10, 100 and 300 dimensions (cased and uncased), and both architectures featured in fastText [2] (SkipGram and CBOW).

Despite the large volume of data used to train those embeddings, the quality of the nearest neighbors gathered was not satisfactory. Different studies have previously shown that a larger volume of data does not always yield the best results [67,68,69]. For example, the authors of [67] systematically compared general and domain-specific word embeddings for clinical and biomedical information extraction tasks. They did not find a correlation in performance between general and medical or clinical embeddings. Nevertheless, they did observe that word embeddings trained on text sources from local, smaller corpora yielded better results for local or ad hoc tasks. Likewise, the authors of [68] compared fastText and ELMo embeddings [3] trained on general-domain texts and on specialized data for text classification and natural language understanding tasks. Their results were less conclusive: embeddings trained on larger general corpora only yielded higher scores in the text classification task; in the NLU task, the best results were obtained with embeddings trained on smaller but domain-specific data (i.e. electronic health records). Another research team [69] compared publicly available pretrained language models and word vectors for a named entity recognition task (they used several biomedical and general datasets). Their outcomes suggest that word vectors and language models trained on smaller sources (but with content and vocabulary similar to the target task) achieve comparable or higher scores than models trained on larger sources. The impact of corpus size and of general versus domain-specific training texts is an aspect that needs further research.

Our approach for this task followed the assumption that models trained on smaller corpora, with texts more related to our task, would perform better. This is the reason why Spanish texts related to the COVID-19 pandemic were crawled from the Web to train word embeddings. Crawled web sites correspond to repositories of scientific or medical articles (Cochrane, PubMed) or health and research institutions (public information available in the Spanish National Research Council, the Spanish Ministry of Health, several regional health administrations, and in different National Institutes of Health (NIH), such as the National Cancer Institute). Other crawled sites were government drug agencies such as the Spanish Agency of Medicines and Medical Devices, the European Medicines Agency or the Food and Drug Administration. Information from independent agencies or journals was also crawled (e.g. Agencia SINC, The Conversation) in addition to data from Wikipedia. A list of text sources is provided in the companion GitHub repository. To select the sites, we searched on the Internet for COVID-related words and crawled sites ensuring quality content and created or supported by scientists or health experts. For PubMed, we used the following query: ((Spanish[Language]) AND (COVID-19[Title/Abstract])) AND (SARS-CoV-2[Title/Abstract]). The text collection exceeds 6M tokens, but we cannot redistribute it because some content is copyrighted. However, we release the trained embeddings and the source code to replicate our experiments.

Before training the models, texts were normalized (e.g. URLs or non-UTF-8 characters were removed) and white spaces were inserted between each token and punctuation sign (e.g. commas or dots). We used fastText [2] to train vectors of dimension 100 with SkipGram, and experimented with minimum term frequencies of 3 and 5.
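The preprocessing step can be sketched as follows; the regular expressions are simplified examples rather than the exact rules used, and the commented fastText call shows where the cleaned corpus would be consumed.

```python
# Sketch of corpus normalization before embedding training: drop URLs
# and control characters, and pad punctuation with spaces so that
# punctuation marks become separate tokens. Simplified example rules.
import re

def normalize(text):
    text = re.sub(r"https?://\S+", " ", text)          # drop URLs
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)   # drop control chars
    text = re.sub(r"([,.;:()¿?¡!])", r" \1 ", text)    # pad punctuation
    return re.sub(r"\s+", " ", text).strip()

line = "Más información: https://example.org. Vacunas, dosis y pautas."
assert normalize(line) == "Más información : Vacunas , dosis y pautas ."

# The cleaned file could then be fed to fastText, e.g. (not executed here):
# model = fasttext.train_unsupervised("corpus.txt", model="skipgram",
#                                     dim=100, minCount=3)
```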

Results of the semantic similarity approach using word embeddings

With this method, we gathered a total of 222 term entries (491 form variants corresponding to 158 unique CUIs). The best results were obtained with the word embeddings trained on COVID-19-related texts using the SkipGram configuration, 100 dimensions, a minimum token frequency of 3 and a window size of 5. Note that the recall of out-of-vocabulary items was rather large. Table 1 shows that the number of out-of-vocabulary (OOV) items was around 70% of the total nearest neighbors obtained (1000 items: 50 nearest neighbors for each of the 20 seed words). With the word embeddings trained on COVID-19 texts, OOVs ranged from 67.7% (model trained with a minimum token frequency of 5) to 69.0% (model trained with a minimum frequency of 3). However, most of the OOVs were spelling errors (e.g. covd-19), tokenization mistakes or words with hashtags (e.g. #virus). Many OOVs were ATC codes for drugs, pharmaceutical brand names, acronyms of health organizations and emojis (given that many texts come from the web). Only a small subset of OOVs were found in the UMLS and were assigned a CUI. With the best model configuration, 11.3% of the OOVs could be matched to UMLS CUIs.

Table 1 Results of the nearest neighbors (NN) experiments with different word embedding models

As a qualitative analysis of the word embeddings used in the experiments, Fig. 3 shows the t-Distributed Stochastic Neighbor Embedding (t-SNE hereafter) [70] projection of the 100 most frequent words in the corpus. For this figure, we used a SkipGram word embedding model of 100 dimensions (minimum corpus frequency of 5). Stopwords (e.g. prepositions and articles) are not shown. In this figure, specific words related to findings, pathological conditions or body locations tend to appear in the middle to lower left region (marked in the blue area; e.g. infección, ‘infection’; COVID; neumonía, ‘pneumonia’; pulmonar, ‘pulmonary’; opacidades, ‘opacities’). Words related to drugs or procedures (e.g. vacunación, ‘vaccination’; vacuna, ‘vaccine’; dosis, ‘dosage’; medicamentos, ‘drugs’) are shown in the upper left region (marked in the red area). Lastly, words related to medical institutions, professionals or general care tend to occur in the upper region (marked in the green area; e.g. hospital, ‘hospital’; sanidad, ‘healthcare’; profesionales, ‘professionals’). Even though this is a shallow analysis (and only considers mono-word items), it shows that this unsupervised method can cluster words in semantically similar classes according to their position in the vector space.

Fig. 3

t-SNE visualization of the 100 most frequent words in the corpus

These data can be more clearly displayed in Figs. 4 and 5, which show the t-SNE visualization of the 10 most similar terms for the seed terms remdesivir and favipiravir, two antiviral agents that were tested to treat the COVID-19 infection. For these figures, the word embedding model used also features 100 dimensions and a minimum term frequency of 5 (SkipGram configuration).

Fig. 4

t-SNE visualization of the 10 most similar terms of the seed term favipiravir

Fig. 5

t-SNE visualization of the 10 most similar terms of the seed term remdesivir

Acquiring morphological data of terms

After collecting the terms and variants with the methods explained, the last stage involved enriching the lexicon with linguistic information. This morphological information can be used in NLP tasks such as part-of-speech tagging, lemmatization or natural language generation of medical texts. In addition to adding these types of data to mono-word terms, multi-word terms were also considered. In a similar manner to the Specialist Lexicon, multi-word terms were labeled with the category of the head word: e.g. enfermedad de Lyme (‘Lyme disease’) has label N (noun). Different approaches were applied to enrich the lexicon with the part-of-speech category of terms and morphological data of each variant form:

1

Terms registered in the Dictionary of medical terms [
